Friday, February 4, 2011

Filtering Normalized Data from a PostgreSQL Table


One of the more difficult challenges that I've come up against in the past day of coding is filtering normalized data structures (namely "Names" from an HL7v3 message). Here is the problem.

What's in a name?

Apparently quite a bit... Most systems are happy with First, Last, maybe middle name. However in HL7v3 (and other systems I imagine) flexibility is the word. So, if you're named John Jacob Jingleheimer-Schmidt HL7v3 messaging can portray that. However any receiving system also needs to be able to store this data and query it. So if you register a record as John Jacob Jingleheimer-Schmidt and someone searches for John Jingleheimer-Schmidtor
John Schmidt
that record needs to be returned.

Enter HL7v3 EN names:
<name>
    <prefix>Mr.</prefix>
    <given>John</given>
    <given>Jacob</given>
    <family>Jingleheimer</family>
    <delimiter>-</delimiter>
    <family>Schmidt</family>
</name>

Yes, they look great, and they can in fact capture all elements of a name that could possibly exist, however in a relational database they're a nightmare, especially for querying.

Storing the ENStoring the EN is easy enough. You just have to keep in mind that a name is really a container for name parts. With that in mind, we can create complex names through correlating individual tuples in a database. In the PostgreSHR project I'm working on right now, we store them in a table that looks a lot like this:
Name Part IDName Part ValueName Part TypeCorrelation Id
1MrPFX1
2JohnGIV1
3JacobGIV1
4JingleheimerFAM1
5-DEL1
6SchmidtFAM1


The correlation ID can be used when de-serializing the name components back into a name container. This particular relational table works well for storingthe
name data, however (as you can imagine) it is quite difficult to query with SQL.

The concept behind the queryRecently I've come up with a nifty solution for querying this data. It's actually quite simple and works really well (tried it on a 10,000 tuple set). Basically, you iterate over the parts of the name you're looking for (your needle) and intersect each result set with the previous. So, let's say we have a database with four people: John Smith, Sally Smith, Billy Jim Smith and Sarah Sally Smith. Our name component table will look like this:
Name Part IDName Part ValueName Part TypeCorrelation Id
1JohnGIV1
2SmithFAM1
3SallyGIV2
4SmithFAM2
5BillyGIV3
6JimGIV3
7SmithFAM3
8SarahGIV4
9SallyGIV4
10SmithFAM4


And let's pretend we want to query for Smith, Sarah. We know that our result should be Sarah Sally Smith as that is the most appropriate match. We can use this algorithm to find all the parts:

SELECT correlation FROM name table INTO tResult1 WHERE name part = filters[0].name part AND name value = filters[0].name value;FOR EACH filter IN filters
SELECT correlation FROM name table INTO tResult2 WHERE name part = filter.name part AND name value = filter.name value
    INTERSECT ALL
SELECT * FROM tResult1;
DELETE FROM tResult1;
INSERT INTO tResult1 SELECT * FROM tResult2;
DELETE FROM tResult2;
END FOR EACH
RETURN tResult1;

Let's trace through each iteration:
IterationFiltersFiltertResult1tResult2
0{ FAM:Smith, GIV:Sarah }{ 1, 2, 3, 4 }
1FAM:Smith{ 1, 2, 3, 4 }
2GIV: Sarah{ 1, 2, 3, 4 }{ 4 }
3{ 4 }


The Actual Code

The algorithm is easy enough, but what about the PL/PGSQL code for PostgreSQL. Well, after thinking for quite a while, I came up with this:

CREATE OR REPLACE FUNCTION FIND_NAME_SET(    NAME_CMP_VAL_IN    IN VARCHAR[],
    NAME_CMP_CLS_IN IN VARCHAR[]
) RETURNS TABLE (NAME_SET_ID DECIMAL(20,0)) AS
$$
DECLARE
    CNCMP    DECIMAL;
BEGIN


    -- CREATE TEMP TBLS
    IF((SELECT COUNT(*) FROM information_schema.tables WHERE table_name = 'tmp_name_res_tbl') = 1) THEN
        DROP TABLE TMP_NAME_RES_TBL;
        DROP TABLE TMP_NAME_RET_TBL;
    END IF;
    
    CREATE LOCAL TEMPORARY TABLE TMP_NAME_RES_TBL (
        NAME_SET_ID    DECIMAL(20,0)
    ) ON COMMIT DROP;
    CREATE LOCAL TEMPORARY TABLE TMP_NAME_RET_TBL (
        NAME_SET_ID    DECIMAL(20,0)
    ) ON COMMIT DROP;
    
    -- FILTER THE FIRST SET
    INSERT INTO TMP_NAME_RET_TBL
        SELECT DISTINCT A.NAME_SET_ID
        FROM NAME_CMP_TBL A
        WHERE
            NAME_CMP_VALUE = NAME_CMP_VAL_IN[1]
        AND
            NAME_CMP_CLS = NAME_CMP_CLS_IN[1];


    -- SUBSEQUENT
    IF(ARRAY_LOWER(NAME_CMP_VAL_IN, 1) < ARRAY_UPPER(NAME_CMP_VAL_IN, 1)) THEN
        FOR CNCMP IN ARRAY_LOWER(NAME_CMP_VAL_IN, 1) + 1 .. ARRAY_UPPER(NAME_CMP_VAL_IN, 1) LOOP
            INSERT INTO TMP_NAME_RES_TBL
                SELECT DISTINCT A.NAME_SET_ID
                FROM NAME_CMP_TBL A
                WHERE
                    NAME_CMP_VALUE = NAME_CMP_VAL_IN[CNCMP]
                AND
                    NAME_CMP_CLS = NAME_CMP_CLS_IN[CNCMP]
            INTERSECT ALL
                SELECT A.NAME_SET_ID FROM TMP_NAME_RET_TBL A;


            -- ADD RESULTS
            DELETE FROM TMP_NAME_RET_TBL;
            INSERT INTO TMP_NAME_RET_TBL
                SELECT *
                FROM TMP_NAME_RES_TBL;
            DELETE FROM TMP_NAME_RES_TBL;
        END LOOP;
    END IF;
    
    RETURN QUERY SELECT * FROM TMP_NAME_RET_TBL;
END;
$$ LANGUAGE plpgsql;

Notice that I create the TEMPORARY tables TMP_NAME_RES_TBL and TMP_NAME_RET_TBL as my tResult1 and tResult2 tables. Also, I'm using PostgreSQL's Array structures to facilitate passing in the name components and name classifications. Not exactly the best solution but there is some method to this madness, namely that it is easier to call this function via:

SELECT * FROM FIND_NAME_SET(ARRAY['Smith','Sally'], ARRAY['FAM','GIV']);

Hope that helps anyone who has to query normalized data using an de-normalized set of criteria!

No comments:

Post a Comment