One of the more difficult challenges that I've come up against in the past day of coding is filtering normalized data structures (namely "Names" from an HL7v3 message). Here is the problem.
What's in a name?
Apparently quite a bit... Most systems are happy with First, Last, maybe middle name. However in HL7v3 (and other systems I imagine) flexibility is the word. So, if you're named John Jacob Jingleheimer-Schmidt HL7v3 messaging can portray that. However any receiving system also needs to be able to store this data and query it. So if you register a record as John Jacob Jingleheimer-Schmidt and someone searches for John Jingleheimer-Schmidtor
John Schmidt that record needs to be returned.
Enter HL7v3 EN names:
<name>
<prefix>Mr.</prefix>
<given>John</given>
<given>Jacob</given>
<family>Jingleheimer</family>
<delimiter>-</delimiter>
<family>Schmidt</family>
</name>
Yes, they look great, and they can in fact capture all elements of a name that could possibly exist, however in a relational database they're a nightmare, especially for querying.
Storing the ENStoring the EN is easy enough. You just have to keep in mind that a name is really a container for name parts. With that in mind, we can create complex names through correlating individual tuples in a database. In the PostgreSHR project I'm working on right now, we store them in a table that looks a lot like this:
Name Part ID | Name Part Value | Name Part Type | Correlation Id |
1 | Mr | PFX | 1 |
2 | John | GIV | 1 |
3 | Jacob | GIV | 1 |
4 | Jingleheimer | FAM | 1 |
5 | - | DEL | 1 |
6 | Schmidt | FAM | 1 |
The correlation ID can be used when de-serializing the name components back into a name container. This particular relational table works well for storingthe
name data, however (as you can imagine) it is quite difficult to query with SQL.
The concept behind the queryRecently I've come up with a nifty solution for querying this data. It's actually quite simple and works really well (tried it on a 10,000 tuple set). Basically, you iterate over the parts of the name you're looking for (your needle) and intersect each result set with the previous. So, let's say we have a database with four people: John Smith, Sally Smith, Billy Jim Smith and Sarah Sally Smith. Our name component table will look like this:
Name Part ID | Name Part Value | Name Part Type | Correlation Id |
1 | John | GIV | 1 |
2 | Smith | FAM | 1 |
3 | Sally | GIV | 2 |
4 | Smith | FAM | 2 |
5 | Billy | GIV | 3 |
6 | Jim | GIV | 3 |
7 | Smith | FAM | 3 |
8 | Sarah | GIV | 4 |
9 | Sally | GIV | 4 |
10 | Smith | FAM | 4 |
And let's pretend we want to query for Smith, Sarah. We know that our result should be Sarah Sally Smith as that is the most appropriate match. We can use this algorithm to find all the parts:
SELECT correlation FROM name table INTO tResult1 WHERE name part = filters[0].name part AND name value = filters[0].name value;FOR EACH filter IN filters
SELECT correlation FROM name table INTO tResult2 WHERE name part = filter.name part AND name value = filter.name value
INTERSECT ALL
SELECT * FROM tResult1;
DELETE FROM tResult1;
INSERT INTO tResult1 SELECT * FROM tResult2;
DELETE FROM tResult2;
END FOR EACH
RETURN tResult1;
Let's trace through each iteration:
Iteration | Filters | Filter | tResult1 | tResult2 |
0 | { FAM:Smith, GIV:Sarah } | { 1, 2, 3, 4 } | ||
1 | FAM:Smith | { 1, 2, 3, 4 } | ||
2 | GIV: Sarah | { 1, 2, 3, 4 } | { 4 } | |
3 | { 4 } |
The Actual Code
The algorithm is easy enough, but what about the PL/PGSQL code for PostgreSQL. Well, after thinking for quite a while, I came up with this:
CREATE OR REPLACE FUNCTION FIND_NAME_SET( NAME_CMP_VAL_IN IN VARCHAR[],
NAME_CMP_CLS_IN IN VARCHAR[]
) RETURNS TABLE (NAME_SET_ID DECIMAL(20,0)) AS
$$
DECLARE
CNCMP DECIMAL;
BEGIN
-- CREATE TEMP TBLS
IF((SELECT COUNT(*) FROM information_schema.tables WHERE table_name = 'tmp_name_res_tbl') = 1) THEN
DROP TABLE TMP_NAME_RES_TBL;
DROP TABLE TMP_NAME_RET_TBL;
END IF;
CREATE LOCAL TEMPORARY TABLE TMP_NAME_RES_TBL (
NAME_SET_ID DECIMAL(20,0)
) ON COMMIT DROP;
CREATE LOCAL TEMPORARY TABLE TMP_NAME_RET_TBL (
NAME_SET_ID DECIMAL(20,0)
) ON COMMIT DROP;
-- FILTER THE FIRST SET
INSERT INTO TMP_NAME_RET_TBL
SELECT DISTINCT A.NAME_SET_ID
FROM NAME_CMP_TBL A
WHERE
NAME_CMP_VALUE = NAME_CMP_VAL_IN[1]
AND
NAME_CMP_CLS = NAME_CMP_CLS_IN[1];
-- SUBSEQUENT
IF(ARRAY_LOWER(NAME_CMP_VAL_IN, 1) < ARRAY_UPPER(NAME_CMP_VAL_IN, 1)) THEN
FOR CNCMP IN ARRAY_LOWER(NAME_CMP_VAL_IN, 1) + 1 .. ARRAY_UPPER(NAME_CMP_VAL_IN, 1) LOOP
INSERT INTO TMP_NAME_RES_TBL
SELECT DISTINCT A.NAME_SET_ID
FROM NAME_CMP_TBL A
WHERE
NAME_CMP_VALUE = NAME_CMP_VAL_IN[CNCMP]
AND
NAME_CMP_CLS = NAME_CMP_CLS_IN[CNCMP]
INTERSECT ALL
SELECT A.NAME_SET_ID FROM TMP_NAME_RET_TBL A;
-- ADD RESULTS
DELETE FROM TMP_NAME_RET_TBL;
INSERT INTO TMP_NAME_RET_TBL
SELECT *
FROM TMP_NAME_RES_TBL;
DELETE FROM TMP_NAME_RES_TBL;
END LOOP;
END IF;
RETURN QUERY SELECT * FROM TMP_NAME_RET_TBL;
END;
$$ LANGUAGE plpgsql;
Notice that I create the TEMPORARY tables TMP_NAME_RES_TBL and TMP_NAME_RET_TBL as my tResult1 and tResult2 tables. Also, I'm using PostgreSQL's Array structures to facilitate passing in the name components and name classifications. Not exactly the best solution but there is some method to this madness, namely that it is easier to call this function via:
SELECT * FROM FIND_NAME_SET(ARRAY['Smith','Sally'], ARRAY['FAM','GIV']);
Hope that helps anyone who has to query normalized data using an de-normalized set of criteria!
No comments:
Post a Comment