Linking Occupational Injury and Illness Databases
Problem
The federal government wanted to link information from one worker-injury database to another.
The U.S. Bureau of Labor Statistics (BLS) wanted to use data from the Occupational Safety and Health Administration’s Injury Tracking Application (ITA) to complete items on the Survey of Occupational Injuries and Illnesses (SOII). The goal was to reduce the burden on SOII participants and improve the accuracy of the derived data elements.
Solution
NORC developed a methodology for linking worker-injury records across the two databases.
NORC evaluated and standardized data elements across the files using a geocoding routine we developed. We based the routine on comparisons to the U.S. Census Bureau’s TIGER database.
Another critical element of this analysis was the comparison of organization names. The same name can be transcribed in a database in several ways, so we standardized the name elements to produce an effective comparison. Because certain tokens occur commonly in organization names, we weighted each token by the inverse of its frequency, so that rare, distinctive tokens counted more toward agreement than common ones.
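The name comparison described above can be illustrated with a minimal sketch. It is not NORC's routine: the suffix list, the log inverse-frequency weight, and the weighted-overlap score are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical list of legal-form suffixes dropped during standardization.
SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}

def standardize(name):
    """Lowercase, replace punctuation with spaces, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return [t for t in tokens if t not in SUFFIXES]

def token_weights(names):
    """Weight each token by log inverse frequency across the name list.

    Tokens that appear in many names get low weights; rare tokens get high ones.
    """
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(standardize(name)))
    n = len(names)
    return {tok: math.log(n / df) + 1.0 for tok, df in doc_freq.items()}

def weighted_similarity(a, b, weights, default=1.0):
    """Weight of shared tokens divided by weight of all tokens in either name."""
    ta, tb = set(standardize(a)), set(standardize(b))
    union = ta | tb
    if not union:
        return 0.0
    shared = sum(weights.get(t, default) for t in ta & tb)
    total = sum(weights.get(t, default) for t in union)
    return shared / total

# Made-up example names, not drawn from ITA or SOII.
names = ["Acme Services Inc", "Zenith Services LLC", "Acme Supply Co", "Delta Services"]
weights = token_weights(names)
same = weighted_similarity("ACME Services, Inc.", "Acme Services", weights)
common_only = weighted_similarity("Acme Services", "Zenith Services", weights)
rare_shared = weighted_similarity("Acme Services", "Acme Supply", weights)
```

In this toy list, "services" appears in three of four names, so two names that share only that token score lower than two names that share the less common "acme".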
To integrate the results of the geocoding and organization name comparisons, we built the record linkage process on the Fellegi-Sunter paradigm. We estimated M and U probabilities (the probabilities of field agreement among matched and unmatched pairs, respectively) using a custom-designed machine learning routine that minimized the distance between the expected and actual frequencies of each agreement pattern (i.e., each combination of agreement outcomes across the comparison variables). This linkage process assigned each pair an estimated match probability. We conducted the analysis over several blocking passes and summarized the results for each pair. Pairs with a sufficiently high estimated match probability were retained for the delivered database.
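As a rough illustration of how M and U probabilities turn an agreement pattern into a match probability, the sketch below uses the textbook Fellegi-Sunter calculation. It assumes the comparison fields are conditionally independent and that the M, U, and prior values are already known; the numbers are invented, and the estimation routine described above is not reproduced here.

```python
import math

def match_weight(agreements, m, u):
    """Sum the log2 likelihood ratios over all comparison fields."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total

def match_probability(agreements, m, u, prior):
    """Posterior match probability: prior odds times the likelihood ratio."""
    odds = (prior / (1 - prior)) * 2 ** match_weight(agreements, m, u)
    return odds / (1 + odds)

# Illustrative values only (assumptions, not estimates from the project).
m = {"name": 0.9, "address": 0.8}    # P(field agrees | pair is a match)
u = {"name": 0.1, "address": 0.05}   # P(field agrees | pair is not a match)
prior = 0.01                         # assumed share of candidate pairs that match

p_both = match_probability({"name": True, "address": True}, m, u, prior)
p_none = match_probability({"name": False, "address": False}, m, u, prior)
```

With these inputs, agreement on both fields multiplies the prior odds by 9 × 16 = 144, which lifts a 1 percent prior to a probability of roughly 0.59, while disagreement on both fields pushes it well below the prior.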
For the organization name comparison, we developed our own similarity measure. To evaluate links, we wrote our own method and code that fit the model to the observed frequencies of the comparison agreement patterns and assigned each link a probability of being a valid match. This model allowed interactions among the comparison variables to be considered in the probability estimates. We used a modified version of SAS’s PROC GEOCODE to standardize addresses for use in linkage.
Result
NORC produced a file of likely matches between the databases.
The linkage analysis results included a data file showing, for each SOII record, the likely or potential matching record from ITA (i.e., the record most likely to be the true match). For each matched pair on the file, we provided an estimated probability of true match status (i.e., that the two records represent the same organization). We expect BLS to use this file by choosing a probability threshold and accepting all links above it as pairs highly likely to match. We also delivered data file documentation and a report summarizing the development process to BLS.
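One way such a file could be consumed is sketched below, assuming each row carries a SOII identifier, an ITA identifier, and the estimated match probability. The identifiers, the function name, and the 0.9 threshold are hypothetical; the delivered file layout is not described here.

```python
def best_links(pairs, threshold):
    """Keep the highest-probability ITA candidate per SOII record,
    then accept only the links at or above the chosen threshold."""
    best = {}
    for soii_id, ita_id, prob in pairs:
        if soii_id not in best or prob > best[soii_id][1]:
            best[soii_id] = (ita_id, prob)
    return {s: link for s, link in best.items() if link[1] >= threshold}

# Made-up candidate pairs: (SOII id, ITA id, estimated match probability).
pairs = [
    ("S1", "I7", 0.97),
    ("S1", "I9", 0.40),
    ("S2", "I3", 0.62),
    ("S3", "I5", 0.91),
]
accepted = best_links(pairs, threshold=0.9)
```

Here S1 and S3 are accepted with their best candidates, and S2 is left unlinked because its best candidate falls below the threshold.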
Project Leads
Stephen Cohen
Senior Fellow, Project Director
Dean Resnick
Principal Data Scientist, Principal Investigator