Text content of 13. RECORD LINKAGE SYSTEM.pdf
PHARMD GURU

FLOW OF INFORMATION IN RLS:

STANDARDIZATION:
Every data set contains manual entry errors, inconsistent abbreviations and similar variations, which can make records for the same entity look like separate records.
First step: clean and standardize the data.
Ex: For input data belonging to Mr. William Marcus Smith, entries could have been made by different individuals as:
Smith W. M.
William M. Smith
W.M. Smith
W.M. Smithe, etc.

BLOCKING:
Blocking reduces the search space (i.e. the number of record pairs to be compared) by grouping similar records together into blocks or clusters.
The data sets are split into smaller blocks, and only records within the same block are compared.
Ex: Instead of making detailed comparisons of all 90 billion pairs (300,000 x 300,000) from two lists of 300,000 records representing all businesses in a State of the U.S., it may be sufficient to consider the set of 30 million pairs that agree on U.S. Postal ZIP code.

MATCHING:
1) Exact Matching: Linkage of data for the same unit (e.g., establishment) from different files.
Uses identifiers such as name, address, or tax unit number.
2) Statistical Matching: Attempts to link files that may have few units in common. Linkages are based on similar characteristics rather than unique identifying information.

REQUIREMENTS FOR DEFINING AN RLS:
The types of linkages required, and whether linkage is performed in batch and/or interactive mode.
The security provisions for confidential data files.
The speed of operation needed.
The volume of records that can be linked with the system.
The initial cost of the software, including licensing and maintenance costs.
Whether the software is bundled with other software packages.
The simplicity and flexibility in defining the rules used for linkages.
The accuracy and statistical defensibility of the product.
The availability of documentation and training.
The maintenance and support of the software.

USES:
The system is used to improve data quality and coverage, for long-term medical follow-up of cohorts, for creating patient-oriented rather than event-oriented data, for building new data sources, and for a range of other statistical purposes.
It helps create a statistically relevant source of 'new' information.
It answers research questions relating to genetics, occupational and environmental health, and medical research.

DRAWBACKS:
Issues of privacy and confidentiality.
Policies for conducting studies using such systems must be transparent.

APPLICATIONS:
Duplication in data is minimized.
Powerful tool for generating more value out of existing databases.
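
The standardization, blocking and exact matching steps described under FLOW OF INFORMATION IN RLS can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record fields (`id`, `name`, `zip`), the sample records, the helper names, and the simple "surname plus initials" rule, which is not a method taken from these notes. A real system would use far more robust name parsing.

```python
import re
from collections import defaultdict
from itertools import product


def standardize(name):
    """Reduce a free-text name to 'surname initials' (hypothetical rule)."""
    # Lowercase, replace punctuation with spaces, split into tokens.
    tokens = re.sub(r"[^a-z ]", " ", name.lower()).split()
    if not tokens:
        return ""
    rest = tokens[1:]
    # "Smith W. M." style: a full first token followed only by initials.
    if len(tokens[0]) > 1 and rest and all(len(t) == 1 for t in rest):
        surname, given = tokens[0], rest
    else:
        # "William M. Smith" or "W.M. Smith" style: surname comes last.
        surname, given = tokens[-1], tokens[:-1]
    initials = "".join(t[0] for t in given)
    return f"{surname} {initials}".strip()


def block_by(records, key):
    """Group records into blocks that share the same value of `key`."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks


def candidate_pairs(file_a, file_b, key="zip"):
    """Yield only the record pairs that fall in the same block."""
    blocks_a, blocks_b = block_by(file_a, key), block_by(file_b, key)
    for value in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[value], blocks_b[value])


def exact_matches(file_a, file_b):
    """Link records whose standardized names agree exactly within a block."""
    return sorted(
        (a["id"], b["id"])
        for a, b in candidate_pairs(file_a, file_b)
        if standardize(a["name"]) == standardize(b["name"])
    )


# Hypothetical sample files, invented for this sketch.
file_a = [
    {"id": "A1", "name": "Smith W. M.", "zip": "10001"},
    {"id": "A2", "name": "Jane Doe", "zip": "20002"},
]
file_b = [
    {"id": "B1", "name": "William M. Smith", "zip": "10001"},
    {"id": "B2", "name": "W.M. Smithe", "zip": "10001"},
    {"id": "B3", "name": "J. Doe", "zip": "20002"},
]

print(exact_matches(file_a, file_b))
```

In this sketch, blocking on ZIP code leaves 3 candidate pairs instead of the 6 in the full cross product, the same idea as going from 300,000 x 300,000 = 90 billion pairs to 30 million. The misspelled "W.M. Smithe" still standardizes to a different key than "Smith", so exact matching does not link it; standardization alone does not repair spelling errors.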