Austrian Journal of Statistics (Apr 2016)
Methodology and Applications of Building a National File of Health and Mortality Data
Abstract
National collections of historical administrative and other health data can number hundreds of millions of records, with new data being added at the rate of tens of millions of records each year. Although improvements in computing and storage technology have to some extent kept pace with this accelerating growth in the datasets, there has been little development over the past few decades in the way probabilistic record linkage is undertaken, particularly in respect of the match acceptance thresholds and the clerical review processes required to make decisions about doubtful matches. This paper describes the major features of the Oxford Record Linkage Study (ORLS), the developments in probabilistic matching methods, and the use of intelligent and data mining methodologies to select potential links between pairs of records. The ORLS linked file was developed using a collection of linkable abstracts that cover a health region in the United Kingdom. The ORLS file contains 12 million records for 6 million people and spans 39 years. This dataset is used for the preparation of person-linked health services statistics, and for epidemiological and health services research. The policy of the ORLS is to link all the records comprehensively rather than prepare links on an ad hoc basis. The ORLS has been developing improved techniques for deterministic and probabilistic linkage, and algorithms for reducing the amount of clerical review, which is time-consuming, expensive, and of variable quality. The methodology has been extended and refined for matching and linking other large UK government datasets, in particular the National Health Service Central Register (60+ million records), a number of disease and local authority registers, and, more recently, for the development of a UK National File of Linked Hospital Episode Statistics and Mortality data.
This file spans 4 years, currently holds 52 million records, and will increase by 14 million records per annum. Since the implementation of the Data Protection Act (1998) in the UK, all names and addresses have been stripped from the health files. Matching and linkage are undertaken using the national NHS number and other partial identifiers. The matching methodology described in this paper is for linking such datasets using various combinations of the partial identifiers.
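
As a rough illustration of how probabilistic matching on partial identifiers, acceptance thresholds, and clerical review fit together, the sketch below scores a pair of records with a generic Fellegi-Sunter style agreement weight. It is not the ORLS algorithm: the field names, the m/u probabilities, the two thresholds, and the function names `match_weight` and `classify` are all hypothetical values and names chosen for the example.

```python
import math

# Hypothetical per-field probabilities (illustrative only, not from the paper):
#   m = P(field agrees | both records belong to the same person)
#   u = P(field agrees | records belong to different people)
FIELD_PARAMS = {
    "nhs_number": (0.98, 0.0001),
    "date_of_birth": (0.97, 0.003),
    "sex": (0.99, 0.5),
    "postcode": (0.85, 0.001),
}


def match_weight(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum log2 likelihood ratios over the partial identifiers present in both records."""
    total = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify(weight, lower=0.0, upper=15.0):
    """Apply two assumed thresholds; pairs between them go to clerical review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"


# Usage: one record compared with itself, with a partial match, and with a stranger.
a = {"nhs_number": "A1", "date_of_birth": "1950-01-01", "sex": "F", "postcode": "OX1"}
partial = {"nhs_number": None, "date_of_birth": "1950-01-01", "sex": "F", "postcode": "OX2"}
other = {"nhs_number": "B2", "date_of_birth": "1971-06-30", "sex": "M", "postcode": "LS1"}

for candidate in (a, partial, other):
    print(classify(match_weight(a, candidate)))
```

In this sketch the width of the band between `lower` and `upper` determines how many pairs are sent for clerical review, which is the cost the abstract describes as time-consuming and of variable quality.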