Automatic identification of variables in epidemiological datasets using logic regression

Matthias W. Lorenz; Negin Ashtiani Abdi; Frank Scheckenbach; Anja Pflug; Alpaslan Bülbül; Alberico L. Catapano; Stefan Agewall; Marat Ezhov; Michiel L. Bots; Stefan Kiechl; Andreas Orth; on behalf of the PROG-IMT study group

doi:10.1186/s12911-017-0429-1

BMC Medical Informatics and Decision Making (Apr 2017)

Affiliations

Matthias W. Lorenz: Department of Neurology, University Clinic Frankfurt
Negin Ashtiani Abdi: Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences
Frank Scheckenbach: Department of Neurology, University Clinic Frankfurt
Anja Pflug: Department of Neurology, University Clinic Frankfurt
Alpaslan Bülbül: Department of Neurology, University Clinic Frankfurt
Alberico L. Catapano: IRCCS Multimedica
Stefan Agewall: Institute of Clinical Sciences, University of Oslo
Marat Ezhov: Atherosclerosis Department, Cardiology Research Center
Michiel L. Bots: University Medical Center Utrecht
Stefan Kiechl: Department of Neurology, Medical University Innsbruck
Andreas Orth: Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences

DOI: https://doi.org/10.1186/s12911-017-0429-1
Journal volume & issue: Vol. 17, no. 1
pp. 1 – 11

Abstract


Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve the data quality. For semi-automation, high sensitivity in the recognition of matching variables is particularly important: it allows software to present, for each target variable, a choice of candidate source variables from which a user selects the matching one, with only a low risk that the correct source variable has been missed.

Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal combinations of rules were validated.

Results: In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less in 63% of all variables in the construction sample, and in 71% of all variables in the validation sample.

Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
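The approach described in the Methods can be illustrated with a small sketch: hand-written rules are evaluated on candidate source variables, a Boolean combination of the rules is chosen, and PPV and NPV are computed for it. Everything below is an assumption made for illustration, not the authors' implementation: the candidate variables, the rule names, and the toy target (systolic blood pressure) are hypothetical, and the exhaustive search over three-rule expressions is a simple stand-in for logic regression, not the search procedure used in the paper.

```python
from itertools import permutations, product

# Hypothetical candidate source variables for one target variable
# (systolic blood pressure): (name, is_numeric, is_true_match).
candidates = [
    ("sbp_baseline", True, True),
    ("SYS_BP", True, True),
    ("dbp_baseline", True, False),
    ("bp_comment", False, False),
    ("systolic_note", False, False),
    ("age", True, False),
]

# Hypothetical hand-written rules; each maps a candidate to True/False.
rules = {
    "name_has_sbp": lambda name, numeric: "sbp" in name.lower(),
    "name_has_sys": lambda name, numeric: "sys" in name.lower(),
    "is_numeric": lambda name, numeric: numeric,
}

# Evaluate every rule once per candidate.
rows = [{r: f(name, numeric) for r, f in rules.items()}
        for name, numeric, _ in candidates]
truth = [is_match for _, _, is_match in candidates]

OPS = {"and": lambda x, y: x and y, "or": lambda x, y: x or y}


def evaluate(expr, row):
    """Evaluate an expression of the form (a op1 b) op2 c on one row."""
    a, op1, b, op2, c = expr
    return OPS[op2](OPS[op1](row[a], row[b]), row[c])


def count_errors(expr):
    """Number of candidates the expression misclassifies."""
    return sum(evaluate(expr, row) != t for row, t in zip(rows, truth))


def predictive_values(predicted, actual):
    """Return (PPV, NPV); a value is None when its denominator is zero."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv


# Brute-force search over all (a op1 b) op2 c expressions.
expressions = [
    (a, op1, b, op2, c)
    for a, b, c in permutations(rules, 3)
    for op1, op2 in product(OPS, repeat=2)
]
best_expr = min(expressions, key=count_errors)
best_errors = count_errors(best_expr)

predicted = [evaluate(best_expr, row) for row in rows]
ppv, npv = predictive_values(predicted, truth)

a, op1, b, op2, c = best_expr
print(f"best combination: ({a} {op1} {b}) {op2} {c}")
print(f"errors={best_errors}, PPV={ppv}, NPV={npv}")
```

In a setting like the one the abstract describes, the combination would be selected on a construction subset and its PPV and NPV then recomputed on a separate validation subset; the toy data here are far too small for that split to be meaningful.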

Published in BMC Medical Informatics and Decision Making

ISSN: 1472-6947 (Online)
Publisher: BMC
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://bmcmedinformdecismak.biomedcentral.com
