GMS Medizinische Informatik, Biometrie und Epidemiologie (Mar 2021)

Comparative evaluation of automated information extraction from pathology reports in three German cancer registries

  • Schulz, Stefan,
  • Fix, Sonja,
  • Klügl, Peter,
  • Bachmayer, Tamira,
  • Hartz, Tobias,
  • Richter, Martin,
  • Herm-Stapelberg, Nils,
  • Daumke, Philipp

DOI
https://doi.org/10.3205/mibe000215
Journal volume & issue
Vol. 17, no. 1
p. Doc01

Abstract

Feeding cancer registries with data extracted from textual reports while maintaining a high level of data quality has always been a labour-intensive task because of the heterogeneity of the sources. IT support is expected to accelerate and optimise this process. To this end, the commercial text mining system Averbis Health Discovery was tailored to extract information from free text at the cancer registry of the federal state of Baden-Württemberg. The following entity types were extracted from German-language pathology reports: tumour localisation and morphology, pTNM, grading, (sentinel) nodes examined and affected, laterality and R-class. Depending on the entity type, several machine learning approaches as well as rules were used for the tumour types breast, prostate, colorectal and skin. At the pilot site, F values ranged between 0.800 and 0.996. Values dropped when the extraction pipeline was applied to two new sites (the cancer registries of Rhineland-Palatinate and Lower Saxony): for morphology from 0.950 to 0.657 and 0.933, and for localisation (topography) from 0.902 to 0.675 and 0.768. The differences were much smaller for R-class and lymph node counts. A thorough error analysis revealed numerous issues that explain these differences, such as different workflows between the sites, disagreements between textual and coded content, and different handling of missing values.

Keywords