PLOS Digital Health (Aug 2022)

DrNote: An open medical annotation service

  • Johann Frei,
  • Iñaki Soto-Rey,
  • Frank Kramer

Journal volume & issue
Vol. 1, no. 8

Abstract


In the context of clinical trials and medical research, medical text mining can provide broader insights for various research scenarios by tapping additional text data sources and extracting relevant information that is often available only in unstructured form. Although various works exist for English-language data such as electronic health reports, only limited work has been published on tools for non-English text resources that offer immediate practicality in terms of flexibility and initial setup. We introduce DrNote, an open source text annotation service for medical text processing. Our work provides an entire annotation pipeline with a focus on a fast yet effective and easy-to-use software implementation. Further, the software allows its users to define a custom annotation scope by filtering for only the relevant entities that should be included in its knowledge base. The approach is based on OpenTapioca and combines the publicly available datasets from WikiData and Wikipedia to perform entity linking tasks. In contrast to related work, our service can easily be built upon any language-specific Wikipedia dataset in order to be trained on a specific target language. We provide a public demo instance of our DrNote annotation service at https://drnote.misit-augsburg.de/.

Author summary

Since much highly relevant information in healthcare and clinical research is stored exclusively as unstructured text, retrieving and processing such data poses a major challenge. Novel data-driven text processing methods require large amounts of annotated data in order to exceed the performance of non-data-driven methods. In the medical domain, such data is not publicly available, and access is restricted by federal privacy regulations. We circumvent this issue by developing an annotation pipeline that works on sparse data and retrieves the training data from publicly available data sources. The fully automated pipeline can be easily adapted by third parties for custom use cases or directly applied within minutes for medical use cases. It significantly lowers the barrier for fast analysis of unstructured clinical text data in certain scenarios.