Image Text Extraction and Natural Language Processing of Unstructured Data from Medical Reports

Ivan Malashin; Igor Masich; Vadim Tynchenko; Andrei Gantimurov; Vladimir Nelyub; Aleksei Borodulin

doi:10.3390/make6020064

Machine Learning and Knowledge Extraction (Jun 2024)

Affiliations

Ivan Malashin: Artificial Intelligence Technology Scientific and Education Center, Bauman Moscow State Technical University, 105005 Moscow, Russia
Igor Masich: Artificial Intelligence Technology Scientific and Education Center, Bauman Moscow State Technical University, 105005 Moscow, Russia
Vadim Tynchenko: Artificial Intelligence Technology Scientific and Education Center, Bauman Moscow State Technical University, 105005 Moscow, Russia
Andrei Gantimurov: Artificial Intelligence Technology Scientific and Education Center, Bauman Moscow State Technical University, 105005 Moscow, Russia
Vladimir Nelyub: Artificial Intelligence Technology Scientific and Education Center, Bauman Moscow State Technical University, 105005 Moscow, Russia
Aleksei Borodulin: Artificial Intelligence Technology Scientific and Education Center, Bauman Moscow State Technical University, 105005 Moscow, Russia

DOI: https://doi.org/10.3390/make6020064
Journal volume & issue: Vol. 6, no. 2
pp. 1361 – 1377

Abstract

This study presents an integrated approach for automatically extracting and structuring information from medical reports captured as scanned documents or photographs, combining image recognition with natural language processing (NLP) techniques such as named entity recognition (NER). The primary aim was to develop an adaptive model for efficient text extraction from medical report images. A genetic algorithm (GA) was used to fine-tune optical character recognition (OCR) hyperparameters so as to maximize the length of the extracted text. NER processing then categorized the extracted information into the required entities, and parameters were adjusted whenever entities were not extracted correctly according to manual annotations. The medical report images in the dataset vary in format and are all in Russian; the work nonetheless serves as a conceptual example of information extraction (IE) that can be easily extended to other languages.
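
The GA-based tuning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hyperparameter names (`scale`, `threshold`, `blur`), their ranges, and the `mock_extracted_length` function are hypothetical stand-ins. In a real pipeline the fitness function would preprocess the report image with the candidate parameters, run an OCR engine, and return the length of the recognized text.

```python
import random

# Hypothetical search space: names and ranges are illustrative assumptions,
# not the hyperparameters used in the paper.
BOUNDS = {
    "scale": (0.5, 3.0),        # image upscaling factor
    "threshold": (0.0, 255.0),  # binarization threshold
    "blur": (0.0, 5.0),         # smoothing radius
}


def mock_extracted_length(params):
    """Stand-in for 'run OCR with these params and return len(text)'.

    This toy surface peaks at scale=2.0, threshold=128, blur=1.0.
    """
    penalty = (
        ((params["scale"] - 2.0) / 2.5) ** 2
        + ((params["threshold"] - 128.0) / 255.0) ** 2
        + ((params["blur"] - 1.0) / 5.0) ** 2
    )
    return max(0.0, 1000.0 * (1.0 - penalty))


def random_individual(rng):
    # Draw each hyperparameter uniformly from its allowed range.
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one parent at random.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}


def mutate(ind, rng, rate=0.2):
    # Gaussian perturbation, clipped back into the allowed range.
    out = dict(ind)
    for k, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            step = rng.gauss(0.0, 0.1 * (hi - lo))
            out[k] = min(hi, max(lo, out[k] + step))
    return out


def tournament(pop, fitness, rng, k=3):
    # Pick the fittest of k randomly sampled individuals.
    return max(rng.sample(pop, k), key=fitness)


def run_ga(fitness, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    initial_best = max(fitness(p) for p in pop)
    for _ in range(generations):
        # Elitism: carry the current best individual over unchanged.
        children = [max(pop, key=fitness)]
        while len(children) < pop_size:
            parent_a = tournament(pop, fitness, rng)
            parent_b = tournament(pop, fitness, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        pop = children
    best = max(pop, key=fitness)
    return best, fitness(best), initial_best


best_params, best_score, initial_score = run_ga(mock_extracted_length)
```

Because the best individual is carried over in every generation, the final score can never fall below the best score of the initial random population. Text length alone does not guarantee correct recognition, which is consistent with the abstract's follow-up step of checking the NER output against manual annotations and adjusting parameters when entities are missed.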

Published in Machine Learning and Knowledge Extraction

ISSN: 2504-4990 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware
Website: https://www.mdpi.com/journal/make
