Applied Sciences (Jun 2022)

Visual Object Detection with DETR to Support Video-Diagnosis Using Conference Tools

  • Attila Biró,
  • Katalin Tünde Jánosi-Rancz,
  • László Szilágyi,
  • Antonio Ignacio Cuesta-Vargas,
  • Jaime Martín-Martín,
  • Sándor Miklós Szilágyi

DOI
https://doi.org/10.3390/app12125977
Journal volume & issue
Vol. 12, no. 12
p. 5977

Abstract


Real-time multilingual phrase detection during online video presentations, intended to support instant remote diagnostics, requires near real-time visual (textual) object detection and preprocessing for further analysis. Connecting remote specialists and sharing specific ideas is most effective in the participants' native languages. The main objective of this paper is to analyze DEtection TRansformer (DETR) models, architectures, and hyperparameters, and to propose recommendations and simplified procedures that achieve reasonable accuracy for real-time textual object detection. The development of AI-supported real-time video conference translation has a relevant impact in the health sector, especially on clinical practice, through better video consultation (VC) and remote diagnosis; its importance was further augmented by the COVID-19 pandemic. The challenge of this topic lies in the variety of languages and dialects spoken by the involved specialists, which usually requires human translators; these can be substituted by AI-enabled technological pipelines. The sensitivity of visual textual element localization is directly connected to the complexity, quality, and variety of the collected training data sets. In this research, we investigated the DETR model in several variations. The study highlights the differences among the most prominent real-time object detectors, YOLOv4, DETR, and Detectron2, and brings AI-based novelty to collaborative solutions combined with optical character recognition (OCR). The performance of the procedures was evaluated in two research phases: a training set of 248/512 records (Phase 1/Phase 2), with 55/110 validated data instances, covering 7/10 application categories and 3/3 object categories, using the same object categories for annotation.
The achieved scores exceed expectations for visual text detection, giving high detection accuracy of textual data, with mean average precision (mAP) ranging from 0.4 to 0.65.
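The mAP figures above are built from per-detection matching against annotated ground truth. As a minimal sketch (not the authors' code, and omitting the averaging over recall levels and classes that full mAP requires), the core step computes intersection-over-union (IoU) between predicted and ground-truth text boxes and greedily assigns each prediction to at most one unmatched ground-truth box:

```python
# Hedged sketch: IoU and greedy TP/FP matching for box detections.
# Boxes are (x_min, y_min, x_max, y_max) tuples; threshold 0.5 is a
# common convention, not a value stated in the paper.

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, iou_thr=0.5):
    """Greedily match predictions (box, confidence) to ground-truth boxes,
    highest confidence first; returns (true_positives, false_positives)."""
    used = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: -p[1]):
        best, best_iou = None, iou_thr
        for i, gt in enumerate(gts):
            if i in used:
                continue
            v = iou(box, gt)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            used.add(best)
            tp += 1
        else:
            fp += 1
    return tp, fp
```

From the per-image TP/FP counts, precision and recall follow directly, and averaging precision over recall levels (and then over object categories) yields the mAP values reported in the abstract.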

Keywords