Diyala Journal of Engineering Sciences (Sep 2024)
A Lightweight Visual Understanding System for Enhanced Assistance to the Visually Impaired Using an Embedded Platform
Abstract
Visually impaired individuals often face significant challenges in navigating their environments due to limited access to visual information. To address this issue, a portable, cost-effective assistive tool is proposed that operates on a low-power embedded system such as the Jetson Nano. The novelty of this research lies in developing an efficient, lightweight video captioning model within constrained resources to ensure its compatibility with embedded platforms. This research aims to enhance the autonomy and accessibility of visually impaired people by providing audio descriptions of their surroundings through the processing of live-streamed video. The proposed system comprises two distinct lightweight deep learning modules: an object detection module based on the state-of-the-art YOLOv7 model, and a video captioning module that uses the Video Swin Transformer and a 2D-CNN for feature extraction, together with a Transformer network for caption generation. The object detection module provides real-time identification of multiple objects in the user's surroundings, while the video captioning module provides detailed descriptions of entire visual scenes and activities, including objects, actions, and the relationships between them. The user interacts with the system through a headset, issuing a specific audio command to trigger the corresponding module, either object detection or video captioning, and receives an audio description of the visual content. The system demonstrates satisfactory results, achieving inference speeds of 0.11 to 1.1 seconds for object detection and 0.91 to 1.85 seconds for video captioning, evaluated through both quantitative metrics and subjective assessments.
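To make the audio-command routing between the two modules concrete, the sketch below shows one way such a pipeline could be organized on the embedded device. It is a minimal illustration, not the authors' implementation: detect_objects, caption_clip, and listen_for_command are hypothetical placeholders standing in for the YOLOv7 detector, the Video Swin Transformer captioner, and the speech-command interface, while OpenCV supplies the camera stream and pyttsx3 the spoken output.

```python
import cv2
import pyttsx3


# Hypothetical stand-ins for the two modules described in the abstract:
# a YOLOv7-based object detector and a Video Swin Transformer + Transformer
# captioner. A real system would load the trained models here.
def detect_objects(frame):
    """Return a list of object labels found in a single frame."""
    return ["chair", "table"]  # placeholder output


def caption_clip(frames):
    """Return a sentence describing a short clip (list of frames)."""
    return "a person is sitting at a table"  # placeholder output


def listen_for_command():
    """Return 'detect' or 'describe'; a real system would use speech recognition."""
    return "detect"  # placeholder command


def speak(text, engine):
    """Play a text description through the headset via offline text-to-speech."""
    engine.say(text)
    engine.runAndWait()


def main():
    tts = pyttsx3.init()          # offline text-to-speech engine
    cap = cv2.VideoCapture(0)     # live camera stream on the embedded board

    command = listen_for_command()
    if command == "detect":
        # Single-frame path: real-time multiple object identification.
        ok, frame = cap.read()
        if ok:
            labels = detect_objects(frame)
            speak("I can see " + ", ".join(labels), tts)
    elif command == "describe":
        # Clip path: gather a short sequence of frames for scene captioning.
        frames = []
        for _ in range(16):
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        speak(caption_clip(frames), tts)

    cap.release()


if __name__ == "__main__":
    main()
```

In this arrangement, the audio command selects exactly one of the two paths per request, which keeps only one model active at a time and helps stay within the memory and latency budget of a Jetson Nano-class device.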
Keywords