IEEE Access (Jan 2025)
MathVision: An Accessible Intelligent Agent for Visually Impaired People to Understand Mathematical Equations
Abstract
According to the World Health Organization, 2.2 billion people worldwide have some form of vision impairment. Children with visual impairment may experience delayed physical, linguistic, and cognitive development, resulting in lower levels of academic achievement. Many visually impaired people take part in the education sector, whether as students or as teachers. Without external assistance, reading mathematical equations embedded in images is very challenging for visually impaired people because of the complexity of the notation, symbols, and variables. This paper presents MathVision, a model that converts mathematical equations into voice, helping visually impaired people understand them. The proposed model uses the YOLOv7 object detection architecture to detect mathematical equations in images and categorize them into four distinct types: limits, trigonometry, integration, and an additional category. The YOLOv7 model divides the input image into a grid, each grid cell is responsible for detecting the equations that fall within it, and bounding-box coordinates, class labels, and confidence scores are predicted for each equation. In the next stage, a fine-tuned DenseNet is used for detailed feature extraction from equation images; a pre-trained DenseNet model is optimized to capture intricate patterns specific to equations, which improves the overall accuracy of equation detection and categorization in the system. In the subsequent phase, an attention-based LSTM network generates natural language descriptions for the detected equations. The integration of attention allows the model to focus on the pertinent portions of an equation during decoding. The LSTM architecture, chosen for its effectiveness with sequential data, is trained on a dataset of paired equations and corresponding human-written descriptions. Fine-tuning includes optimizing the hyperparameters for this task, and evaluation metrics such as the BLEU score are used to assess how accurately and contextually relevantly the model generates textual representations of the detected mathematical content. The text-to-speech (TTS) system takes the natural language sentence generated by the LSTM model as input and converts it into voice: it analyzes and processes the text with natural language processing and then synthesizes speech from the processed text using digital signal processing. The platform-independent pyttsx3 Python library is used for the conversion; it also works offline, which is the main reason for choosing it in this work. Because no dataset of mathematical equations with natural language descriptions was available, we created a custom dataset. We conducted real-world experiments in several schools for the visually impaired to determine whether visually impaired students can understand mathematical equations by listening to the generated voice. These experiments show that MathVision is an effective way for visually impaired students to read and write mathematical equations by listening to the spoken form of the equations produced by the proposed model.
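To make the evaluation step concrete, the short sketch below computes a sentence-level BLEU score between a human-written reference description and a model-generated candidate. NLTK is used here only as one readily available implementation of the BLEU metric mentioned above, and the two sentences are hypothetical examples, not items from the MathVision dataset.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference (human-written) and candidate (model-generated) descriptions
reference = "the integral from zero to one of x squared with respect to x".split()
candidate = "integral from zero to one of x squared with respect to x".split()

# Smoothing avoids zero scores when higher-order n-grams have no matches
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```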
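The final text-to-speech stage can be illustrated with a minimal offline sketch using the pyttsx3 library named above; the example sentence and the speaking-rate setting are illustrative assumptions rather than the exact configuration used in MathVision.

```python
import pyttsx3

def speak_description(description: str) -> None:
    """Convert a generated equation description to speech offline with pyttsx3."""
    engine = pyttsx3.init()            # uses the platform's default TTS driver
    engine.setProperty("rate", 150)    # illustrative speaking rate (words per minute)
    engine.say(description)            # queue the sentence for playback
    engine.runAndWait()                # block until speech playback finishes

# Example: a description such as the attention-based LSTM decoder might produce
speak_description("The integral from zero to one of x squared with respect to x.")
```

Because pyttsx3 relies on the operating system's own speech drivers (SAPI5, NSSpeechSynthesizer, or eSpeak), no network connection is required, which matches the offline requirement stated above.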
Keywords
Mathematical equations, fine-tuning, YOLOv7, convolutional neural network, attention mechanism, long short-term memory, neural text-to-speech, technological development