IEEE Access (Jan 2024)
QueryMintAI: Multipurpose Multimodal Large Language Models for Personal Data
Abstract
We present QueryMintAI, a versatile multimodal Large Language Model (LLM) framework designed to address the complex challenges of processing diverse user inputs and generating corresponding outputs across different modalities. The proliferation of heterogeneous data formats, including text, images, videos, documents, URLs, and audio recordings, necessitates an intelligent system capable of understanding and responding to user queries effectively. Existing models often exhibit limitations in handling multimodal inputs and generating coherent outputs across modalities. The proposed QueryMintAI framework leverages state-of-the-art models such as GPT-3.5 Turbo, DALL-E 2, TTS-1, and Whisper v2, among others, to enable seamless interaction with users across multiple modalities. By integrating advanced natural language processing (NLP) techniques with domain-specific models, QueryMintAI offers a comprehensive solution for text-to-text, text-to-image, text-to-video, and text-to-audio conversion. The system also supports document processing, URL analysis, image description, video summarization, audio transcription, and database querying, catering to diverse user needs and preferences. QueryMintAI addresses several limitations observed in existing approaches, including restricted modality support, poor adaptability to varied data formats, and limited response-generation capabilities, by combining advanced NLP algorithms, deep learning architectures, and multimodal fusion techniques.
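As a rough illustration of the modality routing described above, the sketch below dispatches a text prompt to a text, image, or speech model and transcribes audio input. It is a minimal sketch assuming the OpenAI Python SDK (v1.x) and hosted model identifiers (gpt-3.5-turbo, dall-e-2, tts-1, whisper-1); the answer and transcribe helpers are illustrative and not the authors' implementation.

# Minimal sketch (illustrative, not the authors' code): route a query to the
# model matching the requested output modality, using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(prompt: str, modality: str = "text"):
    """Dispatch a text prompt to a text, image, or speech model."""
    if modality == "text":  # text-to-text
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if modality == "image":  # text-to-image
        resp = client.images.generate(
            model="dall-e-2", prompt=prompt, n=1, size="1024x1024"
        )
        return resp.data[0].url
    if modality == "audio":  # text-to-audio (speech synthesis)
        speech = client.audio.speech.create(
            model="tts-1", voice="alloy", input=prompt
        )
        speech.stream_to_file("answer.mp3")
        return "answer.mp3"
    raise ValueError(f"unsupported output modality: {modality}")

def transcribe(audio_path: str) -> str:
    """Audio-to-text; Whisper is exposed as 'whisper-1' in the hosted API."""
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text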
Keywords