IEEE Access (Jan 2024)
Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services
Abstract
Fairness and bias problems caused by the narrow coverage of AI models have become another challenge for AI researchers. If a commercial AI is trained on a biased dataset, severe gender or racial fairness and bias issues can follow. Because researchers train AI mainly on datasets in dominant languages, a new Large Language Model (LLM) cannot satisfy a broad audience if its knowledge or creativity is limited in the language they speak. Narrow coverage can also lead to misinterpretation and confusion when the service involves Speech-To-Text (STT). In this paper, to overcome this data diversity issue, we propose that the features extracted from embeddings capture semantic proximity information that can be used to mitigate diversity issues. This project focuses on a Korean-language food dataset for STT services, a lifestyle-related domain where a narrowly trained AI is prone to show its limitations. As a proof of concept, we trained a baseline model, GPT-2, on the Korean Wikipedia dataset in 2022. We then employed DistilBERT and KoBERT for comparison. The hidden_state_output features extracted from each model were used to build feature-extraction-based text search engines. We follow the same idea as Locality-Sensitive Hashing (LSH), but locate similar hashes efficiently by applying transposed weights. We also present conventional classification benchmarks for performance comparison, using top-k measurements, training time, and memory and disk consumption. In the discussion, we argue that our approach can mitigate the diversity problem without re-training the model or the tokenizer.
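To make the search idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each text has already been turned into a fixed-length feature vector (random vectors stand in here for hidden-state features), and it reads "applying transposed weights" as scoring a query against all stored features with one matrix product against the transposed feature matrix; that reading is an assumption. The names `build_index` and `search` are hypothetical.

```python
import numpy as np


def normalize_rows(x):
    """Scale every row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def build_index(features):
    """Hypothetical helper: store unit-length feature vectors, one row per text."""
    return normalize_rows(np.asarray(features, dtype=np.float64))


def search(index, query, k=3):
    """Return the k stored rows most similar to the query as (row, score) pairs."""
    q = np.asarray(query, dtype=np.float64)
    q = q / max(np.linalg.norm(q), 1e-12)
    # One product with the transposed index scores the query against every row.
    scores = q @ index.T
    k = min(k, index.shape[0])
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Placeholder features: in practice these would come from a language model.
rng = np.random.default_rng(0)
item_features = rng.normal(size=(5, 8))
index = build_index(item_features)

# A slightly perturbed copy of item 2 stands in for a noisy transcription.
noisy_query = item_features[2] + 0.01 * rng.normal(size=8)
results = search(index, noisy_query, k=3)
```

Because the search runs only on stored features, adding new items under this sketch means appending rows to the index; the model and tokenizer that produced the features are left untouched.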
Keywords