IEEE Access (Jan 2024)

Feature-Based Text Search Engine Mitigating Data Diversity Problem Using Pre-Trained Large Language Model for Fast Deployment Services

  • Yongwoo Jeong,
  • Jiseon Yang,
  • In Ho Choi,
  • Juyeon Lee

DOI
https://doi.org/10.1109/ACCESS.2024.3373470
Journal volume & issue
Vol. 12
pp. 48145–48157

Abstract

Narrow data coverage in AI creates fairness and bias challenges for researchers: a commercial AI trained on a biased dataset can exhibit severe gender or racial bias. Because researchers train AI primarily on datasets in dominant languages, a novel LLM (Large Language Model) cannot satisfy a broad audience if it shows knowledge or creativity limitations in a user's specific spoken language. When the service involves STT (Speech-To-Text), this narrow coverage can further lead users to misinterpretation and confusion. In this paper, to overcome this data-diversity issue, we propose that the extracted embedding features already capture semantic-proximity information that can be used to mitigate diversity problems. This project focuses on a Korean-language food dataset for STT services, a lifestyle-related domain where a narrowly trained AI is prone to show its limitations. As a proof of concept, we trained a baseline model, GPT2, on the 2022 Korean Wikipedia dataset, and employed DistilBERT and KoBERT for comparison. The hidden_state_output features extracted from each model were used to build feature-extraction-based text search engines. We follow the idea of Locality-Sensitive Hashing (LSH) but locate similar hashes more effectively by applying transposed weights. We also present conventional classification benchmarks for performance comparison, using top-k measurements, training time, and memory and disk consumption. In the discussion, we show that our idea can mitigate the diversity problem without re-training the model or the tokenizer.
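
To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of the two steps the abstract names: extracting hidden-state features from a pre-trained Transformer and bucketing them with random-hyperplane LSH for approximate text search. The checkpoint name, mean pooling, 16-bit hash width, and toy food-name corpus are illustrative assumptions, and the paper's transposed-weights refinement for locating similar hashes is not reproduced here.

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative model choice; the paper compares GPT2, DistilBERT, and KoBERT.
MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed(texts):
    """Mean-pool the last hidden state into one feature vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

corpus = ["김치찌개", "된장찌개", "비빔밥"]                  # toy Korean food names
corpus_vecs = embed(corpus)

# Random-hyperplane LSH: the sign pattern of the projections is the hash code.
rng = np.random.default_rng(0)
planes = rng.normal(size=(16, corpus_vecs.shape[1]))   # assumed 16-bit codes

def lsh_hash(vecs):
    return (vecs @ planes.T > 0).astype(np.uint8)

codes = lsh_hash(corpus_vecs)

def search(query, k=2):
    """Rank corpus entries by Hamming distance between hash codes."""
    q = lsh_hash(embed([query]))[0]
    return [corpus[i] for i in np.argsort((codes != q).sum(1))[:k]]

print(search("김치 찌게"))  # a noisy STT-style query should still rank 김치찌개 first

Because search compares compact binary codes rather than raw embeddings, a misspelled or mis-transcribed query can still land in the right bucket, which is the property the paper exploits to serve diverse inputs without re-training the model or tokenizer.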

Keywords