IEEE Access (Jan 2024)
Distance Matters: Euclidean Embedding Distances for Improved Language Model Generalization and Adaptability
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), enabling machines to process, understand, and generate human-like text with high accuracy. However, current practices for training and evaluating these models often overlook the relationship between the embeddings of training and testing samples, which can lead to overfitting and limit generalization. This paper introduces a new approach to enhancing the performance, reliability, and generalization of LLMs by curating training and testing samples based on the Euclidean distances between their embeddings. The central hypothesis is that training models on samples whose embeddings lie far, in Euclidean distance, from the testing embeddings, coupled with evaluations spanning diverse distances, will improve the models’ robustness and adaptability to inputs diverging from the training data distribution. A comprehensive evaluation across multiple datasets and architectures shows that models trained on samples with high Euclidean distances from the testing samples generally exhibit superior generalization and robustness compared to those trained on low-distance samples. The proposed evaluation methodology, which assesses performance across a range of distances, provides a more reliable measure of a model’s true adaptability. This study provides insights into the relationship between training data diversity and model reliability, paving the way for more robust and generalizable LLMs.
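To make the curation criterion concrete, the following is a minimal sketch, not the paper's implementation, of how one might score training samples by their mean Euclidean distance to a test set and retain the most distant ones. The function name select_high_distance, the mean-distance aggregation, and the synthetic embeddings are illustrative assumptions.

import numpy as np

def select_high_distance(train_emb: np.ndarray,
                         test_emb: np.ndarray,
                         k: int) -> np.ndarray:
    """Return indices of the k training samples whose embeddings lie
    farthest, by mean Euclidean distance, from the test embeddings."""
    # Pairwise Euclidean distances via broadcasting: shape (n_train, n_test).
    diff = train_emb[:, None, :] - test_emb[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Score each training sample by its average distance to the test set.
    mean_dist = dists.mean(axis=1)
    # Indices of the highest-scoring (most distant) samples first.
    return np.argsort(mean_dist)[::-1][:k]

# Toy usage: 1,000 training and 100 test embeddings of dimension 768.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 768))
test_emb = rng.normal(size=(100, 768))
chosen = select_high_distance(train_emb, test_emb, k=200)

Aggregating by mean distance is one plausible reading of "high Euclidean distances between training and testing embeddings"; a minimum or quantile distance to the test set would capture different notions of how far a training sample sits from the evaluation distribution.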
Keywords