IEEE Access (Jan 2024)
Distance Matters: Euclidean Embedding Distances for Improved Language Model Generalization and Adaptability
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP), enabling machines to process, understand, and generate human-like text with high accuracy. However, current practices for training and evaluating these models often overlook the relationship between the embeddings of training and testing samples, which can lead to overfitting and limit generalization. This paper introduces a new approach to enhancing the performance, reliability, and generalization of LLMs by curating training and testing samples based on the Euclidean distances between their embeddings. The central hypothesis is that training models on samples whose embeddings lie far, in Euclidean distance, from the testing embeddings, coupled with evaluations spanning diverse distances, will improve the models’ robustness and adaptability to inputs diverging from the training data distribution. A comprehensive evaluation across multiple datasets and architectures shows that models trained on samples with high Euclidean distances from the testing samples generally exhibit superior generalization and robustness compared to those trained on low-distance samples. The proposed evaluation methodology, which assesses performance across a range of distances, provides a more reliable measure of a model’s true adaptability. This study provides insights into the relationship between training data diversity and model reliability, paving the way for more robust and generalizable LLMs.
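To make the curation criterion concrete, the following is a minimal sketch, not the paper's implementation, of how one might score training samples by their mean Euclidean distance to a test set and retain the most distant ones. The function name select_high_distance, the mean-distance aggregation, and the synthetic embeddings are illustrative assumptions.

import numpy as np

def select_high_distance(train_emb: np.ndarray,
                         test_emb: np.ndarray,
                         k: int) -> np.ndarray:
    """Return indices of the k training samples whose embeddings lie
    farthest, by mean Euclidean distance, from the test embeddings."""
    # Pairwise Euclidean distances via broadcasting: shape (n_train, n_test).
    diff = train_emb[:, None, :] - test_emb[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Score each training sample by its average distance to the test set.
    mean_dist = dists.mean(axis=1)
    # Indices of the highest-scoring (most distant) samples first.
    return np.argsort(mean_dist)[::-1][:k]

# Toy usage: 1,000 training and 100 test embeddings of dimension 768.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 768))
test_emb = rng.normal(size=(100, 768))
chosen = select_high_distance(train_emb, test_emb, k=200)

Aggregating by mean distance is one plausible reading of "high Euclidean distances between training and testing embeddings"; a minimum or quantile distance to the test set would capture different notions of how far a training sample sits from the evaluation distribution.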
Keywords