Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation

Михаил Тихомиров; Даниил Чернышев

doi:10.17323/jle.2024.22224

Journal of Language and Education (Dec 2024)

Affiliations

Михаил Тихомиров: Lomonosov Moscow State University, Moscow, Russia
Даниил Чернышев: Lomonosov Moscow State University, Moscow, Russia

Journal volume & issue: Vol. 10, no. 4

Abstract

Background: Recent advancements in large language model (LLM) technologies have introduced powerful open-source instruction-tuned LLMs that match the text generation quality of leading models like GPT-4. Although these models accelerate LLM adoption in sensitive-information environments, the lack of disclosed training data hinders replication and keeps these achievements exclusive to specific models.
Purpose: Given the multilingual nature of the latest iteration of open-source LLMs, the benefits of training language-specific LLMs diminish, leaving computational efficiency as the sole guaranteed advantage of this computationally expensive procedure. This work aims to address the language-adaptation limitations posed by restricted access to high-quality instruction-tuning data, offering a more cost-effective pipeline.
Method: To tackle language-adaptation challenges, we introduce Learned Embedding Propagation (LEP), a novel method with lower training data requirements and minimal disruption of existing LLM knowledge. LEP employs an embedding propagation technique that bypasses the need for instruction-tuning and directly integrates new language knowledge into any instruction-tuned LLM variant. Additionally, we developed Darumeru, a new benchmark for evaluating text generation robustness during training, specifically tailored for Russian adaptation.
Results: We applied the LEP method to adapt LLaMa-3-8B and Mistral-7B for Russian, testing four different vocabulary adaptation scenarios. Evaluation demonstrates that LEP achieves competitive performance, comparable to OpenChat 3.5 and LLaMa-3-8B-Instruct. Further improvements were observed through self-calibration and additional instruction-tuning steps, enhancing task-solving capabilities beyond the original models.
Conclusion: LEP offers a viable and efficient alternative to traditional language-specific instruction-tuning, significantly reducing the costs associated with language adaptation while maintaining or surpassing the performance benchmarks set by contemporary LLMs.

Published in Journal of Language and Education

ISSN: 2411-7390 (Online)
Publisher: National Research University Higher School of Economics
Country of publisher: Russian Federation
LCC subjects: Education; Language and Literature: Philology. Linguistics
Website: https://jle.hse.ru/about
