IEEE Access (Jan 2025)
ChatGPT Versus Modest Large Language Models: An Extensive Study on Benefits and Drawbacks for Conversational Search
Abstract
Large Language Models (LLMs) are effective at modeling the syntactic and semantic content of text, making them a strong choice for conversational query rewriting. While previous approaches proposed custom NLP models that require significant engineering effort, our approach is straightforward and conceptually simpler. Not only do we improve effectiveness over the current state of the art, but we also address cost and efficiency. We explore the use of pre-trained LLMs fine-tuned to generate high-quality rewrites of user queries, aiming to reduce computational costs while maintaining or improving retrieval effectiveness. As a first contribution, we study various prompting approaches, including zero-, one-, and few-shot prompting, with ChatGPT (specifically gpt-3.5-turbo). We observe an increase in the quality of rewrites, leading to improved retrieval. We then fine-tune smaller open LLMs on the query rewriting task. Our results demonstrate that our fine-tuned models, including the smallest with 780 million parameters, achieve higher retrieval effectiveness than gpt-3.5-turbo. To fine-tune the selected models, we used the QReCC dataset, which is specifically designed for query rewriting tasks. For evaluation, we used the TREC CAsT datasets to assess the retrieval effectiveness of the rewrites produced by both gpt-3.5-turbo and our fine-tuned models. Our findings show that fine-tuning LLMs on conversational query rewriting datasets can be more effective than relying on generic instruction-tuned models or traditional query reformulation techniques.
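To make the few-shot prompting setup concrete, the sketch below shows how a conversational query-rewriting prompt might be assembled before being sent to an LLM. The example conversations, the separator token, and the prompt wording are hypothetical illustrations, not the paper's actual prompts.

```python
# Minimal sketch of few-shot prompt construction for conversational
# query rewriting. The examples and formatting below are hypothetical
# illustrations, not the prompts used in the study.

FEW_SHOT_EXAMPLES = [
    ("What is throat cancer? ||| Is it treatable?",
     "Is throat cancer treatable?"),
    ("Tell me about the Bronze Age. ||| When did it end?",
     "When did the Bronze Age end?"),
]

def build_rewrite_prompt(history: list[str], current_query: str) -> str:
    """Build a few-shot prompt asking an LLM to rewrite `current_query`
    into a self-contained query, given the conversation `history`."""
    lines = ["Rewrite the last question so it can be understood "
             "without the conversation history.\n"]
    for context, rewrite in FEW_SHOT_EXAMPLES:
        lines.append(f"Conversation: {context}")
        lines.append(f"Rewrite: {rewrite}\n")
    lines.append("Conversation: " + " ||| ".join(history + [current_query]))
    lines.append("Rewrite:")
    return "\n".join(lines)

prompt = build_rewrite_prompt(
    ["Who wrote The Hobbit?"], "When was it published?")
print(prompt)
```

The resulting string would then be passed to a chat-completion endpoint (e.g., gpt-3.5-turbo) or to a fine-tuned open model; the model's completion after the final "Rewrite:" serves as the de-contextualized query fed to the retrieval stage.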
Keywords