Jisuanji Kexue yu Tansuo (Journal of Frontiers of Computer Science and Technology), Dec 2024
CFB: An Evaluation Method for Financial Large Language Models
Abstract
As potential applications of large language models (LLMs) in the financial sector continue to emerge, evaluating the performance of financial LLMs has become increasingly important. However, current financial evaluation methods suffer from limitations such as a narrow range of evaluation tasks, insufficient coverage of evaluation datasets, and contamination of benchmark data; consequently, the potential of LLMs in the financial domain has not been fully explored. To address these issues, this paper proposes the Chinese financial benchmark (CFB) for evaluating financial LLMs. CFB encompasses 36 datasets covering 24 financial tasks and involves 7 evaluation task types: question answering, terminology explanation, text generation, text translation, classification, speech recognition, and predictive decision-making, and it establishes corresponding baselines for each. CFB is novel in three respects: a broader range of tasks and data, an LLM-based benchmark decontamination method, and three evaluation approaches, namely instruction fine-tuning, knowledge retrieval augmentation, and prompt engineering. An evaluation of 12 LLMs, including GPT-4o, ChatGPT, and Gemini, reveals that although LLMs excel at information extraction and text analysis, they struggle with advanced reasoning and complex tasks. GPT-4o performs exceptionally well in information extraction and stock trading, whereas Gemini excels at text generation and prediction. Instruction fine-tuning improves LLMs' performance on text analysis but offers limited benefit for complex tasks.
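To make the LLM-based decontamination idea concrete, the following is a minimal, hypothetical Python sketch, not the paper's actual procedure: each benchmark item is truncated, the model under test is asked to complete the prefix, and items whose completion closely reproduces the held-out continuation are flagged as likely contaminated (near-verbatim reproduction suggests the item appeared in training data). The function names, the 0.8 threshold, and the prefix-completion heuristic are all illustrative assumptions.

# Hypothetical sketch of LLM-based benchmark decontamination.
# `complete` is any text-completion callable (e.g., a wrapper around an LLM API).
from difflib import SequenceMatcher
from typing import Callable

def contamination_score(item: str, complete: Callable[[str], str],
                        split: float = 0.5) -> float:
    """Similarity in [0, 1] between the model's completion of the first
    half of `item` and the true second half; higher suggests memorization."""
    cut = max(1, int(len(item) * split))
    prefix, reference = item[:cut], item[cut:]
    completion = complete(prefix)[:len(reference)]
    return SequenceMatcher(None, completion, reference).ratio()

def decontaminate(items: list[str], complete: Callable[[str], str],
                  threshold: float = 0.8) -> list[str]:
    """Keep only items the model cannot reproduce from their prefix."""
    return [it for it in items if contamination_score(it, complete) < threshold]

if __name__ == "__main__":
    # Stand-in completion function; swap in a real LLM call in practice.
    benchmark = ["What factors drive the closing price of an index fund? ..."]
    echo = lambda prefix: prefix  # trivially non-memorizing placeholder
    clean = decontaminate(benchmark, echo)
    print(f"{len(clean)}/{len(benchmark)} items kept")

Under these assumptions the check needs only the model's completion interface, so it can be run against closed-source models such as GPT-4o or Gemini without access to their training corpora.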
Keywords