Teknika (Jul 2025)
Fine-Hybrid: Integration of BM25 And Finetuned SBERT to Enhance Search Relevance
Abstract
Legal information retrieval, particularly for tax law documents, is difficult because of specialized terminology, complex hierarchical structures, and formal language patterns that existing search approaches handle poorly. Current methods rely either on lexical matching or on general-purpose semantic models, leaving a gap in the effective retrieval of relevant tax law information. This research develops a hybrid search system to improve search result relevance on the General Provisions and Tax Procedures (KUP) dataset by integrating a lexical search method (BM25) with semantic search based on Sentence-BERT (SBERT) fine-tuned on a taxation corpus. The methodology has three components: generation of synthetic data for SBERT fine-tuning through a two-stage LLM prompting approach, a query normalization system with taxation-specific terminology mapping, and fusion of lexical and semantic results through Reciprocal Rank Fusion (RRF). We evaluate system performance with input from tax domain experts and show that the Fine-Hybrid model consistently outperforms the individual search methods, achieving a Precision@N of 66.021% and an Average Recall of 76.51%. The approach addresses the specific challenges of tax document retrieval and provides a framework that can be applied to other specialized domains with similar characteristics. This research contributes both theoretical advances in hybrid search methodologies for legal documents and practical solutions for improving access to tax information, with implications for administrative efficiency and taxpayer compliance.
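The fusion step named in the abstract, Reciprocal Rank Fusion, can be illustrated with a minimal sketch. It assumes each retriever (for example BM25 and the fine-tuned SBERT model) returns an ordered list of document identifiers, and it scores each document as the sum of 1 / (k + rank) over the lists in which it appears. The function name, the document identifiers, and the constant k = 60 (a commonly used default) are assumptions for illustration; the abstract does not state the value or implementation used in the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant (assumed here; 60 is a common default).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical document IDs standing in for KUP passages.
bm25_ranking = ["a", "b", "c"]
sbert_ranking = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, sbert_ranking])
print(fused)  # ['b', 'c', 'a', 'd']
```

Documents retrieved by both methods ("b" and "c") rise above those found by only one, which is the behavior a hybrid system relies on when lexical and semantic evidence agree.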
Keywords