Proceedings of the XXth Conference of Open Innovations Association FRUCT (Apr 2024)
Fast and Precise Convolutional Jaro and Jaro-Winkler Similarity
Abstract
In the domain of character-based approximate string matching, edit distances such as Levenshtein have remained predominant despite their quadratic time complexity. This reality has prompted the adoption of more efficient metrics like Jaro and Jaro-Winkler. However, these methods often overlook the significance of character order within the matching window, which can adversely affect accuracy. For the first time, we introduce a novel class of character-based approximate string matching algorithms that leverage a convolutional kernel, surpassing the performance of existing state-of-the-art unsupervised character-based approximate string matching algorithms. This paper presents Convolutional Jaro (ConvJ) and Convolutional Jaro-Winkler (ConvJW), innovative similarity metrics designed to overcome these shortcomings. ConvJ and ConvJW utilize a convolutional approach with Gaussian weighting to effectively capture the positional proximity of matching characters, resulting in a more precise similarity evaluation. This method not only achieves computational efficiency comparable to that of Jaro and Jaro-Winkler but also surpasses the state-of-the-art in terms of F1-score, demonstrating faster execution times compared to the conventional Jaro and Jaro-Winkler implementations across various datasets. Our extensive experimental analysis highlights the exceptional performance of ConvJ and ConvJW across a range of datasets. Remarkably, ConvJ exhibits a 7x faster execution time than the fast Jaro implementation and exceeds the state-of-the-art F1-score by a significant margin of 10% more than Jaro. By setting a new benchmark in unsupervised character-based approximate string matching, our research shows the new way for future exploration and development in this field. The ConvJ and ConvJW algorithms, characterized by their quasilinear time complexity and improved accuracy, provide a solid foundation for the advancement of string matching techniques. These developments hold promise for a broad spectrum of applications in data mining, bioinformatics, and related areas.
Keywords