Mathematics (Oct 2024)

Assessing Scientific Text Similarity: A Novel Approach Utilizing Non-Negative Matrix Factorization and Bidirectional Encoder Representations from Transformer

  • Zhixuan Jia,
  • Wenfang Tian,
  • Wang Li,
  • Kai Song,
  • Fuxin Wang,
  • Congjing Ran

DOI
https://doi.org/10.3390/math12213328
Journal volume & issue
Vol. 12, no. 21
p. 3328

Abstract

Read online

The patent serves as a vital component of scientific text, and over time, escalating competition has generated a substantial demand for patent analysis encompassing areas such as company strategy and legal services, necessitating fast, accurate, and easily applicable similarity estimators. At present, conducting natural language processing(NLP) on patent content, including titles, abstracts, etc., can serve as an effective method for estimating similarity. However, the traditional NLP approach has some disadvantages, such as the requirement for a huge amount of labeled data and poor explanation of deep-learning-based model internals, exacerbated by the high compression of patent content. On the other hand, most knowledge-based deep learning models require a vast amount of additional analysis results as training variables in similarity estimation, which are limited due to human participation in the analysis part. Thus, in this research, addressing these challenges, we introduce a novel estimator to enhance the transparency of similarity estimation. This approach integrates a patent’s content with international patent classification (IPC), leveraging bidirectional encoder representations from transformers (BERT), and non-negative matrix factorization (NMF). By integrating these techniques, we aim to improve knowledge discovery transparency in NLP across various IPC dimensions and incorporate more background knowledge into context similarity estimation. The experimental results demonstrate that our model is reliable, explainable, highly accurate, and practically usable.

Keywords