Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) (Mar 2025)
Comparing Word Representations of BERT and RoBERTa in Keyphrase Extraction Using TgGAT
Abstract
In this digital era, accessing vast amounts of information from websites and academic papers has become easier. However, efficiently locating relevant content remains challenging due to the overwhelming volume of data. Keyphrase extraction systems automate the generation of phrases that accurately represent a document’s main topics. These systems are crucial for supporting various natural language processing tasks, such as text summarization, information retrieval, and document representation. The traditional practice of manually selecting keyphrases is still common but often proves inefficient and inconsistent in summarizing the main ideas of a document. This study introduces an approach that integrates pre-trained language models (PLMs), BERT and RoBERTa, with Topic-Guided Graph Attention Networks (TgGAT) to enhance keyphrase extraction. TgGAT strengthens the extraction process by combining topic modelling with graph-based structures, providing a more structured and context-aware representation of a document’s key topics. By leveraging the strengths of both graph-based and transformer-based models, this research proposes a framework that improves keyphrase extraction performance. This is the first study to apply graph-based and PLM methods to keyphrase extraction in the Indonesian language. The results revealed that BERT outperformed RoBERTa, with precision, recall, and F1-scores of 0.058, 0.070, and 0.062, respectively, compared to RoBERTa’s 0.026, 0.030, and 0.027. These results indicate that BERT with TgGAT produced more representative keyphrases than RoBERTa with TgGAT. These findings underline the benefits of integrating graph-based approaches with pre-trained models for capturing both semantic relationships and topic relevance.
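As an illustration of the first stage of such a pipeline, the sketch below shows how contextual word representations could be obtained from BERT and RoBERTa with the Hugging Face transformers library. It is a minimal sketch, not the paper's implementation: the Indonesian checkpoints named here (indobenchmark/indobert-base-p1 and flax-community/indonesian-roberta-base) and the use of the last hidden layer are assumptions, and the TgGAT stage that consumes these vectors is omitted.

    # Minimal sketch: per-token contextual representations from BERT/RoBERTa.
    # Checkpoint names and the last-hidden-layer choice are illustrative
    # assumptions; the TgGAT graph stage is not shown here.
    import torch
    from transformers import AutoModel, AutoTokenizer

    def embed(text: str, model_name: str) -> torch.Tensor:
        """Return one contextual vector per token (shape: seq_len x hidden)."""
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name)
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state.squeeze(0)

    text = "Sistem ekstraksi frasa kunci untuk dokumen berbahasa Indonesia."
    bert_vecs = embed(text, "indobenchmark/indobert-base-p1")        # assumed BERT checkpoint
    roberta_vecs = embed(text, "flax-community/indonesian-roberta-base")  # assumed RoBERTa checkpoint

In a full system, these token vectors would serve as node features for candidate phrases in the TgGAT graph, where topic information guides the attention weights.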
Keywords