Sentence Boundary Extraction from Scientific Literature of Electric Double Layer Capacitor Domain: Tools and Techniques

Md. Saef Ullah Miah; Junaida Sulaiman; Talha Bin Sarwar; Ateeqa Naseer; Fasiha Ashraf; Kamal Zuhairi Zamli; Rajan Jose

doi:10.3390/app12031352

Applied Sciences (Jan 2022)

Affiliations

Md. Saef Ullah Miah: Faculty of Computing, College of Computing and Applied Sciences, Universiti Malaysia Pahang, Pekan 26600, Malaysia
Junaida Sulaiman: Faculty of Computing, College of Computing and Applied Sciences, Universiti Malaysia Pahang, Pekan 26600, Malaysia
Talha Bin Sarwar: Faculty of Computing, College of Computing and Applied Sciences, Universiti Malaysia Pahang, Pekan 26600, Malaysia
Ateeqa Naseer: Department of Software Engineering, School of Systems and Technology, University of Management and Technology, Lahore 54782, Pakistan
Fasiha Ashraf: Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54782, Pakistan
Kamal Zuhairi Zamli: Faculty of Computing, College of Computing and Applied Sciences, Universiti Malaysia Pahang, Pekan 26600, Malaysia
Rajan Jose: Faculty of Industrial Sciences & Technology, Universiti Malaysia Pahang, Gambang 26300, Malaysia

DOI: https://doi.org/10.3390/app12031352
Journal volume & issue: Vol. 12, no. 3, p. 1352

Abstract

With the growth of scientific literature on the web, particularly in materials science, extracting data accurately from that literature has become increasingly important. Material information systems, or chemical information systems, play an essential role in discovering data, materials, or synthesis processes from the existing scientific literature. These systems rely on processing and understanding the natural language of scientific literature, and therefore depend heavily on appropriate textual content. Here, appropriate textual content means complete, meaningful sentences recovered from a large chunk of text. The process of detecting where each sentence begins and ends and extracting it as a correct sentence is called sentence boundary extraction. Accurate extraction of sentence boundaries from PDF documents is essential for readability and natural language processing. This study therefore provides a comparative analysis of tools for extracting text from PDF documents that are available as Python libraries or packages and are widely used by the research community. The main objective is to identify, among the available techniques, the one best suited to correctly extracting sentences from PDF files as text. The performance of the techniques used (PyPDF2, pdfminer.six, PyMuPDF, pdftotext, Tika, and GROBID) is reported in terms of precision, recall, F1 score, run time, and memory consumption. The natural language processing (NLP) tools NLTK, spaCy, and Gensim are used to identify sentence boundaries. Of all the techniques studied, the GROBID PDF extraction package combined with the NLP tool spaCy achieved the highest F1 score, 93%, and consumed the least memory, 46.13 MB.
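
As a rough illustration of how precision, recall, and F1 score can be computed for extracted sentences, the following minimal Python sketch compares a list of extracted sentences against a hand-made reference list. It assumes exact matching after whitespace normalization; the abstract does not state the matching criterion the authors used, so the function `sentence_prf`, the helper `normalize`, and the sample sentences are all hypothetical.

```python
from collections import Counter


def normalize(sentence: str) -> str:
    """Collapse runs of whitespace so line-wrap differences do not matter."""
    return " ".join(sentence.split())


def sentence_prf(extracted, gold):
    """Return (precision, recall, F1) for extracted vs. reference sentences.

    Assumption: a sentence counts as correct only if it matches a reference
    sentence exactly after whitespace normalization.
    """
    ext = Counter(normalize(s) for s in extracted if s.strip())
    ref = Counter(normalize(s) for s in gold if s.strip())
    true_pos = sum((ext & ref).values())  # multiset intersection
    n_ext = sum(ext.values())
    n_ref = sum(ref.values())
    precision = true_pos / n_ext if n_ext else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical reference sentences (not taken from the study's data).
gold = [
    "EDLCs store charge electrostatically.",
    "The capacitance was 120 F/g at 1 A/g.",
    "Activated carbon is a common electrode material.",
]

# Hypothetical extractor output: the last sentence was split at a line break.
extracted = [
    "EDLCs store charge electrostatically.",
    "The capacitance was 120 F/g at 1 A/g.",
    "Activated carbon is a common",
    "electrode material.",
]

precision, recall, f1 = sentence_prf(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case a single wrongly split sentence produces two false positives and one false negative, which shows why line-break handling in the PDF-to-text step can strongly affect sentence-level scores.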

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
