A HYBRID APPROACH FOR SCHOLARLY INFORMATION EXTRACTION

Zalán BODÓ; Lehel CSATÓ

doi:10.24193/subbi.2017.2.01

Studia Universitatis Babes-Bolyai: Series Informatica (Dec 2017)

Affiliations

Zalán BODÓ: Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Cluj-Napoca, Romania
Lehel CSATÓ: Faculty of Mathematics and Computer Science, Babeș-Bolyai University, Cluj-Napoca, Romania

Journal volume & issue: Vol. 62, no. 2

Abstract

Metadata extraction from documents is an essential part of web or desktop search systems. Similarly, digital libraries that index scholarly literature need to find and extract the title, the list of authors, and other publication-related information from each article. We present a hybrid approach to metadata extraction that combines classification and clustering to extract the desired information without the need for a conventional labeled training dataset. An important asset of the proposed method is that the resulting clustering parameters can be reused in other problems, e.g. document layout analysis.

Keywords: information extraction, metadata, machine learning

Published in Studia Universitatis Babes-Bolyai: Series Informatica

ISSN: 1224-869X (Print); 2065-9601 (Online)
Publisher: Babes-Bolyai University, Cluj-Napoca
Country of publisher: Romania
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.cs.ubbcluj.ro/~studia-i/
