Studia Universitatis Babes-Bolyai: Series Informatica (Dec 2017)

A HYBRID APPROACH FOR SCHOLARLY INFORMATION EXTRACTION

  • Zalán BODÓ,
  • Lehel CSATÓ

DOI
https://doi.org/10.24193/subbi.2017.2.01
Journal volume & issue
Vol. 62, no. 2

Abstract

Metadata extraction from documents forms an essential part of web and desktop search systems. Similarly, digital libraries that index scholarly literature need to find and extract the title, the list of authors, and other publication-related information from an article. We present a hybrid approach to metadata extraction that combines classification and clustering to extract the desired information without the need for a conventional labeled training dataset. An important advantage of the proposed method is that the resulting clustering parameters can be reused in other problems, e.g. document layout analysis.

Keywords