Distributed Web-Scale Infrastructure For Crawling, Indexing And Search With Semantic Support

Stefan Dlugolinsky; Martin Seleng; Michal Laclavik; Ladislav Hluchy

doi:10.7494/csci.2012.13.4.5

Computer Science (Jan 2012)

Affiliations

Stefan Dlugolinsky: Institute of Informatics, Slovak Academy of Sciences, Bratislava
Martin Seleng: Institute of Informatics, Slovak Academy of Sciences, Bratislava
Michal Laclavik: Institute of Informatics, Slovak Academy of Sciences, Bratislava
Ladislav Hluchy: Institute of Informatics, Slovak Academy of Sciences, Bratislava

DOI: https://doi.org/10.7494/csci.2012.13.4.5
Journal volume & issue: Vol. 13, no. 4, p. 5

Abstract

In this paper, we describe our work in progress in the scope of web-scale information extraction and information retrieval utilizing distributed computing. We present a distributed architecture built on top of the MapReduce paradigm for information retrieval, information processing and intelligent search supported by spatial capabilities. The proposed architecture is focused on crawling documents in several different formats, information extraction, lightweight semantic annotation of the extracted information, indexing of the extracted information and, finally, indexing of documents based on the geo-spatial information found in a document. We demonstrate the architecture on two use cases, the first being search in job offers retrieved from the LinkedIn portal and the second being search in BBC news feeds, and we discuss several problems we had to face during the implementation. We also discuss spatial search applications for both cases, because both LinkedIn job offer pages and BBC news feeds contain a lot of spatial information to extract and process.
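
The pipeline outlined in the abstract (extract spatial mentions, annotate them, index documents by location) can be illustrated with a minimal single-process sketch of the MapReduce pattern. This is not the authors' implementation: the gazetteer, its approximate coordinates, the document identifiers and all function names below are hypothetical and chosen only for illustration, and a real deployment would run the map and reduce phases on a distributed framework.

```python
from collections import defaultdict

# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
# Values are illustrative assumptions, not data from the paper.
GAZETTEER = {
    "london": (51.5074, -0.1278),
    "bratislava": (48.1486, 17.1077),
}


def map_phase(doc_id, text):
    """Map step: emit a (place, doc_id) pair for each known place in the text."""
    tokens = {token.strip(".,;:!?").lower() for token in text.split()}
    for place in sorted(tokens & set(GAZETTEER)):
        yield place, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(place, doc_ids):
    """Reduce step: build one geo-annotated index entry per place."""
    lat, lon = GAZETTEER[place]
    return {"place": place, "lat": lat, "lon": lon, "docs": sorted(set(doc_ids))}


def build_geo_index(docs):
    """Run map, shuffle and reduce over a dict of {doc_id: text}."""
    pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    return {place: reduce_phase(place, ids) for place, ids in shuffle(pairs).items()}


if __name__ == "__main__":
    sample = {
        "job-1": "Software engineer in London.",
        "news-7": "Talks held in Bratislava and London",
    }
    for entry in build_geo_index(sample).values():
        print(entry)
```

Keying the index by place rather than by document is one possible design choice: it lets a spatial query look up a location first and then fetch the matching documents, at the cost of a separate structure for full-text terms.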

Published in Computer Science

ISSN: 1508-2806 (Print); 2300-7036 (Online)
Publisher: AGH University of Science and Technology Press
Country of publisher: Poland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journals.agh.edu.pl/csci
