Learning Document Similarity Using Natural Language Processing

Paola Merlo; James Henderson; Gerold Schneider; Eric Wehrli

doi:10.13092/lo.17.788

Linguistik Online (Dec 2003)

DOI: https://doi.org/10.13092/lo.17.788
Journal volume & issue: Vol. 17, no. 5

Abstract

The recent considerable growth in the amount of easily available on-line text has brought to the foreground the need for large-scale natural language processing tools for text data mining. In this paper we address the problem of organizing documents into meaningful groups according to their content and of visualizing a text collection, providing an overview of the range of documents and of their relationships, so that they can be browsed more easily. We use Self-Organizing Maps (SOMs) (Kohonen 1984). Creating these maps poses considerable efficiency challenges. We study linguistically motivated ways of reducing the representation of a document to increase efficiency, as well as ways of disambiguating the words in the documents.

Published in Linguistik Online

ISSN: 1615-3014 (Online)
Publisher: Bern Open Publishing
Country of publisher: Switzerland
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing; Language and Literature: Philology. Linguistics: Language. Linguistic theory. Comparative grammar
Website: https://bop.unibe.ch/linguistik-online/
