Space Efficient Data Structures for N-gram Retrieval

Fotios Kounelis; Christos Makris

doi:10.3934/medsci.2017.4.426

AIMS Medical Science (Oct 2017)

Space Efficient Data Structures for N-gram Retrieval

Fotios Kounelis,
Christos Makris

Affiliations

Fotios Kounelis: Department of Computer Engineering and Informatics, University of Patras, 26500 Greece
Christos Makris: Department of Computer Engineering and Informatics, University of Patras, 26500 Greece

DOI: https://doi.org/10.3934/medsci.2017.4.426
Journal volume & issue: Vol. 4, no. 4
pp. 426 – 440

Abstract

Read online

A significant problem in computer science is the management of large data strings and a great number of works dealing with the specific problem has been published in the scientific literature. In this article, we use a technique to store efficiently biological sequences, making use of data structures like suffix trees and inverted files and also employing techniques like n-grams, in order to improve previous constructions. In our attempt, we drastically reduce the space needed to store the inverted indexes, by representing the substrings that appear more frequently in a more compact inverted index. Our technique is based on n-gram indexing, providing us the extra advantage of indexing sequences that cannot be separated in words. Moreover, our technique combines classical one level with two-level n-gram inverted file indexing. Our results suggest that the new proposed algorithm can compress the data more efficiently than previous attempts.

Published in AIMS Medical Science

ISSN: 2375-1576 (Online)
Publisher: AIMS Press
Country of publisher: United States
LCC subjects: Medicine: Medicine (General)
Website: http://www.aimspress.com/journal/medicalScience

About the journal

Abstract

Keywords