Linguistic measures of chemical diversity and the “keywords” of molecular collections

Michał Woźniak; Agnieszka Wołos; Urszula Modrzyk; Rafał L. Górski; Jan Winkowski; Michał Bajczyk; Sara Szymkuć; Bartosz A. Grzybowski; Maciej Eder

doi:10.1038/s41598-018-25440-6

Scientific Reports (May 2018)

Affiliations

Michał Woźniak: Institute of Polish Language, Polish Academy of Sciences
Agnieszka Wołos: Institute of Organic Chemistry, Polish Academy of Sciences
Urszula Modrzyk: Institute of Polish Language, Polish Academy of Sciences
Rafał L. Górski: Institute of Polish Language, Polish Academy of Sciences
Jan Winkowski: Institute of Polish Language, Polish Academy of Sciences
Michał Bajczyk: Institute of Organic Chemistry, Polish Academy of Sciences
Sara Szymkuć: Institute of Organic Chemistry, Polish Academy of Sciences
Bartosz A. Grzybowski: Institute of Organic Chemistry, Polish Academy of Sciences
Maciej Eder: Institute of Polish Language, Polish Academy of Sciences

DOI: https://doi.org/10.1038/s41598-018-25440-6
Journal volume & issue: Vol. 8, no. 1, pp. 1–10

Abstract

Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections (“corpora”), including those deposited on the Internet. Indeed, it would be hard nowadays to imagine browsing the Web without, for instance, search algorithms that extract the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry through characteristic “chemical words” that span more than traditional functional groups and instead capture the common structural fragments that molecules share. Using these words, it is possible to quantify the diversity of chemical collections and databases in new ways and to define molecular “keywords” by which such collections are best characterized and annotated.

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
