Journal of Cheminformatics (Oct 2023)

Bloom filters for molecules

  • Jorge Medina,
  • Andrew D. White

DOI
https://doi.org/10.1186/s13321-023-00765-1
Journal volume & issue
Vol. 15, no. 1
pp. 1 – 6

Abstract

Ultra-large chemical libraries are reaching tens to hundreds of billions of molecules. A challenge for these libraries is to efficiently check whether a proposed molecule is present. Here we propose and study Bloom filters for testing whether a molecule is present in a set, using either string or fingerprint representations. Bloom filters are compact enough to hold billions of molecules in just a few GB of memory and to check membership in under a millisecond. We found that string representations can achieve a false positive rate below 1% and require significantly less storage than fingerprints. Bloom filters over canonical SMILES, using the simple FNV (Fowler-Noll-Vo) hash function, provide fast and accurate membership tests with small memory requirements. A general implementation, together with specific filters for detecting whether a molecule is purchasable, patented, or a natural product according to existing databases, is available at https://github.com/whitead/molbloom.
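
To make the idea concrete, the following is a minimal sketch of a Bloom filter over SMILES strings with an FNV-style hash. It is an illustration only, not the molbloom implementation: the class `BloomFilter`, the helpers `fnv1a_64` and `optimal_size`, the seed constant, and the double-hashing scheme are all assumptions introduced here, and the SMILES strings are assumed to have been canonicalized beforehand.

```python
import math

# Standard 64-bit FNV-1a constants.
FNV_OFFSET = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes, seed: int = 0) -> int:
    """64-bit FNV-1a hash; the seed (an assumption of this sketch) perturbs the offset basis."""
    h = FNV_OFFSET ^ seed
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h


def optimal_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Textbook Bloom filter sizing: number of bits m and number of hashes k."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


class BloomFilter:
    """Hypothetical minimal Bloom filter keyed on (already canonical) SMILES strings."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, smiles: str):
        data = smiles.encode("utf-8")
        h1 = fnv1a_64(data)
        # Second hash forced odd so the double-hashing stride is never zero.
        h2 = fnv1a_64(data, seed=0x9E3779B97F4A7C15) | 1
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, smiles: str) -> None:
        for p in self._positions(smiles):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, smiles: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(smiles))


# Tiny usage example with three molecules.
library = BloomFilter(n_bits=10_000, n_hashes=7)
for smi in ["CCO", "c1ccccc1", "CC(=O)O"]:
    library.add(smi)

# Sizing for one billion molecules at a 1% false positive rate.
bits_needed, hashes_needed = optimal_size(1_000_000_000, 0.01)
```

A Bloom filter never yields a false negative, so every added molecule tests as present, while an absent molecule is reported as present only with the configured false positive probability. Under the textbook sizing formula, one billion entries at a 1% false positive rate need roughly 9.6 bits per entry, or about 1.2 GB, which is in line with the memory footprint described above.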

Keywords