Fuelling the Digital Chemistry Revolution with Language Models
Antonio Cardinale,
Alessandro Castrogiovanni,
Theophile Gaudin,
Joppe Geluykens,
Teodoro Laino,
Matteo Manica,
Daniel Probst,
Philippe Schwaller,
Aleksandros Sobczyk,
Alessandra Toniato,
Alain C. Vaucher,
Heiko Wolf,
Federico Zipoli
Affiliations
Antonio Cardinale
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
Alessandro Castrogiovanni
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
Theophile Gaudin
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
Joppe Geluykens
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
Teodoro Laino
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
Matteo Manica
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
Daniel Probst
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
Philippe Schwaller
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
Aleksandros Sobczyk
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
Alessandra Toniato
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
Alain C. Vaucher
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
Heiko Wolf
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland
Federico Zipoli
IBM Research Europe - Zurich, Säumerstrasse 4, Rüschlikon, CH-8803, Switzerland; National Center for Competence in Research-Catalysis (NCCR-Catalysis), Zurich, Switzerland
The RXN for Chemistry project, initiated by IBM Research Europe – Zurich in 2017, aimed to develop a series of digital assets using machine learning techniques to promote the use of data-driven methodologies in synthetic organic chemistry. This research treats chemical reaction data as language records and frames the prediction of a synthetic organic chemistry reaction as a translation task between precursor and product languages. Over the years, the IBM Research team has developed language models for various applications, including forward reaction prediction, retrosynthesis, reaction classification, atom-mapping, procedure extraction from text, and inference of experimental protocols, as well as the use of these protocols to program commercial automation hardware in an autonomous chemical laboratory. Furthermore, the project has recently incorporated biochemical data in training models for greener and more sustainable chemical reactions. The ease of constructing prediction models and continually enhancing them through data augmentation with minimal human intervention has led to the widespread adoption of language model technologies, facilitating the digitalization of chemistry in diverse industrial sectors such as pharmaceuticals and chemical manufacturing. This manuscript provides a concise overview of the scientific components that contributed to the prestigious Sandmeyer Award in 2022.