Attaining Scientific Accessibility for Machine Learning Models

Creating FAIR Data Standards

Eight years ago, in 2016, a group of scientists collaborated on an article in the Nature journal Scientific Data (1) to design and promote FAIR standards for scientific data. FAIR stands for Findability, Accessibility, Interoperability, and Reusability. These principles sought to guide a movement toward making scientific data more FAIR, especially given the huge expansion of online journals and the creation of ever more scientific data through expedited electronic tests.

Talk to a scientist from the days of tight page limits, not just for articles but for supporting information too, or go even farther back to those who had to interpret hand-drawn figures, and you will find that conveying data in an accurate form, let alone a reusable one, was not a given.

Even with greatly improved publishing technology, scientists nevertheless found themselves squinting at graphs to extract precise values from a bar the authors claimed was measured to three significant figures.

The goal of those science professionals in creating the FAIR standards was to push for data to be provided in raw, machine-readable formats, so that data could not only be conveyed accurately but also be used by other scientists to more reliably reproduce the results and, if necessary, check the authors' work. For some, this last point was initially a concern. Would they be accused of fraud if they made an honest mistake in a calculation and someone reproducing it from the raw data uncovered it? As it turned out, the answer was no. The online nature of journals had made submitting corrections much simpler, so the actual result of such standards has been higher integrity in the field.

The FAIR standards have since been widely adopted, with most journals requiring data to be provided either in supporting information documents or through download links hosted on sites like Zenodo or GitHub.

Accessibility for Artificial Intelligence and Machine Learning in Science

Fast-forward to today, and scientific fields have seen massive adoption of artificial intelligence/machine learning (AI/ML) in nearly every sub-discipline. Scientists regularly hear from grant officers that they should include AI tools in their studies to get funded, and any major public company's quarterly earnings call makes plain the pressure on companies to integrate AI.

Despite some of the negatives this pressure has produced (dilution of other relevant science, firing employees to pivot towards AI, etc.), the push has also led to some great successes in ML for science. Examples include AlphaFold (2), which enables studies that previously had to wait years for synthesis and crystal structures; new docking algorithms like DiffDock (3), which permit global docking with much faster execution times; and ProteinMPNN (4), which mutates proteins based on a 3D template while preserving the same overall fold. Certainly, the impact could not be clearer given the 2024 Nobel Prizes in Chemistry and Physics. Machine learning could not perform so well on scientific problems without the accessibility and machine readability of data, a major result of the FAIR data standards.

Yet, as scientists try to adopt and validate published ML models, we now find ourselves in a situation similar to that of the people who created the FAIR standards years ago. Common problems include: only text-based model descriptions are provided; only the untrained model architecture is provided as code, and/or the training data is not provided; the model can only be downloaded on specific computers; installation or running instructions are not provided; the model relies on software libraries without specifying which versions; or the model can only be run on expensive hardware.

It is for this reason that we must now push for FAIR model standards – especially accessibility. So, what exactly does making a model accessible entail?

  1. Models must be provided in a coded and trained form.
  2. Instructions for installation and inference (including an inference example) should be provided.
  3. Installation instructions need to detail the software versions and operating systems on which the methods have been tested.
  4. Where appropriate, training data should be open sourced so other models can be compared fairly based on training on the same data.
  5. When possible, models should not be designed such that they require hardware an average user – preferably any user – would not have access to.
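
As one illustration of point 3, a model author could ship a small script that records the environment a model was tested on. The sketch below is a minimal example using only the Python standard library; the function names and the package names passed in are hypothetical and not part of any published model or of BMaps.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect the OS, Python version, and installed versions of the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of failing, so the report is always complete.
            versions[name] = "not installed"
    return {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "packages": versions,
    }


def format_report(env):
    """Render the environment description as plain text for a README or log file."""
    lines = [f"OS: {env['os']}", f"Python: {env['python']}"]
    for name, version in sorted(env["packages"].items()):
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical dependency list; replace with the libraries a model actually needs.
    print(format_report(describe_environment(["numpy", "torch"])))
```

Pasting the output of such a script into the installation instructions tells readers exactly which combination of operating system and library versions was tested.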

BMaps – An Accessible Platform By Design – Integrates Validated ML Tools

At Conifer Point, we are using our fragment-based drug design platform, BMaps, to further the accessibility of models that we have been able to independently validate and that we think would be useful for the community. By providing hosted access to these tools with links to the documentation, we make these models accessible according to the five principles above, and users gain easy access to these powerful technologies.

Already integrated is DiffDock (3), an ML global docking methodology based on diffusion techniques that enables quick docking in about 1 minute. The authors of DiffDock exemplify the accessibility goals for models outlined above. Another ML model coming soon to BMaps is GiFE (5), a molecular-size-agnostic linear function for the prediction of quantum mechanical Gibbs free energies. This new functionality will permit users to predict binding free energies of fully solvated protein-ligand complexes at density functional theory-level accuracy in force-field times. A preprint has already been released, with a publication and GitHub repo coming after full release within BMaps.

With new AI-powered features coming soon that will make BMaps even easier to use for everyone from a first-time to a veteran user, we are thrilled to offer a highly accessible web-based platform where traditional computational chemistry and ML models are all easily accessed and utilized to design improved medicines.

Conifer Point would be thrilled to partner with scientists who wish to make their models accessible by integrating them into BMaps. Reach out to info@coniferpoint.com to learn more!

(1) Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L. B.; Bourne, P. E.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 2016, 3, 1–9.

(2) Yang, Z.; Zeng, X.; Zhao, Y.; Chen, R. AlphaFold2 and its applications in the fields of biology and medicine. Signal Transduction and Targeted Therapy 2023, 8, 115.

(3) Corso, G.; Stärk, H.; Jing, B.; Barzilay, R.; Jaakkola, T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. 2023; https://arxiv.org/abs/2210.01776.

(4) Dauparas, J.; Anishchenko, I.; Bennett, N.; Bai, H.; Ragotte, R. J.; Milles, L. F.; Wicky, B. I.; Courbet, A.; de Haas, R. J.; Bethel, N.; et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022, 378, 49–56.

(5) Freeze, J.; Batista, V. GiFE: A Molecular-Size Agnostic and Understandable Gibbs Free Energy Function. ChemRxiv 2023.

These concepts were first presented in November 2023 at the Molecular Machine Learning Conference at MIT Jameel Clinic

DiffDock Now Implemented In Boltzmann Maps

DiffDock Diffusion-Based AI Model Available for Quick Protein-Ligand Docking in Boltzmann Maps

Seeking to democratize the latest tooling for computational drug discovery, the Boltzmann Maps team is proud to announce the integration of the AI model DiffDock [1]. On the PDBBind ligand docking task, DiffDock achieved a 38% top-1 success rate, meaning the top-ranked pose fell within 2 Å RMSD of the crystal ligand pose. This outperformed traditional Glide docking at 23% and other leading deep learning methods at 20% [1]. Furthermore, docking runs take less than a minute for most protein-ligand combinations, transforming the possibilities for traditional computational workflows.
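
To make the success-rate metric concrete, here is a minimal sketch of how a top-1 success rate at a 2 Å RMSD cutoff could be computed. It is an illustration only, not the DiffDock authors' evaluation code: it assumes the predicted and crystal ligand atoms are listed in the same order, and it ignores molecular symmetry. The function names and the toy coordinates are hypothetical.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation (in Å) between two equally ordered lists of (x, y, z) atoms."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(pred, ref))
    return math.sqrt(squared / len(pred))


def top1_success_rate(cases, threshold=2.0):
    """Fraction of complexes whose top-ranked pose lies within `threshold` Å of the crystal pose.

    `cases` is a list of (top_ranked_pose, crystal_pose) coordinate pairs.
    """
    hits = sum(1 for top_pose, crystal in cases if rmsd(top_pose, crystal) < threshold)
    return hits / len(cases)


# Toy two-atom "ligand": one pose shifted by 0.5 Å (a hit) and one by 3 Å (a miss).
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near_pose = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]
far_pose = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]

rate = top1_success_rate([(near_pose, crystal), (far_pose, crystal)])
print(f"top-1 success rate: {rate:.0%}")
```

A real benchmark would typically use a cheminformatics toolkit for a symmetry-corrected RMSD, since equivalent atoms can swap positions without changing the pose.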

Visualization of DiffDock results in BMaps

Features

Available in BMaps as an additional option alongside our previously implemented AutoDock Vina capabilities, DiffDock in Boltzmann Maps now supports:

  • Fast rigid, full-protein docking of ligands
  • Visualization of up to the 10 best poses for each docked compound
  • Energy minimization and scoring of docked poses with calculated physicochemical properties

Use DiffDock Today!

Try DiffDock for free today:

  1. Log into Boltzmann Maps
  2. Bring in a protein and a compound
  3. Use options in the compound menu to dock.

Then, you can follow up your docking runs of multiple compounds by minimizing the docked geometries and calculating the energy score for each pose to discover which compound binds best to your protein of interest! Additionally, you can compare results from AutoDock Vina and hot spot analysis side by side to gain confidence in your results through the use of complementary methods.

Next Steps

This is the first post in a series of blogs on DiffDock.
See the next blog post for a full tutorial on using this tool in BMaps.

For greater detail about the theory and implementation of DiffDock, we recommend the original publication and the associated GitHub repository. Stay tuned for an upcoming blog post breaking down this theory.

[1] Corso, Gabriele, et al. “DiffDock: Diffusion steps, twists, and turns for molecular docking.” arXiv preprint arXiv:2210.01776 (2022).

Need help with your computational chemistry and biology tasks? Conifer Point, maker of Boltzmann Maps, proudly offers CRO services.

AlphaFold AI-generated protein structure to empower Boltzmann Maps Fragment-based Drug Design

Since its first public test in 2018, AlphaFold has made great strides in providing the scientific community with highly accurate AI-generated protein structure predictions. A recent release of the AlphaFold database contains over 200 million entries and boasts broad coverage of the UniProt protein sequence and annotation repository. Boltzmann Maps puts the power of the AlphaFold database directly in the hands of users and allows for advanced analysis of protein structures.


Pharmacophore screening using Pharmit in Boltzmann Maps

Boltzmann Maps is pleased to introduce an integration with Pharmit as an option for pharmacophore screening. Pharmit is a search tool for finding small molecule inhibitors that bind to a target of interest. The tool searches libraries for compounds with desired features in the right geometry. The Boltzmann Maps integration allows the user to send a protein-ligand system from BMaps to Pharmit for a search based on the compound’s features or other user-specified features. Pharmit’s nine built-in libraries include almost 250M compound entries, and the 1,059 publicly accessible user-contributed libraries contain another 45M entries.

Pharmit can be accessed via the Export button on the bottom right of the BMaps web app.


100% PDB Availability and Automation of Protein Preparation

With the new release of Boltzmann Maps comes enhanced reliability for protein structure loading and automation of protein preparation for energy minimization, docking and fragment simulations. The entirety of the Protein Data Bank (PDB) is now available to view in Boltzmann Maps. 

As an example, log into BMaps to view PDB ID 3n7h: https://www.boltzmannmaps.com/structure/3n7h. The PDB featured this mosquito odorant binding protein in complex with DEET (DE3 ligand) as one of the “Molecules of the Month” for June 2023.


Compound Energy Minimization with OpenMM

The Boltzmann Maps web app now employs GPU-accelerated OpenMM software for compound energy minimization in the context of a protein. The reported energies include van der Waals and electrostatic energies between compound and protein, as well as the change of a compound’s internal energies between the unbound and bound configurations (stress). These energy reports are a key metric for evaluating and comparing compounds and modifications. OpenMM integration allows Boltzmann Maps to provide this data with improved quality and speed.
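
The relationship between these energy terms can be sketched in a few lines. This is an illustrative data structure, not the BMaps or OpenMM implementation: the class and function names are hypothetical, the numbers are made up, and treating the overall score as a plain sum of the three terms is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyReport:
    """Energy terms for one minimized compound pose (units assumed to be kcal/mol)."""

    vdw: float            # compound-protein van der Waals energy
    electrostatic: float  # compound-protein electrostatic energy
    stress: float         # compound internal energy, bound minus unbound

    @property
    def interaction(self):
        """Compound-protein interaction energy: van der Waals plus electrostatic."""
        return self.vdw + self.electrostatic

    @property
    def total(self):
        """Assumed overall score: interaction energy plus the stress penalty."""
        return self.interaction + self.stress


def make_report(vdw, electrostatic, internal_bound, internal_unbound):
    """Build a report, deriving stress from the bound and unbound internal energies."""
    return EnergyReport(vdw, electrostatic, internal_bound - internal_unbound)


# Hypothetical values for a single pose.
report = make_report(vdw=-30.0, electrostatic=-12.5, internal_bound=8.0, internal_unbound=5.0)
print(report.interaction, report.stress, report.total)
```

A positive stress means the compound has to strain away from its preferred unbound geometry to fit the pocket, which offsets part of the favorable interaction energy when comparing compounds.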

OpenMM is an open-source toolkit for molecular simulation. It is highly flexible with its custom functions and has high performance, especially on recent GPUs. More information can be found at: https://openmm.org.


Starting structure-based design with sample compounds

Structure-based drug design starts with a compound positioned on the surface of a protein. Often, the crystal ligand in a PDB structure provides the natural starting point. But what if there is no ligand? This question took on increased urgency as the Boltzmann Maps team prepared fragment simulations on COVID-19 structures; many of the initial entries in the PDB did not have ligands. In response to this, Boltzmann Maps now provides Sample Compounds. We docked 20K+ small molecules from our libraries against hotspots on each structure and selected a handful to show in BMaps. The resulting compounds are generally commercially available, chemically diverse, and are reasonable starting points for structure-based design. They can then be used to explore new designs, using BMaps’ energy analysis and fragment data.

Sample Compounds are available for almost all of the SARS-CoV-2 structures in Boltzmann Maps. More are coming!

View a COVID-19 structure with sample compounds now, or log in to start exploring your own structure-based design modifications.


New: Fragment maps for 10 SARS-CoV-2 proteins

Fragment maps for ten SARS-CoV-2 proteins are now available in the BMaps web app to accelerate the design of COVID-19 therapeutics. These structures include the main protease (NSP5), the Spike protein (S), the receptor binding domain (RBD) of the S protein, and several non-structural (NS) proteins: NSP3, NSP9, NSP10, NSP15, and NSP16. Available for each protein are druggability sites, water molecule maps, and a starting set of 117 chemical fragment binding maps.


Fragment Maps for Coronavirus (6LU7) Available

Start designing your COVID-19 protease inhibitors using BMaps with an expanded set of 221 fragments. The example below started with a benzimidazole fragment, then grew to a benzene-CF3 via intermediate linkers. To get started with your own possibilities, view the BMaps-prepared 6LU7 structure. Stay tuned for more coronavirus structures currently in BMaps fragment simulation (6LVN, 6VSB).

To learn more about the coronavirus, visit https://www.cdc.gov/coronavirus/2019-ncov/ or https://en.wikipedia.org/wiki/Coronavirus_disease_2019.

Image of an example coronavirus protease inhibitor, assembled from Boltzmann Maps simulated fragment maps.

Latest: More fragment grow results & “Rings-in-Drugs”

The latest BMaps update supports a broader set of fragment linking options when growing with fragments, significantly expanding the opportunities for compound modifications. By accessing more fragments in new sub-pockets, the new linking features provide many new ideas for improvements to your compounds, ranked by fragment binding scores. In addition to simple bond linking, there are now:

  • methylene linking (–CH2–, a single carbon);
  • ethane linking (–CH2–CH2–, 2 carbons);
  • acetylene linking (–C≡C–, 2 carbons connected by a triple bond).
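
To illustrate what these linkers mean chemically, the sketch below joins two fragments written as SMILES strings by inserting the linker between them. It is a simplified illustration, not how BMaps grows compounds: the function and dictionary names are hypothetical, and plain string concatenation only works when the attachment atom is the last atom written in the left fragment and the first atom written in the right fragment.

```python
# SMILES text for each linker; an empty string means a direct bond.
LINKERS = {
    "bond": "",
    "methylene": "C",     # –CH2–
    "ethane": "CC",       # –CH2–CH2–
    "acetylene": "C#C",   # –C≡C–
}


def link_fragments(left, right, linker="bond"):
    """Join two SMILES fragments end to start, optionally through a named linker."""
    if linker not in LINKERS:
        raise ValueError(f"unknown linker: {linker!r}")
    return left + LINKERS[linker] + right


# Example: connect a benzene ring to a CF3 group through each linker type.
for name in LINKERS:
    print(name, link_fragments("c1ccccc1", "C(F)(F)F", name))
```

In practice, a cheminformatics toolkit would be used to validate the resulting structure and to handle attachment points that are not at the ends of the SMILES string.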
