Creating FAIR Data Standards
In 2016, a group of scientists collaborated on an article in Scientific Data1 to design and promote FAIR standards for scientific data. FAIR stands for Findability, Accessibility, Interoperability, and Reusability. These principles sought to guide a movement toward making scientific data more FAIR, especially given the huge expansion of online journals and the generation of ever more scientific data through rapid electronic experiments.
Talk to a scientist from the days of tight page limits, not just for articles but for supporting information, or go further back to those who had to interpret hand-drawn figures, and you will find that conveying data accurately, let alone in a reusable form, was not a given.
Even with greatly improved publishing technology, scientists nevertheless found themselves squinting at graphs, trying to extract precise values from a bar the authors claimed was measured to three significant figures.
The goal of those scientists in creating the FAIR standards was to push for data to be provided in machine-readable and raw formats, so that data could not only be conveyed accurately but also be used by other scientists to reproduce the results more reliably and, if necessary, check the authors' work. For some, this last point was initially a concern: would they be accused of fraud if they made an honest mistake in a calculation and someone reproducing it from the raw data uncovered it? In practice, the answer turned out to be no. The online nature of journals has made submitting corrections much simpler, and the actual result of such standards has been higher integrity in the field.
The FAIR standards have since been widely adopted, with most journals requiring data to be provided either in supporting information documents or through download links hosted on sites like Zenodo or GitHub.
Accessibility for Artificial Intelligence and Machine Learning in Science
Fast-forward to today: scientific fields have seen massive adoption of artificial intelligence/machine learning (AI/ML) in nearly every sub-discipline. Scientists regularly hear from grant officers that they should incorporate AI tools into their studies to get funded, and one need only listen to any major public company's quarterly earnings call to hear the pressure on companies to integrate AI.
Despite some of the negatives this pressure has led to (dilution of other relevant science, firing employees to pivot toward AI, etc.), this push has also produced some great successes in the ML-for-science space. These range from AlphaFold2, which enables studies that previously had to wait years for synthesis and crystal structures, to new docking algorithms like DiffDock3, which performs global docking with much faster execution times, to ProteinMPNN4, which performs template-based sequence redesign of proteins while preserving the overall fold. Certainly, the impact could not be clearer given the 2024 Nobel Prizes in Chemistry and Physics. Machine learning could not perform so well on scientific problems without the accessibility and machine readability of data, a major result of the FAIR data standards.
Yet, as scientists try to adopt and validate published ML models, we now find ourselves in a situation similar to the one that motivated the FAIR standards years ago. Common problems include: only text-based model descriptions are provided; the untrained model architecture is provided as code and/or the data is not provided; the model can only be downloaded on specific computers; installation or running instructions are missing; the model relies on software libraries without specified versions; or the model can only be run on expensive hardware.
It is for this reason that we must now push for FAIR model standards – especially accessibility. So, what exactly does making a model accessible entail?
- Models must be provided in a coded and trained form.
- Instructions for installation and inference (including an inference example) should be provided.
- Installation instructions need to detail the versions and OSes on which the methods have been tested.
- Where appropriate, training data should be open-sourced so other models can be compared fairly when trained on the same data.
- When possible, models should not require hardware that an average user, and preferably any user, lacks access to.
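The third principle above, documenting tested versions and operating systems, is easy to automate. As a minimal sketch (the package names below are placeholders; substitute your model's actual dependencies), a model author could generate a version report for their installation instructions directly from the environment the model was tested in:

```python
# Minimal sketch: record the environment a model was tested on,
# so installation instructions can list concrete versions and OSes.
# The package names passed in below are placeholders, not a required set.
import platform
from importlib import metadata


def environment_report(packages):
    """Return a dict of the tested OS, Python, and package versions."""
    report = {
        "os": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
    }
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            # Missing dependencies are flagged rather than silently skipped.
            report[pkg] = "not installed"
    return report


if __name__ == "__main__":
    # Example dependencies; replace with your model's own requirements.
    for key, value in environment_report(["numpy", "torch"]).items():
        print(f"{key}: {value}")
```

Pasting this report into a README, alongside a pinned requirements file, tells users exactly which configurations the authors have verified.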
BMaps – An Accessible Platform By Design – Integrates Validated ML Tools
At Conifer Point, we are using our fragment-based drug design platform, BMaps, to further the accessibility of models we have independently validated and believe would be useful for the community. By providing a hosted solution for accessing these tools, with links to their documentation, we make these models accessible according to the five principles above, and users gain easy access to these powerful technologies.
Already integrated is DiffDock3, an ML global docking methodology based on diffusion techniques that enables quick docking in about one minute. The authors of DiffDock exemplify the accessibility goals for models outlined above. Another ML model coming soon to BMaps is GiFE5, a molecular-size-agnostic linear function for the prediction of quantum mechanical Gibbs free energies. This new functionality will let users predict binding free energies of fully solvated protein–ligand complexes at density functional theory (DFT) level accuracy in force-field times. A preprint has already been released, with a publication and GitHub repo coming after full release within BMaps.
With new AI-powered features coming soon that will make BMaps even easier to use for everyone from first-time to veteran users, we are thrilled to offer a highly accessible web-based platform where traditional computational chemistry and ML models can all be easily accessed and used to design improved medicines.
Conifer Point would be delighted to partner with scientists who wish to make their models accessible by integrating them into BMaps. Reach out to info@coniferpoint.com to learn more!
(1) Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L. B.; Bourne, P. E.; others The FAIR Guiding Principles for scientific data management and stewardship. Scientific data 2016, 3, 1-9.
(2) Yang, Z.; Zeng, X.; Zhao, Y.; Chen, R. AlphaFold2 and its applications in the fields of biology and medicine. Signal Transduction and Targeted Therapy 2023, 8, 115.
(3) Corso, G.; Stärk, H.; Jing, B.; Barzilay, R.; Jaakkola, T. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. 2023; https://arxiv.org/abs/2210.01776.
(4) Dauparas, J.; Anishchenko, I.; Bennett, N.; Bai, H.; Ragotte, R. J.; Milles, L. F.; Wicky, B. I.; Courbet, A.; de Haas, R. J.; Bethel, N.; others Robust deep learning–based protein sequence design using ProteinMPNN. Science 2022, 378, 49–56.
(5) Freeze, J.; Batista, V. GiFE: A Molecular-Size Agnostic and Understandable Gibbs Free Energy Function. ChemRxiv 2023.
These concepts were first presented in November 2023 at the Molecular Machine Learning Conference at the MIT Jameel Clinic.