New AI tool classifies the effects of 71 million “missense” mutations
Uncovering the root causes of disease is one of the greatest challenges in human genetics. With millions of possible mutations and limited experimental data, it is still largely unknown which ones could give rise to disease. This knowledge is crucial to accelerate diagnosis and develop life-saving treatments.
Today we publish a catalog of “missense” mutations where researchers can learn more about their possible effects. Missense variants are genetic mutations that can affect the function of human proteins. In some cases, they can lead to diseases such as cystic fibrosis, sickle cell disease or cancer.
The AlphaMissense catalog was developed using AlphaMissense, our new AI model that classifies missense variants. In an article published in Science, we show that it classified 89% of the 71 million possible missense variants as likely pathogenic or likely benign. In contrast, only 0.1% have been confirmed by human experts.
AI tools that can accurately predict the effect of variants have the power to accelerate research in areas ranging from molecular biology to clinical and statistical genetics. Experiments to discover disease-causing mutations are expensive and labor-intensive – each protein is unique and each experiment must be designed separately, which can take months. Using AI predictions, researchers can preview results for thousands of proteins at once, which can help prioritize resources and speed up more complex studies.
We have made all of our predictions freely available to the research community and have open-sourced the model code for AlphaMissense.
What is a missense variant?
A missense variant is a single-letter substitution in DNA that results in a different amino acid in a protein. If you think of DNA as a language, changing a letter can change a word and completely alter the meaning of a sentence. In this case, a substitution changes the translated amino acid, which can affect the function of a protein.
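To make the definition concrete, here is a minimal sketch that labels a single-letter codon change as missense, synonymous or nonsense. The four-entry codon table is a deliberately tiny excerpt of the standard genetic code, and the function name `classify_substitution` is hypothetical, chosen only for this illustration. The example change GAG to GTG (glutamate to valine) is the well-known substitution behind sickle cell disease.

```python
# Tiny excerpt of the standard genetic code (DNA codons), enough for the example.
CODON_TABLE = {
    "GAG": "Glu",
    "GAA": "Glu",
    "GTG": "Val",
    "TAG": "Stop",
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a single-letter codon change by its effect on the encoded amino acid."""
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must be exactly three letters long")
    differences = sum(a != b for a, b in zip(ref_codon, alt_codon))
    if differences != 1:
        raise ValueError("expected exactly one changed letter")

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"   # same amino acid, protein sequence unchanged
    if alt_aa == "Stop":
        return "nonsense"     # premature stop codon
    return "missense"         # a different amino acid is inserted


print(classify_substitution("GAG", "GTG"))  # Glu -> Val
print(classify_substitution("GAG", "GAA"))  # Glu -> Glu
print(classify_substitution("GAG", "TAG"))  # Glu -> Stop
```

Only the first of the three changes is a missense variant; the other two are shown for contrast.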
The average person carries more than 9,000 missense variants. Most are benign and have little or no effect, but others are pathogenic and can seriously disrupt protein function. Missense variants can be used in the diagnosis of rare genetic diseases, where a few missense variants, or even just one, can directly cause a disease. They are also important for studying complex diseases, such as type 2 diabetes, which can be caused by a combination of many different types of genetic changes.
Classifying missense variants is an important step in understanding which of these protein changes could give rise to disease. Of the more than 4 million missense variants already observed in humans, only 2% have been annotated as pathogenic or benign by experts, which is roughly 0.1% of all 71 million possible missense variants. The remainder are considered “variants of unknown significance” due to the lack of experimental or clinical data on their impact. With AlphaMissense, we now have the clearest picture yet: it classifies 89% of variants using a threshold that yielded 90% precision on a database of known disease variants.
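The relationship between the 2% and 0.1% figures can be checked with a few lines of arithmetic. The inputs below are the rounded numbers quoted in the text ("more than 4 million", "2%", "71 million"), so the result is only approximate.

```python
# Rounded figures quoted in the text; the true counts are somewhat different.
observed_variants = 4_000_000    # missense variants already observed in humans
annotated_fraction = 0.02        # share of observed variants annotated by experts
possible_variants = 71_000_000   # all possible missense variants

annotated = observed_variants * annotated_fraction
share_of_possible = annotated / possible_variants

print(f"expert-annotated variants: about {annotated:,.0f}")
print(f"share of all possible variants: about {share_of_possible:.2%}")
```

Roughly 80,000 annotated variants out of 71 million possible ones comes to a little over one tenth of a percent, which matches the figure given above.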
Pathogenic or benign: how AlphaMissense classifies variants
AlphaMissense is based on our breakthrough model AlphaFold, which predicted structures for nearly all proteins known to science from their amino acid sequences. Our adapted model can predict the pathogenicity of missense variants that alter individual amino acids in proteins.
To train AlphaMissense, we fine-tuned AlphaFold on labels distinguishing variants observed in human and closely related primate populations. Commonly observed variants are treated as benign and never-seen variants are treated as pathogenic. AlphaMissense does not predict the change in protein structure upon mutation or other effects on protein stability. Instead, it leverages databases of related protein sequences and the structural context of variants to produce a score between 0 and 1 that approximately rates the likelihood that a variant is pathogenic. The continuous score allows users to choose a threshold for classifying variants as pathogenic or benign that matches their accuracy requirements.
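As a rough illustration of choosing a threshold on a continuous score, the sketch below picks the lowest cutoff that reaches a target precision (the fraction of variants called pathogenic that really are pathogenic) on a small labelled set. The scores, labels and function names are invented for this example and do not come from AlphaMissense; the cutoffs used for the published catalog are not reproduced here.

```python
from typing import Optional, Sequence


def precision_at(scores: Sequence[float], labels: Sequence[bool],
                 threshold: float) -> Optional[float]:
    """Precision of calling every variant with score >= threshold pathogenic."""
    called = [label for score, label in zip(scores, labels) if score >= threshold]
    if not called:
        return None  # nothing was called, precision is undefined
    return sum(called) / len(called)


def lowest_threshold_for_precision(scores: Sequence[float], labels: Sequence[bool],
                                   target: float) -> Optional[float]:
    """Smallest observed score that, used as a cutoff, meets the target precision."""
    for candidate in sorted(set(scores)):
        precision = precision_at(scores, labels, candidate)
        if precision is not None and precision >= target:
            return candidate
    return None


# Toy data: True means the variant is known to be pathogenic.
scores = [0.05, 0.10, 0.30, 0.45, 0.60, 0.70, 0.85, 0.95]
labels = [False, False, False, True, False, True, True, True]

cutoff = lowest_threshold_for_precision(scores, labels, target=0.9)
calls = ["likely pathogenic" if s >= cutoff else "not called" for s in scores]

print("chosen cutoff:", cutoff)
print(calls)
```

A stricter target pushes the cutoff up and leaves more variants uncalled, which is the trade-off a user makes when matching the threshold to their own requirements.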
AlphaMissense achieves state-of-the-art predictions across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. Our tool outperformed other computational methods when used to rank variants in ClinVar, a public archive of data on the relationship between human variants and disease. Our model was also the most accurate method for predicting results from the lab, which shows it is consistent with different ways of measuring pathogenicity.
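One common way to judge how well a score ranks pathogenic variants above benign ones is the area under the ROC curve (AUROC). The text does not say which metric was used for the ClinVar comparison, so treating AUROC as the yardstick is an assumption of this sketch, and the data below are invented.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a random pathogenic variant outscores a random benign one.

    Ties count as half a win. This pairwise form is quadratic in the number of
    variants, which is fine for a small illustration.
    """
    pathogenic = [s for s, label in zip(scores, labels) if label]
    benign = [s for s, label in zip(scores, labels) if not label]
    if not pathogenic or not benign:
        raise ValueError("need at least one pathogenic and one benign variant")

    wins = 0.0
    for p in pathogenic:
        for b in benign:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(pathogenic) * len(benign))


# Toy example: one pathogenic variant (0.4) is ranked below one benign variant (0.6).
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
toy_labels = [True, True, True, False, False, False]

print(f"AUROC: {auroc(toy_scores, toy_labels):.3f}")
```

A value of 1.0 would mean every pathogenic variant is ranked above every benign one, and 0.5 corresponds to a random ordering.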
Building a community resource
AlphaMissense leverages AlphaFold to deepen the world’s understanding of proteins. A year ago we published 200 million protein structures predicted using AlphaFold, which helps millions of scientists around the world accelerate research and pave the way for new discoveries. We look forward to seeing how AlphaMissense can help solve open questions at the heart of genomics and in the biological sciences.
We have made AlphaMissense predictions freely available to the scientific community. In collaboration with EMBL-EBI, we are also making them more usable for researchers through the Ensembl Variant Effect Predictor.
In addition to our lookup table of missense mutations, we have shared expanded predictions for all 216 million possible single amino acid substitutions across more than 19,000 human proteins. We have also included the average prediction for each gene, which is similar to measuring a gene's evolutionary constraint: it indicates how essential the gene is to the survival of the organism.
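The per-gene average described above can be sketched as a simple group-and-mean over variant scores. The gene names, scores and the function name `mean_score_per_gene` are placeholders for this illustration; the layout of the released prediction files is not assumed here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_score_per_gene(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the pathogenicity scores of all variants belonging to each gene."""
    scores_by_gene = defaultdict(list)
    for gene, score in records:
        scores_by_gene[gene].append(score)
    return {gene: mean(values) for gene, values in scores_by_gene.items()}


# Placeholder (gene, score) pairs, one per possible substitution.
records = [
    ("GENE_A", 0.9),
    ("GENE_A", 0.7),
    ("GENE_B", 0.1),
    ("GENE_B", 0.3),
    ("GENE_B", 0.2),
]

gene_means = mean_score_per_gene(records)
for gene, value in sorted(gene_means.items()):
    print(f"{gene}: mean score {value:.2f}")
```

In this toy data, GENE_A tolerates substitutions poorly on average (high mean score), which is the sense in which a gene-level average resembles a measure of evolutionary constraint.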
Accelerating research into genetic diseases
A key step in translating this research is collaborating with the scientific community. We are working in partnership with Genomics England to explore how these predictions could help study the genetics of rare diseases. Genomics England cross-referenced AlphaMissense results with variant pathogenicity data previously aggregated with human participants. Their evaluation confirmed that our predictions are accurate and consistent, providing another real-world benchmark for AlphaMissense.
Although our predictions are not designed for direct clinical use – and should be interpreted alongside other lines of evidence – this work has the potential to improve the diagnosis of rare genetic diseases and help discover new disease-causing genes.
Ultimately, we hope that AlphaMissense, combined with other tools, will allow researchers to better understand diseases and develop new treatments that can save lives.