It took decades of painstaking research to map the structure of just 17 per cent of the proteins used within the human body, but less than a year for UK-based AI company DeepMind to raise that figure to 98.5 per cent. The company is making all this data freely available, which could lead to rapid advances in the development of new drugs.
Determining the complex, crumpled shape of a protein from the sequence of amino acids that make it up has been a huge scientific hurdle. Some amino acids are attracted to others, some are repelled by water, and the chains form intricate shapes that are hard to calculate accurately. Understanding these structures enables the design of new, highly targeted drugs that bind to specific parts of proteins.
Genetic research has long provided the ability to determine the sequence of a protein, but an efficient way of finding the shape – crucial to understanding its properties – has proven elusive. Although supercomputers and distributed computing projects have been applied to the problem, they have failed to make significant progress.
DeepMind published research last year that proved that AI can solve the problem quickly. Its AlphaFold neural network was trained on sections of previously solved protein shapes and learned to deduce the structure of new sequences, which were then checked against experimental data.
Since then, the company has been applying and refining the technology to thousands of proteins, beginning with the human proteome, proteins relevant to covid-19 and others that will most benefit immediate research. It is now releasing the results in a database created in partnership with the European Molecular Biology Laboratory.
DeepMind has mapped the structure of 98.5 per cent of the 20,000 or so proteins in the human body. For 35.7 per cent of these, the algorithm reported a confidence of more than 90 per cent in its predicted shape.
The company has released more than 350,000 protein structure predictions in total, including those for 20 additional model organisms that are important for biological research, from Escherichia coli to yeast. The team hopes that within months it can add almost every sequenced protein known to science – more than 100 million structures.
John Moult at the University of Maryland says the rise of AI in the area of protein folding has been a “profound surprise”.
“It’s revolutionary in a sense that’s hard to get your head around,” he says. “If you’re working on some rare disease and you never had a structure, now you’ll be able to go and look at structural information which was basically very, very hard or impossible to get before.”
Demis Hassabis, chief executive and founder of DeepMind, says that AlphaFold – which is composed of around 32 separate algorithms and has been made open source – is now solving protein shapes in minutes or, in some cases, seconds using hardware no more sophisticated than a standard graphics card.
“It takes one [graphics processing unit] a few minutes to fold one protein, which of course would have taken years of experimental work,” he says. “We’re just going to put this treasure trove of data out there. It’s a little bit mind blowing in a way because going from the breakthrough of creating a system that can do that to actually producing all the data has only been a matter of months. We hope it’s going to become a sort of standard tool that all biologists around the world use.”
The team also added a confidence measure to all structure predictions, which Hassabis says he felt was vital given that the results will be the basis for research efforts. Hassabis believes that the lower confidence scores for some human proteins could be down to errors in the sequence or perhaps “something intrinsic about the biology”, such as proteins that are inherently disordered or unpredictable. The remaining 1.5 per cent of the human proteome, for which no structure has been published, consists of proteins with sequences longer than 2700 segments, which were excluded for the time being to minimise runtime.