AlphaFold 2 is here: what’s behind the structure prediction miracle (www.blopig.com)
241 points by couteiral | 2021-07-20 | 98 comments




> Like most bioinformatics programs, AlphaFold 2 comes equipped with a “preprocessing pipeline”, which is the discipline’s lingo for “a Bash script that calls some other codes”.

Requiring bioinformatics people to stray a long way from their core competency and learn a scripting language from the '80s to write glue code seems... suboptimal. How many hours of expert time have been wasted figuring out how to split a string in Bash?

Can we software people build a better tool to eliminate the need for this?


I think the tools already exist; it's just that people are conservative with their choices.

(Obligatory xkcd; you know the one)

Not really?


MLOps is a pretty hot area right now. The industry is trying to figure out how to engineer these things in a robust way, but it's not ubiquitous at all yet. There are lots of tools that wrap K8s and help you train models, but for doing DataOps in a robust way... I haven't seen the definitive answer yet.

Seriously, please tell me if you are founding this company so I can invest.


Pretty much anyone who works with data has to clean it, and learning to use the tools to do that is important. Whether that's Bash, Perl, R, Python, ... doesn't really matter that much. If they already know Bash, then Bash is a good tool, since they can now focus on their data instead of wasting time learning a new tool to do the same thing.

It's sad. Bio people really need friends in software to boost their research. Unfortunately, they do a Codecademy Python course for 5 minutes and try to get their projects going. Sometimes they succeed, sometimes they fail. But they don't really have much time to dedicate to properly learning software dev, and it's not what they are into anyway; it's a necessity.

I think we could create something like a GitHub of bio projects that need help, where people assist in the hope of getting their names on a paper.


I think DeepMind went one step further and solved the entire problem for them. They don't even have to touch their keyboards anymore

Sure, there are better alternatives, but the advantage of Bash / shell scripting is that it's very easy to glue a whole collection of tools together, and that expertise in this transfers well between domains.

They probably could have achieved the same by invoking things in Python, but it would have been slower and not achieved a lot, other than “not using shell scripts”.

And once you go down the path of optimizing this enough, you’ll end up reinventing shell scripts altogether.
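
To make the verbosity gap concrete, here's a minimal sketch (hits.txt is a hypothetical input file): the one-line shell pipeline versus the same pipeline plumbed by hand in Python.

  import subprocess

  # Shell glue, one line:  sort hits.txt | uniq -c
  # The same two-stage pipeline spelled out explicitly:
  sort = subprocess.Popen(["sort", "hits.txt"], stdout=subprocess.PIPE)
  uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout, stdout=subprocess.PIPE)
  sort.stdout.close()  # let sort receive SIGPIPE if uniq exits early
  out, _ = uniq.communicate()
  print(out.decode())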


Well, AlphaFold 2 generates MSA by invoking things in Python: https://github.com/deepmind/alphafold/blob/main/alphafold/da.... So the article is actually mistaken on this point.
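
For reference, the pattern there is a thin Python wrapper that shells out to the search tool. A minimal sketch (the jackhmmer flags below are illustrative assumptions, not AlphaFold's actual invocation):

  import subprocess

  def run_jackhmmer(query_fasta: str, database: str, output_sto: str) -> None:
      # Illustrative flags, not AlphaFold's exact command line.
      cmd = [
          "jackhmmer",
          "-A", output_sto,  # save the multiple sequence alignment
          "--noali",         # keep stdout small
          query_fasta,
          database,
      ]
      subprocess.run(cmd, check=True)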

From looking at the code, Bash looks pretty clean.

I also use Bash and AWK for preprocessing a lot.


I used to be that guy as well, till a colleague convinced me that anything I can do in Bash or AWK I could probably do more easily in Perl. Then everyone sort of drifted to Python. I get that if you never used Perl it's pointless to learn it if you're already in the Python stack, but... damn, Perl's regular expressions and how they're so baked into the syntax of the language make using regex in Python seem like going back to the Stone Age.
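
To make the comparison concrete, a small sketch (the Perl appears only as a comment): Perl's match operator is part of the language's syntax, while Python needs an import and explicit method calls.

  import re

  line = "sample_42 scored 0.93"

  # Perl:  if ($line =~ /(\d+)/) { print $1; }
  m = re.search(r"(\d+)", line)
  if m:
      print(m.group(1))  # -> 42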

Bash is easier to explain and use than, e.g., teaching people how to use Python's subprocess module to launch different apps, capture their output, etc.

I find it astonishing how bad Python is as a Bash replacement.

I'd often rather write an argument parser in Bash than use Python if I have to invoke a bunch of commands.


Python is bad, but bash is worse as soon as you need any kind of logic.

> teaching people how to use Python's subprocess module to launch different apps, capture their output

There's no shame in using `os.system`.


Well, technically subprocess.check_output if you need to capture the output.
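
A quick sketch of the difference, using nothing beyond the standard library:

  import os
  import subprocess

  # os.system: runs via the shell and returns only an exit status;
  # the command's output goes straight to the terminal.
  status = os.system("ls -l /tmp")

  # subprocess.check_output: returns the command's stdout as a string
  # and raises CalledProcessError on a non-zero exit.
  listing = subprocess.check_output(["ls", "-l", "/tmp"], text=True)
  print(listing)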

There is a growing trend to include Docker (or Singularity, which is more compatible with the HPC architectures common in bioinformatics) images alongside the code. In particular, AlphaFold 2 does provide a Dockerfile, and they even include a Python "launcher script" hiding all the details of running the code.

Sadly, this is very uncommon in the community. In a bioinformatics meeting, the sentence "I spent X days setting up Y software" will not raise many eyebrows.


I work in power systems, and the situation is similar. Maybe worse, because paper authors often come up with new computational techniques but don't implement them in code (much less code with a Dockerfile).

I look with much jealousy over at the computer science field where papers often include code, multiple versions under version control, automated tests, setup/docker scripts, and demonstration workflows and interfaces.


> Can we software people build a better tool to eliminate the need for this?

Most probably not. Bash is currently the sweet spot; it is actually the best tool for this job. Any other option comes with increased complexity and will make the whole thing less stable.


> How many hours of expert time have been wasted figuring out how to split a string in Bash?

Probably a good many more than would be needed to learn how to split a string in Python.
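
For comparison, a minimal sketch, one line in each language; the Bash equivalent (one of several, which is rather the point) appears as a comment.

  # Bash:  IFS=',' read -ra fields <<< "$line"
  line = "gene,chromosome,start,end"
  fields = line.split(",")
  print(fields)  # ['gene', 'chromosome', 'start', 'end']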


Among the bioinformatics folks I know, bash is already a core competency. If you’re using your average biologist as your mental model, you’re thinking of the wrong people.

I don't see this as a bad thing. Shell scripts are a great way to prototype text processing pipelines using multiple smaller components.

Awesome.

I wrote a thesis on protein structure prediction in 1995. We weren't very good at it then. Amazing to see this.


I remember that the scientific game Foldit was also quite exciting when it came out 12-13 years ago or so, since populations of players could get results beyond what either specialists or computer systems could achieve. I guess one could argue that an AI such as this could be compared to a large automated population of trained players trying to solve a 3D puzzle.

I did some undergraduate research on this around 1999. At the time we were trying to prove that we could throw more firepower at the problem by building a Beowulf cluster. After a bit of tweaking, we were able to get more performance than a single machine, but soon SETI@home was released and, to me at least, the writing was on the wall that we were not taking the most optimal approach.

In hindsight, though, we were so far off from both an algorithmic perspective and a hardware perspective to actually achieve meaningful results. I am glad that, 20 years later, it seems real progress is being made. I haven't really followed the Folding@home project in many, many years, but it's not clear to me that much came out of it that was all that useful, at least not in practical terms.


Not sure about Folding@home, but the lab that runs Rosetta@home released a paper earlier this month claiming they have a new algorithm with results comparable to AlphaFold 2: https://science.sciencemag.org/content/early/2021/07/19/scie...

I don't believe this new approach runs on their distributed compute network, but it's cool to see some good competition.


Thank you Google...thank you!

Why did they open source it? Wouldn’t this model be very valuable to the pharma industry?

It is explained in the article. A few reasons could be pressure from the publishing journal, emerging open-source implementations of the same idea, and the fact that this is still far from easy to commercialize.

Don't forget internal pressure.

A lot of people will say "Unless you open-source my work and that of my colleagues, I quit".

When faced with all your best people threatening to quit, you might just open-source that work. It turns out you still have an advantage by being ~1 year ahead on applying it to anything, and having all the people who know how it works on your staff.


Another reason could be that whoever wants to run it will very likely run it in the cloud, and there's a chance they'd run it in the Google Cloud. A machine similar to the one they mention on the alphafold github page (12 vCPU, 1 GPU, 85 GB) costs you between $1 and $4/hour.

> Why did they open source it? Wouldn’t this model be very valuable to the pharma industry?

This is a question we should remember when we feel like condemning big corporations for monopolizing AI. HuggingFace lists 12,257 models in its zoo, many coming from FAANG. You can start one in 3 lines of Python, or fine-tune it with a little more effort.
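
The "3 lines" is roughly literal. A sketch using the transformers pipeline API (the model weights download on first use):

  from transformers import pipeline

  classifier = pipeline("sentiment-analysis")
  print(classifier("AlphaFold 2 is here!"))
  # -> [{'label': 'POSITIVE', 'score': ...}]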


As an outsider, it seems Google has a much more academia-friendly culture than other megacap tech companies. I'd guess the talent this culture draws is likely more adamant about their work being open-sourced.

Because the core competency is not the model or code, but the people and organization that enable this project (and perhaps computing infrastructure as well?). The pharma industry will try to catch up of course, but they will also likely try to establish collaboration with DeepMind. This could be a good first step for Google into the medical/pharma business.

If I were to speculate:

1. It is in line with the organization's vision/mission of advancing science.

2. It differentiates them from OpenAI, which, despite the name, is not really big on open source.




To me the most interesting part of the article is the commentary on where basic research is going to happen in the future. The fear is that if it only happens in large companies, then the unbiased pool of experts society relies on will be smaller and less informed. There is also the issue of nobody being around for the slog of defining a field and setting up databases, competitions, and standards. These are what allow well-funded corporate labs to apply their skills and compute and blow a problem out of the water. The problem is, would they do the work to define an unknown problem in the first place?

> unbiased pool of experts

Sounds like an oxymoron these days.


Always has been

> DeepMind claimed that they used “128 TPUv3 cores or roughly equivalent to ~100-200 GPUs”. Although this amount of compute seems beyond the wildest dreams of most academic researchers...

So, we're talking like what? Maybe $100K to $300K of hardware? Wet biology labs often have multiple pieces of $100K+ equipment at their disposal. Why shouldn't computational labs too?


Yeah, but it's also about putting it together and properly utilizing it, which takes specialist knowledge.

This was never really a bottleneck for science. Once people realize that something can be done, it will be done.

The cost of computing will also go down in the future, and for government funds this sounds like a drop in the sea when they are building multi-billion-dollar particle accelerators.


Imagine building a multi-billion-dollar computing cluster solely for research.

I guess the main reason it hasn't been done is that the depreciation is still huge due to chip advancements.



You can rent a 32-core TPUv3 pod slice from Google Cloud at $32 per hour, so 128 cores would be roughly $128 per hour. $1K gives you about 8 hours of training time.

https://cloud.google.com/tpu/pricing#pod-pricing
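
Spelling out the arithmetic with the pricing quoted above:

  # 32-core TPUv3 pod slice at $32/hour => $1 per core-hour
  cores = 128
  hourly_cost = cores * 1.0      # $128/hour
  print(1000 / hourly_cost)      # ~7.8 hours of training per $1K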


Is that not just for the final training run though?

All the experimentation and fine-tuning probably meant thousands of trials, which may have been at a significantly bigger scale before they got the model optimised...


What are the big implications of being good at predicting protein structures?

The goal all along has been to design proteins with a specific structure.

This can be applied to just about any area of biology. You could design novel antigens to combat disease, and then easily mass-produce them. Or just inject the RNA to have the body produce them.

But the applications are boundless, from genetically modifying crops, to anti-aging, and more.

It is also one of the key pathways to molecular nanotechnology: instead of building structures only out of amino acids, we increase the range of arbitrary molecules we can design, build, and produce in quantity.


Is it the structure that's important? Or is the structure just a way to combine certain amino acids in a stable manner, and it's the combination of acids that we care about? Or is structure just a way of saying a specific permutation of amino acids?

The structure is the whole point. As I understand it, you can link together nearly arbitrary sequences of amino acids. But a random string of AAs will just result in a jumbled protein that doesn't do anything useful.

Specific structures are useful in all manner of ways, from cleaving a DNA molecule at a specific point, to enzymes for breaking apart molecules, and so on.

Very, very useful.


>Very, very useful.

Just to frame it a particular way: biological systems are basically solved nanotechnology, extremely good, self-sustaining, resilient little machines that have spent a long time optimizing to get better and better. But all the designs are preset; if we can crack the code and design our own little machines, then amazing things like a more plastic-like cellulose could be made, and all sorts of problems become far easier to solve. But a lot of new problems also emerge that weren't even imaginable before, since the code being cracked is a big chunk of the code of life itself. So, you know, playing God and all; there will probably be some negative consequences of this too.


Yes, I agree with all this.

Generally speaking molecular nanotechnology will solve all the "intractable" problems we as a society face today: climate change, poverty, biological death from old age / disease / cancer, and more.

We could also create tools of destruction so vast, it can be hard to contemplate.


Structure is function at this level. Many proteins simply provide binding sites for a specific molecule. Others have multiple binding sites and combine molecules together into larger ones. All of these interactions are governed by the positions of the atoms in the protein, creating a 3D "lock and key" model for specific molecules to fit.

That remains to be seen. People hope it will lead to new treatments for diseases of all kinds. Whether or not that materializes is a big question mark.

If we can accurately predict protein structures (particularly multiple structures, or structures reflecting what the conformation is in cells), then we can do a couple things:

  - better predict drug binding to proteins (massive benefits if accurate)

  - better understand the functional outcomes of missense mutations on proteins

  - study protein-protein interactions

  - and in general, just gain a better understanding of biology (which is driven by proteins and their reactions/interactions)

More ominously, this makes it easier for the gain-of-function researchers to more accurately engineer their viruses to bind to human receptors.

Or vaccine researchers to more accurately engineer antibodies/drugs to bind to viruses or cancer cells.


So, unsurprisingly, it appears that applying a transformer to multiple sequence alignments extracts somewhat more spatial information about proteins than we had previously been able to squeeze out.

It's pretty clear at this point that the work led to a large improvement in PSP (protein structure prediction) scores, but there's literally nothing else groundbreaking about it. I don't mean that in a bad way, except to criticize all the breathless press about applications and pharma.


Well, it did give groundbreaking results; it is weird to see people dismiss it as "not groundbreaking enough".

It was a nice improvement. That's fine. But it's ultimately just statistical modelling based on deep evolutionary information. It only works via homology modelling; it doesn't actually solve the larger protein structure prediction problem. Therefore it's not groundbreaking, but a significant improvement.

It's perfectly reasonable to describe a very large improvement as groundbreaking.

I respectfully disagree. AlphaFold 2 demonstrated almost perfect performance for a multitude of proteins for which no meaningful templates were available -- hence, it was not doing homology modelling as it is generally understood, but ab initio protein structure prediction.

What I would support is that AlphaFold 2 does not solve the protein folding problem: how a protein folds, as opposed to what it folds into.


How could they do ab initio? They depend on multiple sequence alignments.

If I'm mistaken about this then I'll happily take back what I said, but there's no way that AF2 could work without MSAs; therefore, it is not ab initio.

Ah, OK, I checked the paper again. They're working in the "template" category, which means there is structure-sequence information... maybe the CASP organizers consider this ab initio? The paper never mentions anything about ab initio predictions. Is that what you're saying, that template methods are ab initio?


Just in case there is confusion: there is a difference between available sequences (~300 million in standard protein sequence repositories) and structures (~170k structures in the PDB, perhaps about ~120k that are structurally non-redundant). A large number of CASP14 targets have no available templates; in fact, many of them represented previously unseen topologies. However, all of them had some (in most cases, many) available sequences.

The commonly accepted definition of homology modelling implies using a known structure ("template") as a scaffold to model the protein's topology. Since there are many CASP14 targets without appropriate templates, AlphaFold 2 simply cannot "just do homology modelling".

I do take the point that the correct term is "free modelling" (it does not have, or does not use, any good structure as a template), and not "ab initio modelling" (it uses physics to fold the protein), though. A deep enough MSA is generally a requirement.


Again, it's entirely possible I missed some very subtle point in AF2's system, but my understanding is that each target AF2 predicted had an underlying structural template covering the majority of the domain and the mapping was established through the MSA.

I.e., any MSAs would always include alignments to known protein structures. Are you saying their MSAs don't include alignments to known protein structures?

(The reason I'm asking all this is because if I'm mistaken, then AF2 did do something "interesting", but everything in the paper says that what they did is template-based. If they are just folding proteins using MSAs without alignments to protein structures, that's far more interesting. I don't think they did that.)

edit: I've now reread the paper again, and I believe their claim of making predictions where there is no structural homology is incorrect from a technical perspective. I've communicated this to both the CASP organizers (whom I know) and DeepMind.


Yes: they predict structures using MSAs, without alignments to known protein structures in a majority of the cases.

OK, if that's truly accurate, then they did make a significant accomplishment. However, I'm 99% certain (from reading the paper) that they actually do have alignments to structures, but the similarity is very low.

It would help if you could point to one of the alignments they made that has no underlying structural support (even a template fragment).

I reread the methods section, https://static-content.springer.com/esm/art%3A10.1038%2Fs415...

They train jointly on the results of genetic search and template search. Can you show an example of a prediction made using only genetic search and not template search? Those templates are FASTAs made from PDB files, which, while not homology modelling, is definitely not "ab initio".


I've been in communication with several different teams and leaders at CASP and I've confirmed that this does appear to be the case.

I'm going to be a bit skeptical, but if that's the case, then it really is a significant improvement. Glad to see that, with just the idea, the academic community was able to reach near parity in a short time, demonstrating there was nothing unique to DM except their huge amount of compute, storage, and talent, and this would have happened in the next CASP anyway.


> it was not doing homology modelling as it is generally understood, but ab initio protein structure prediction.

Maybe according to the current definition of the term, which has drifted over the years. Homology modeling and "ab initio" structure prediction have been drifting toward each other for a long time. These days, the categories are separated by (an essentially arbitrary) sequence identity threshold. If you have a protein sequence with high homology to some other protein with a structure, then you're homology modeling. If you have no matches at all, you're doing "ab initio". In the middle, you have a gray area where you can mix the approaches and call it whatever you like.

This is not a pedantic point. If your method requires homology -- however distant and fragmented -- in order to work, then you're always limited to the knowledge in the database. Maybe we've sampled enough of protein space to get the major folds, but certainly, the databases don't have enough information to get the small details right.

I have never been a huge believer in the idea that we can go directly from protein sequence to protein structure simply using a mathematical model of physics, but that is the original meaning of "ab initio structure prediction", and if you could do it, it would be far more valuable than alphafold. At risk of making a trivially nerd-snipable metaphor, it's kind of like the difference between google translate and a theoretical model of human intelligence that understands concepts and can generate language. The latter is obviously immensely more capable than the former.


If CASP is calling methods that use any sequence similarity (the grey area) 'ab initio', that's disingenuous and intellectually dishonest.

Ab initio means from nothing, and at most you're allowed to have physically inspired force fields, not sequence similarity to known structures. I put a lot of effort into improving the state of the art in that area, but ultimately concluded it made more sense to concentrate experimental structure determination in the area that was most useful: proteins that had unknown folds or no known homology (see https://scholar.google.com/citations?view_op=view_citation&h... for some previous work I did in this area).


> If CASP is calling methods that use any sequence similarity (the grey area) 'ab initio', that's disingenuous and intellectually dishonest.

The category is given the name, not the methods. People can use any method they like to solve the structures. The organizers are not zealots.

The ab initio portion of CASP consists of proteins that the organizers know have low sequence identity to anything in the existing databases. They represent proteins that are "difficult" to solve using what any practitioner might call homology modeling. That doesn't mean that you can't use a method that takes into account the biological databases -- and essentially all of the good methods do!

For example, the Rosetta method has competed in both the homology modeling and the ab initio categories for many years. They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits.

I haven't paid close attention to CASP in a long time, but I assume the competitor list still has tons of entries from people who cling tightly to the purist vision of ab initio modeling. They don't tend to do very well.


OK, be aware that the person you're correcting has competed in CASP (on a competitor team with Sali) and published papers with Baker on Rosetta methods (my paper is cited in the most recent RoseTTA paper).

"They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits."

That's the best description of what I believe AF2 is doing, but AF2 is being marketed as not depending on any sequence similarity.

If the CASP folks really are saying "if you have 20% sequence identity and use the structure from that alignment it's ab initio"... that's really just totally misleading.

Of course, even ab initio methods are parameterized on biological information; for example, I used AMBER to do MD simulations, and many of the force field terms were determined using spectroscopic data from fragments of biological models. That, however, is ab initio, because nothing even as large as a single amino acid is parameterized.

I'm not saying there's anything wrong with homology modelling, or that the purist vision of ab initio is right. For practical purposes, exploiting subtle structural information through sequence alignment is a very nice way to save enormous amounts of computer time.


> OK, be aware that the person you're correcting has competed in CASP (on a competitor team with Sali) and published papers with Baker on Rosetta methods (my paper is cited in the most recent RoseTTA paper).

OK, great. Me too. I'm not saying anything controversial here. Right from the top of the "ab initio" tab on predictioncenter.org:

"Modeling proteins with no or marginal similarity to existing structures (ab initio, new fold, non-template or free modeling) is the most challenging task in tertiary structure prediction."


I think the more important question to resolve here is: did AlphaFold change anything with respect to structure prediction that enabled them to make accurate predictions in the complete absence of sequence similarity to proteins with known structure?

My understanding is no, they did the equivalent of template modelling, which uses sequence/structure relationships (that are more subtle than the ones you get from homology modelling).

I'm less interested in reconciling my internal mental model of PSP with CASP's than I am in understanding whether AF2 is somehow able to get all the necessary structural constraints through coevolution of amino acid pairs, entirely without some (direct or indirect) learned relationship between sequence similarity and known structures (be it even short fragments like helices).

If they really did do that, and nobody did it before, that's great, and I will happily promote the DM work, as it supports what I said when I did CASP: ML and MD will eventually win, although in a way that exploits the rich sequence-evolution information we have, rather than predominantly by having an accurate force field and good sampling methods.


It seems amazing to me what the transformer can learn to SOTA levels: not just language but also images, video, code, math, and proteins. Replacing so much handmade neural architecture with just one thing that does it all was an amazing step forward.

Just for my reference, what percentage of known but structurally unsolved proteins (a wild guess is good enough) would you consider to be ab initio targets? How many don't have parts in any database?

Here we go again with the hyperbole...very tiring.

How does work like this get funded? It's awesome, but it seems so far removed from... let's say "profit". And there are several teams competing in these things. Are there places that really fund advanced work like this, or is it mostly graduate student underpaid labor?

Government funding (e.g., DARPA), or funding from large corporations that have skunkworks teams (Google, IBM, Microsoft, etc.).

Most of the US researchers who do CASP are funded by the NIH or NSF. Some are funded by private foundations, or are independently wealthy. Typically, as a "principal investigator" (postdoc, professor, scientist at a national lab) you write a proposal saying "here's my previous work, here's the next obvious step, plz give monies so I can feed the dean's fund and pay for my grad students to manage my modest closet cluster".

A group of your competitors then trashes your proposal, and if you've properly massaged the right backs, you get a pittance, which permits you to struggle to keep up with all your promises.


Yikes that sounds bleak.

Yet, there were still 136 human teams who competed in CASP14 (https://predictioncenter.org/casp14/docs.cgi?view=groupsbyna...), including DeepMind. Even if a significant fraction of these projects were done piggy-backing another grant, this work does receive research funding.

Be fair. Many of those rows contain duplicate names (identical teams), so the count is much smaller.

It's just that corporations burn money to show off.

https://venturebeat.com/2020/12/27/deepminds-big-losses-and-...


Cool. How many years closer has this brought us to a $1 pill that extends life span by 1 year? 'Cos let's face it, that's the only prize that really matters in this whole field.

So... could anyone with experience in the area give an estimate of how much the likelihood of an unstoppable, untraceable "DIY" bioweapon appearing in the next decade has increased thanks to this?

So I don't have experience in the area, but I'd give it about 0%.

To paraphrase Derek Lowe a lot (see, e.g., https://blogs.sciencemag.org/pipeline/archives/2021/03/19/ai...), there are several hard problems in biology, and the kind of progress embodied in AlphaFold isn't progress towards the rate-limiting problems. And many of the things that make drugs hard to develop are going to carry over into making bioweapons hard to develop.


I'm not an expert, but in a recent article about new mRNA synthesis techniques, the researchers were asked the same question. The answer was that there are already lots of potential bioweapons and many simpler techniques for producing them, so these new technologies don't change the danger level much.

You can already make a pretty terrifying bioweapon with GoF research / CRISPR / etc. Better ability to design your own proteins doesn't move the needle much, and is still much harder than the other methods.

I believe it's already straightforward for a decent bio lab at a large university to synthesize dangerous viruses. So it would only be a matter of time until they could selectively add to or subtract from the genetic code to create exceptionally dangerous mutations. You don't need any protein folding to do that.

https://www.theguardian.com/science/2014/jun/11/crazy-danger...


I wonder if this is the result of us having a significantly better understanding of our biology, or of major advancements in AI and computer performance. Or both?

I'm curious how this will affect the Folding@Home project; will it open up possibilities, or will this type of approach be ill-suited to a long-distance distributed volunteer effort because of the memory footprint (and bandwidth?) needed?
