AlphaFold 2 is here: what’s behind the structure prediction miracle (www.blopig.com)
241 points by couteiral | 2021-07-20 | 98 comments




> Like most bioinformatics programs, AlphaFold 2 comes equipped with a “preprocessing pipeline”, which is the discipline’s lingo for “a Bash script that calls some other codes”.

Requiring bioinformatics people to stray a long way from their core competency and learn a scripting language from the '80s to write glue code seems... suboptimal. How many hours of expert time have been wasted figuring out how to split a string in Bash?

Can we software people build a better tool to eliminate the need for this?


I think the tools already exist; it's just that people are conservative with their choices.

(Obligatory xkcd; you know the one)

Not really?


MLOps is a pretty hot area right now. The industry is trying to figure out how to engineer these things in a robust way, but it's not ubiquitous at all yet. There are lots of tools that wrap K8s and help you train models, but for doing DataOps in a robust way... I haven't seen the definitive answer yet.

Seriously, please tell me if you are founding this company so I can invest.


Pretty much anyone who works with data has to clean it, and learning to use the tools to do that is important. Whether that's Bash, Perl, R, Python, ... doesn't really matter that much. If they already know Bash, then Bash is a good tool, since they can now focus on their data instead of wasting time learning a new tool to do the same thing.

It's sad. Bio people really need friends in software to boost their research. Unfortunately, they do a Codecademy Python course for 5 minutes and try to get their projects going. Sometimes they succeed, sometimes they fail. But they don't really have much time to dedicate to properly learning software dev, and it's not what they are into anyway; it's a necessity.

I think we could create something like a GitHub of bio projects that need help, where people assist in the hope of getting their names on a paper.


I think DeepMind went one step further and solved the entire problem for them. They don't even have to touch their keyboards anymore

Sure, there are better alternatives, but the advantage of Bash / shell scripting is that it's very easy to glue a whole collection of tools together, and that expertise in this transfers well between domains.

They probably could have achieved the same by invoking things in Python, but it would have been slower and not achieved a lot, other than “not using shell scripts”.

And once you go down the path of optimizing this enough, you’ll end up reinventing shell scripts altogether.
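
To make the verbosity gap concrete, here's a minimal sketch (hits.txt is a hypothetical input file): the one-line shell pipeline versus the same pipeline plumbed by hand in Python.

  import subprocess

  # Shell glue, one line:  sort hits.txt | uniq -c
  # The same two-stage pipeline spelled out explicitly:
  sort = subprocess.Popen(["sort", "hits.txt"], stdout=subprocess.PIPE)
  uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout, stdout=subprocess.PIPE)
  sort.stdout.close()  # let sort receive SIGPIPE if uniq exits early
  out, _ = uniq.communicate()
  print(out.decode())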


Well, AlphaFold 2 generates MSA by invoking things in Python: https://github.com/deepmind/alphafold/blob/main/alphafold/da.... So the article is actually mistaken on this point.
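
For reference, the pattern there is a thin Python wrapper that shells out to the search tool. A minimal sketch (the jackhmmer flags below are illustrative assumptions, not AlphaFold's actual invocation):

  import subprocess

  def run_jackhmmer(query_fasta: str, database: str, output_sto: str) -> None:
      # Illustrative flags, not AlphaFold's exact command line.
      cmd = [
          "jackhmmer",
          "-A", output_sto,  # save the multiple sequence alignment
          "--noali",         # keep stdout small
          query_fasta,
          database,
      ]
      subprocess.run(cmd, check=True)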

From looking at the code, Bash looks pretty clean.

I also use Bash and AWK for preprocessing a lot.


I used to be that guy as well, till a colleague convinced me that anything I can do in Bash or AWK I could probably do more easily in Perl. Then everyone sort of drifted to Python. I get that if you never used Perl it's pointless to learn it if you're already in the Python stack, but... damn, Perl's regular expressions and how they're so baked into the syntax of the language make using regex in Python seem like going back to the Stone Age.
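
To make the comparison concrete, a small sketch (the Perl appears only as a comment): Perl's match operator is part of the language's syntax, while Python needs an import and explicit method calls.

  import re

  line = "sample_42 scored 0.93"

  # Perl:  if ($line =~ /(\d+)/) { print $1; }
  m = re.search(r"(\d+)", line)
  if m:
      print(m.group(1))  # -> 42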

Bash is easier to explain and use than, e.g., teaching people how to use Python's subprocess module to launch different apps, capture their output, etc.

I find it astonishing how bad Python is as a Bash replacement.

I'd often rather write an argument parser in Bash than use Python if I have to invoke a bunch of commands.


Python is bad, but bash is worse as soon as you need any kind of logic.

> teaching people how to use Python's subprocess module to launch different apps, capture their output

There's no shame in using `os.system`.


Well, technically subprocess.check_output if you need to capture the output.
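
A quick sketch of the difference, using nothing beyond the standard library:

  import os
  import subprocess

  # os.system: runs via the shell and returns only an exit status;
  # the command's output goes straight to the terminal.
  status = os.system("ls -l /tmp")

  # subprocess.check_output: returns the command's stdout as a string
  # and raises CalledProcessError on a non-zero exit.
  listing = subprocess.check_output(["ls", "-l", "/tmp"], text=True)
  print(listing)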

There is a growing trend to include Docker (or Singularity, which is more compatible with the HPC architectures common in bioinformatics) images alongside the code. In particular, AlphaFold 2 does provide a Dockerfile, and they even include a Python "launcher script" hiding all the details of running the code.

Sadly, this is very uncommon in the community. In a bioinformatics meeting, the sentence "I spent X days setting up Y software" will not raise many eyebrows.


I work in power systems, and the situation is similar. Maybe worse, because paper authors often come up with new computational techniques but don't implement them in code (much less code with a Dockerfile).

I look with much jealousy over at the computer science field where papers often include code, multiple versions under version control, automated tests, setup/docker scripts, and demonstration workflows and interfaces.


> Can we software people build a better tool to eliminate the need for this?

Most probably not. Bash is currently the sweet spot; it is actually the best tool for this job. Any other option comes with increased complexity and will make the whole thing less stable.


> How many hours of expert time have been wasted figuring out how to split a string in Bash?

Probably a good many more than would be needed to learn how to split a string in Python.
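
For comparison, a minimal sketch, one line in each language; the Bash equivalent (one of several, which is rather the point) appears as a comment.

  # Bash:  IFS=',' read -ra fields <<< "$line"
  line = "gene,chromosome,start,end"
  fields = line.split(",")
  print(fields)  # ['gene', 'chromosome', 'start', 'end']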


Among the bioinformatics folks I know, bash is already a core competency. If you’re using your average biologist as your mental model, you’re thinking of the wrong people.

I don't see this as a bad thing. Shell scripts are a great way to prototype text processing pipelines using multiple smaller components.

Awesome.

I wrote a thesis on protein structure prediction in 1995. We weren't very good at it then. Amazing to see this.


I remember that the scientific game Foldit was also quite exciting when it came out 12-13 years ago or so, since populations of players could get results beyond what either specialists or computer systems could achieve. I guess one could argue that an AI such as this could be compared to a large automated population of trained players trying to solve a 3D puzzle.

I did some undergraduate research on this around 1999. At the time we were trying to prove that we could throw more firepower at the problem by building a Beowulf cluster. After a bit of tweaking, we were able to get more performance than a single machine, but soon SETI@home was released and, to me at least, the writing was on the wall that we were not taking the most optimal approach.

In hindsight, though, we were so far off from both an algorithmic perspective and a hardware perspective to actually achieve meaningful results. I am glad that, 20 years later, it seems real progress is being made. I haven't really followed the Folding@home project in many, many years, but it's not clear to me that much came out of it that was all that useful, at least not in practical terms.


Not sure about Folding@home, but the lab that runs Rosetta@home released a paper earlier this month claiming they have a new algorithm with results comparable to AlphaFold 2: https://science.sciencemag.org/content/early/2021/07/19/scie...

I don't believe this new approach runs on their distributed compute network, but it's cool to see some good competition.


Thank you Google...thank you!

Why did they open source it? Wouldn’t this model be very valuable to the pharma industry?

It is explained in the article. A few reasons could be pressure from the publishing journal, emerging open-source implementations of the same idea, and the fact that this is still far from easy to commercialize.

Don't forget internal pressure.

A lot of people will say "Unless you open-source my work and that of my colleagues, I quit".

When faced with all your best people threatening to quit, you might just open-source that work. It turns out you still have an advantage by being ~1 year ahead on applying it to anything, and having all the people who know how it works on your staff.


Another reason could be that whoever wants to run it will very likely run it in the cloud, and there's a chance they'd run it in the Google Cloud. A machine similar to the one they mention on the alphafold github page (12 vCPU, 1 GPU, 85 GB) costs you between $1 and $4/hour.

> Why did they open source it? Wouldn’t this model be very valuable to the pharma industry?

This is a question we should remember when we feel like condemning big corporations for monopolizing AI. HuggingFace lists 12,257 models in its zoo, many coming from FAANG. You can start one in 3 lines of Python, or fine-tune it with a little more effort.
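
The "3 lines" is roughly literal. A sketch using the transformers pipeline API (the model weights download on first use):

  from transformers import pipeline

  classifier = pipeline("sentiment-analysis")
  print(classifier("AlphaFold 2 is here!"))
  # -> [{'label': 'POSITIVE', 'score': ...}]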


As an outsider, it seems Google has a much more academia-friendly culture than other megacap tech companies. I'd guess the talent this culture draws is likely more adamant about their work being open-sourced.

Because the core competency is not the model or code, but the people and organization that enable this project (and perhaps computing infrastructure as well?). The pharma industry will try to catch up of course, but they will also likely try to establish collaboration with DeepMind. This could be a good first step for Google into the medical/pharma business.

If I were to speculate:

1. It is in line with the organization's vision/mission of advancing science.

2. It differentiates them from OpenAI, which, despite the name, is not really big on open source.




To me the most interesting part of the article is the commentary on where basic research is going to happen in the future. The fear is that if it only happens in large companies, then the unbiased pool of experts society relies on will be smaller and less informed. There is also the issue of nobody being around for the slog of defining a field and setting up databases, competitions, and standards. These are what allow well-funded corporate labs to apply their skills and compute and blow a problem out of the water. The problem is, would they do the work to define an unknown problem in the first place?

> unbiased pool of experts

Sounds like an oxymoron these days.


Always has been

> DeepMind claimed that they used “128 TPUv3 cores or roughly equivalent to ~100-200 GPUs”. Although this amount of compute seems beyond the wildest dreams of most academic researchers...

So, we're talking like what? Maybe $100K to $300K of hardware? Wet biology labs often have multiple pieces of $100K+ equipment at their disposal. Why shouldn't computational labs too?


Yeah, but it's also about putting it together and properly utilizing it, which takes specialist knowledge.

This was never really a bottleneck for science. Once people realize that something can be done, it will be done.

The cost of computing will also go down in the future, and for government funds this sounds like a drop in the sea when they are building multi-billion-dollar particle accelerators.


Imagine building a multi-billion-dollar computing cluster solely for research.

I guess the main reason it hasn't been done is that the depreciation is still huge due to chip advancements.



You can rent a 32-core TPUv3 pod slice from Google Cloud at $32 per hour, so 128 cores would be roughly $128 per hour. $1K gives you about 8 hours of training time.

https://cloud.google.com/tpu/pricing#pod-pricing
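
Spelling out the arithmetic with the pricing quoted above:

  # 32-core TPUv3 pod slice at $32/hour => $1 per core-hour
  cores = 128
  hourly_cost = cores * 1.0      # $128/hour
  print(1000 / hourly_cost)      # ~7.8 hours of training per $1K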


Is that not just for the final training run though?

All the experimentation and fine-tuning probably meant thousands of trials, which may have been at a significantly bigger scale before they got the model optimised...


What are the big implications of being good at predicting protein structures?

The goal all along has been to design proteins with a specific structure.

This can be applied to just about any area of biology. You could design novel antigens to combat disease, and then easily mass-produce them. Or just inject the RNA to have the body produce them.

But the applications are boundless, from genetically modifying crops, to anti-aging, and more.

It is also one of the key pathways to molecular nanotechnology: instead of building structures only out of amino acids, we increase the range of arbitrary molecules we can design, build, and produce in quantity.


Is it the structure that's important? Or is the structure just a way to combine certain amino acids in a stable manner, and it's the combination of acids that we care about? Or is structure just a way of saying a specific permutation of amino acids?

The structure is the whole point. As I understand it, you can link together nearly arbitrary sequences of amino acids. But a random string of AAs will just result in a jumbled protein that doesn't do anything useful.

Specific structures are useful in all manner of ways, from cleaving a DNA molecule at a specific point, to enzymes for breaking apart molecules, and so on.

Very, very useful.


>Very, very useful.

Just to frame it a particular way: biological systems are basically solved nanotechnology, extremely good, self-sustaining, resilient little machines that have spent a long time optimizing to get better and better. But all the designs are preset; if we can crack the code and design our own little machines, then amazing things like a more plastic-like cellulose could be made, and all sorts of problems become far easier to solve. But a lot of new problems also emerge that weren't even imaginable before, since the code being cracked is a big chunk of the code of life itself. So, you know, playing God and all; there will probably be some negative consequences of this too.


Yes, I agree with all this.

Generally speaking molecular nanotechnology will solve all the "intractable" problems we as a society face today: climate change, poverty, biological death from old age / disease / cancer, and more.

We could also create tools of destruction so vast, it can be hard to contemplate.


Structure is function at this level. Many proteins simply provide binding sites for a specific molecule. Others have multiple binding sites and combine molecules together into larger ones. All of these interactions are governed by the positions of the atoms in the protein, creating a 3D "lock and key" model for specific molecules to fit.

That remains to be seen. People hope it will lead to new treatments for diseases of all kinds. Whether or not that materializes is a big question mark.

If we can accurately predict protein structures (particularly multiple structures, or structures reflecting what the conformation is in cells), then we can do a couple things:

  - better predict drug binding to proteins (massive benefits if accurate)

  - better understand the functional outcomes of missense mutations on proteins

  - study protein-protein interactions

  - and in general, just gain a better understanding of biology (which is driven by proteins and their reactions/interactions)

More ominously, this makes it easier for the gain-of-function researchers to more accurately engineer their viruses to bind to human receptors.

Or vaccine researchers to more accurately engineer antibodies/drugs to bind to viruses or cancer cells.


So, unsurprisingly, it appears that applying a transformer to multiple sequence alignments extracts somewhat more spatial information about proteins than we had previously been able to squeeze out.

It's pretty clear at this point that the work led to a large improvement in PSP (protein structure prediction) scores, but there's literally nothing else groundbreaking about it. I don't mean that in a bad way, except to criticize all the breathless press about applications and pharma.


Well, it did give groundbreaking results; it is weird to see people dismiss it as "not groundbreaking enough".

It was a nice improvement. That's fine. But it's ultimately just statistical modelling based on deep evolutionary information. It only works via homology modelling; it doesn't actually solve the larger protein structure prediction problem. Therefore it's not groundbreaking, but a significant improvement.

It's perfectly reasonable to describe a very large improvement as groundbreaking.

I respectfully disagree. AlphaFold 2 demonstrated almost perfect performance for a multitude of proteins for which no meaningful templates were available -- hence, it was not doing homology modelling as it is generally understood, but ab initio protein structure prediction.

What I would support is that AlphaFold 2 does not solve the protein folding problem: how a protein folds, as opposed to what it folds into.


How could they do ab initio? They depend on multiple sequence alignments.

If I'm mistaken about this then I'll happily take back what I said, but there's no way that AF2 could work without MSAs; therefore, it is not ab initio.

Ah, OK, I checked the paper again. They're working in the "template" category, which means there is structure-sequence information... maybe the CASP organizers consider this ab initio? The paper never mentions anything about ab initio predictions. Is that what you're saying, that template methods are ab initio?


Just in case there is confusion: there is a difference between available sequences (~300 million in standard protein sequence repositories) and structures (~170k structures in the PDB, perhaps about ~120k that are structurally non-redundant). A large number of CASP14 targets have no available templates; in fact, many of them represented previously unseen topologies. However, all of them had some (in most cases, many) available sequences.

The commonly accepted definition of homology modelling implies using a known structure ("template") as a scaffold to model the protein's topology. Since there are many CASP14 targets without appropriate templates, AlphaFold 2 simply cannot "just do homology modelling".

I do take the point that the correct term is "free modelling" (it does not have, or does not use, any good structure as a template), and not "ab initio modelling" (it uses physics to fold the protein), though. A deep enough MSA is generally a requirement.


Again, it's entirely possible I missed some very subtle point in AF2's system, but my understanding is that each target AF2 predicted had an underlying structural template covering the majority of the domain and the mapping was established through the MSA.

I.e., any MSAs would always include alignments to known protein structures. Are you saying their MSAs don't include alignments to known protein structures?

(The reason I'm asking all this is because if I'm mistaken, then AF2 did do something "interesting", but everything in the paper says that what they did is template-based. If they are just folding proteins using MSAs without alignments to protein structures, that's far more interesting. I don't think they did that.)

edit: I've now reread the paper again, and I believe their claim of making predictions where there is no structural homology is incorrect from a technical perspective. I've communicated this to both the CASP organizers (whom I know) and DeepMind.


Yes: they predict structures using MSAs, without alignments to known protein structures in a majority of the cases.

OK, if that's truly accurate, then they did make a significant accomplishment. However, I'm 99% certain (from reading the paper) that they actually do have alignments to structures, but the similarity is very low.

It would help if you could point to one of the alignments they made that has no underlying structural support (even a template fragment).

I reread the methods section, https://static-content.springer.com/esm/art%3A10.1038%2Fs415...

They train jointly on the results of genetic search and template search. Can you show an example of a prediction made using only genetic search and not template search? Those templates are FASTAs made from PDB files, which, while not homology modelling, is definitely not "ab initio".


I've been in communication with several different teams and leaders at CASP and I've confirmed that this does appear to be the case.

I'm going to be a bit skeptical, but if that's the case, then it really is a significant improvement. Glad to see that, with just the idea, the academic community was able to reach near parity in a short time, demonstrating there was nothing unique to DM except their huge amount of compute, storage, and talent, and this would have happened in the next CASP anyway.


> it was not doing homology modelling as it is generally understood, but ab initio protein structure prediction.

Maybe according to the current definition of the term, which has drifted over the years. Homology modeling and "ab initio" structure prediction have been drifting toward each other for a long time. These days, the categories are separated by (an essentially arbitrary) sequence identity threshold. If you have a protein sequence with high homology to some other protein with a structure, then you're homology modeling. If you have no matches at all, you're doing "ab initio". In the middle, you have a gray area where you can mix the approaches and call it whatever you like.

This is not a pedantic point. If your method requires homology -- however distant and fragmented -- in order to work, then you're always limited to the knowledge in the database. Maybe we've sampled enough of protein space to get the major folds, but certainly, the databases don't have enough information to get the small details right.

I have never been a huge believer in the idea that we can go directly from protein sequence to protein structure simply using a mathematical model of physics, but that is the original meaning of "ab initio structure prediction", and if you could do it, it would be far more valuable than alphafold. At risk of making a trivially nerd-snipable metaphor, it's kind of like the difference between google translate and a theoretical model of human intelligence that understands concepts and can generate language. The latter is obviously immensely more capable than the former.


If CASP is calling methods that use any sequence similarity (the grey area) 'ab initio', that's disingenuous and intellectually dishonest.

Ab initio means from nothing, and at most you're allowed to have physically inspired force fields, not sequence similarity to known structures. I put a lot of effort into improving the state of the art in that area, but ultimately concluded it made more sense to concentrate experimental structure determination in the area that was most useful: proteins that had unknown folds or no known homology (see https://scholar.google.com/citations?view_op=view_citation&h... for some previous work I did in this area).


> If CASP is calling methods that use any sequence similarity (the grey area) 'ab initio', that's disingenuous and intellectually dishonest.

The category is given the name, not the methods. People can use any method they like to solve the structures. The organizers are not zealots.

The ab initio portion of CASP consists of proteins that the organizers know have low sequence identity to anything in the existing databases. They represent proteins that are "difficult" to solve using what any practitioner might call homology modeling. That doesn't mean that you can't use a method that takes into account the biological databases -- and essentially all of the good methods do!

For example, the Rosetta method has competed in both the homology modeling and the ab initio categories for many years. They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits.

I haven't paid close attention to CASP in a long time, but I assume the competitor list still has tons of entries from people who cling tightly to the purist vision of ab initio modeling. They don't tend to do very well.


OK, be aware that the person you're correcting has competed in CASP (on a competitor team with Sali) and published papers with Baker on Rosetta methods (my paper is cited in the most recent RoseTTA paper).

"They mix a bit of both -- using homology models to get the fold, and fragment insertion to model the floppy bits."

That's the best description of what I believe AF2 is doing, but AF2 is being marketed as not depending on any sequence similarity.

If the CASP folks really are saying "if you have 20% sequence identity and use the structure from that alignment it's ab initio"... that's really just totally misleading.

Of course, even ab initio methods are parameterized on biological information; for example, I used AMBER to do MD simulations, and many of the force field terms were determined using spectroscopic data from fragments of biological models. That, however, is ab initio, because nothing even as large as a single amino acid is parameterized.

I'm not saying there's anything wrong with homology modelling, or that the purist vision of ab initio is right. For practical purposes, exploiting subtle structural information through sequence alignment is a very nice way to save enormous amounts of computer time.


> OK, be aware that the person you're correcting has competed in CASP (on a competitor team with Sali) and published papers with Baker on Rosetta methods (my paper is cited in the most recent RoseTTA paper).

OK, great. Me too. I'm not saying anything controversial here. Right from the top of the "ab initio" tab on predictioncenter.org:

"Modeling proteins with no or marginal similarity to existing structures (ab initio, new fold, non-template or free modeling) is the most challenging task in tertiary structure prediction."


I think the more important question to resolve here is: did AlphaFold change anything with respect to structure prediction that enabled them to make accurate predictions in the complete absence of sequence similarity to proteins with known structure?

My understanding is no, they did the equivalent of template modelling, which uses sequence/structure relationships (that are more subtle than the ones you get from homology modelling).

I'm less interested in reconciling my internal mental model of PSP with CASP's than I am in understanding whether AF2 is somehow able to get all the necessary structural constraints through coevolution of amino acid pairs, entirely without some (direct or indirect) learned relationship between sequence similarity and known structures (be it even short fragments like helices).

If they really did do that, and nobody did it before, that's great, and I will happily promote the DM work, as it supports what I said when I did CASP: ML and MD will eventually win, although in a way that exploits the rich sequence-evolution information we have, rather than predominantly by having an accurate force field and good sampling methods.


It seems amazing to me what the transformer can learn to SOTA levels: not just language but also images, video, code, math, and proteins. Replacing so much handmade neural architecture with just one thing that does it all was an amazing step forward.

Just for my reference, what percentage of known but structurally unsolved proteins (a wild guess is good enough) would you consider to be ab initio targets? How many don't have parts in any database?

Here we go again with the hyperbole...very tiring.

How does work like this get funded? It's awesome, but it seems so far removed from... let's say "profit". And there are several teams competing in these things. Are there places that really fund advanced work like this, or is it mostly graduate student underpaid labor?

Government funding (e.g., DARPA), or funding from large corporations that have skunkworks teams (Google, IBM, Microsoft, etc.).

Most of the US researchers who do CASP are funded by the NIH or NSF. Some are funded by private foundations, or are independently wealthy. Typically, as a "principal investigator" (postdoc, professor, scientist at a national lab) you write a proposal saying "here's my previous work, here's the next obvious step, plz give monies so I can feed the dean's fund and pay for my grad students to manage my modest closet cluster".

A group of your competitors then trashes your proposal, and if you've properly massaged the right backs, you get a pittance, which permits you to struggle to keep up with all your promises.


Yikes that sounds bleak.

Yet, there were still 136 human teams who competed in CASP14 (https://predictioncenter.org/casp14/docs.cgi?view=groupsbyna...), including DeepMind. Even if a significant fraction of these projects were done piggy-backing another grant, this work does receive research funding.

Be fair. Many of those rows contain duplicate names (identical teams), so the count is much smaller.

It's just that corporations burn money to show off.

https://venturebeat.com/2020/12/27/deepminds-big-losses-and-...


Cool. How many years closer has this brought us to a $1 pill that extends life span by 1 year? 'Cos let's face it, that's the only prize that really matters in this whole field.

So... could anyone with experience in the area give an estimate of how much the likelihood of an unstoppable, untraceable "DIY" bioweapon appearing in the next decade has increased thanks to this?

So I don't have experience in the area, but I'd give it about 0%.

To paraphrase Derek Lowe a lot (see, e.g., https://blogs.sciencemag.org/pipeline/archives/2021/03/19/ai...), there are several hard problems in biology, and the kind of progress embodied in AlphaFold isn't progress towards the rate-limiting problems. And many of the things that make drugs hard to develop are going to carry over into making bioweapons hard to develop.


I'm not an expert, but in a recent article about new mRNA synthesis techniques, the researchers were asked the same question. The answer was that there are already lots of potential bioweapons and many simpler techniques for producing them, so these new technologies don't change the danger level much.

You can already make a pretty terrifying bioweapon with GoF research / CRISPR / etc. Better ability to design your own proteins doesn't move the needle much, and is still much harder than the other methods.

I believe it's already straightforward for a decent bio lab at a large university to synthesize dangerous viruses. So it would only be a matter of time until they could selectively add to or subtract from the genetic code to create exceptionally dangerous mutations. You don't need any protein folding to do that.

https://www.theguardian.com/science/2014/jun/11/crazy-danger...


I wonder if this is the result of us having a significantly better understanding of our biology, or of major advancements in AI and computer performance. Or both?

I'm curious how this will affect the Folding@Home project; will it open up possibilities, or will this type of approach be ill-suited to a long-distance distributed volunteer effort because of the memory footprint (and bandwidth?) needed?
