The naive way can be done immediately: AlphaFold2 is differentiable end-to-end, so you could use gradient descent to optimise over the space of amino acid sequences (this is a discrete space, but you can treat it as continuous and add penalties to your loss function to bias it towards the discrete points):
Correct sequence = argmin_x || AlphaFold2(x) − Target structure ||
For some notion of distance between structures.
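To make the "naive way" a bit more concrete, here is a minimal sketch of what that relaxed optimisation could look like. Everything in it is a stand-in: `predict_structure` is a hypothetical differentiable structure predictor (the real AlphaFold2 is not a one-line call), the entropy term is just one possible penalty for pushing the relaxed sequence towards discrete one-hot residues, and the learning rate and step count are arbitrary.

```python
# Hypothetical sketch: gradient descent over a continuous relaxation of the
# amino acid sequence, using JAX. `predict_structure` stands in for a
# differentiable structure predictor; this is NOT AlphaFold2's real interface.
import jax
import jax.numpy as jnp

NUM_AA = 20   # amino acid alphabet size
SEQ_LEN = 64  # length of the designed protein

def predict_structure(probs):
    # Placeholder: maps per-residue amino acid probabilities (SEQ_LEN, NUM_AA)
    # to 3D coordinates (SEQ_LEN, 3). A real predictor goes here.
    w = jnp.linspace(0.0, 1.0, NUM_AA * 3).reshape(NUM_AA, 3)
    return probs @ w

def loss(logits, target):
    probs = jax.nn.softmax(logits, axis=-1)         # relax the discrete sequence
    pred = predict_structure(probs)
    structure_dist = jnp.sum((pred - target) ** 2)  # || predictor(x) - target ||^2
    # Entropy penalty biases the relaxation towards discrete (one-hot) residues.
    entropy = -jnp.sum(probs * jnp.log(probs + 1e-9))
    return structure_dist + 0.1 * entropy

target = jnp.zeros((SEQ_LEN, 3))  # the structure we want (placeholder)
logits = jax.random.normal(jax.random.PRNGKey(0), (SEQ_LEN, NUM_AA))

grad_fn = jax.jit(jax.grad(loss))
for step in range(500):
    logits = logits - 0.01 * grad_fn(logits, target)

designed_sequence = jnp.argmax(logits, axis=-1)  # snap back to discrete residues
```

Whether a loop like this lands anywhere useful is exactly the worry below: gradient descent only finds local minima of whatever landscape the predictor defines.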
But is AlphaFold2(x) a (mostly) convex function? More importantly, is the real structure(x) convex?
I can see a potential bias here, in that AlphaFold and inverse AlphaFold might work well for biomolecules because evolution is also a kind of local search, so if it can find a certain solution, AlphaFold will find it, too. But both processes will be blind to a vast design space that might contain extremely useful designs.
Then again, we are biology, so maybe we only care about biomolecules and adjacent synthetic molecules anyway.
Yeah, I agree that there’s probably a bias towards biomolecules right now, but I think that’s relatively easily fixable: use the naive way to predict the amino acid sequence of a structure we want, actually make that protein, check its structure with crystallography, and retrain AlphaFold to predict the right structure. If we do this procedure with sequences that differ more and more from biomolecules, we’ll slowly remove that bias from AlphaFold.
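As a sketch, that debiasing procedure amounts to an active-learning loop. The function below is hypothetical and none of its callables are a real API: `design` stands in for the inverse-design step (e.g. the gradient-descent sketch above), `synthesize_and_solve` for the wet-lab synthesis and crystallography, `retrain` for folding the measured structure back into the model, and `harder_targets` for choosing targets progressively further from natural biomolecules.

```python
# Hypothetical sketch of the debias-by-retraining loop described above.
# The callables are placeholders supplied by the caller, not real APIs.
def debias(model, targets, design, synthesize_and_solve, retrain,
           harder_targets, rounds=10):
    for _ in range(rounds):
        for target in targets:
            sequence = design(model, target)                  # invert the predictor
            true_structure = synthesize_and_solve(sequence)   # make it, measure it
            model = retrain(model, sequence, true_structure)  # correct the prediction
        targets = harder_targets(targets)  # drift further from natural biomolecules
    return model
```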
By “bias” I didn’t mean biases in the learned model, I meant “the class of proteins whose structures can be predicted by ML algorithms at all is biased towards biomolecules”. What you’re suggesting is still within the local search paradigm, which might not be sufficient for the protein folding problem in general, any more than it is sufficient for 3-SAT in general. No sampling is dense enough if large swaths of the problem space are discontinuous.