It seems pretty clear to me that AIs could get really good at understanding and predicting the results of editing model weights, in the same way they can get good at predicting how proteins will fold. From there, directly creating circuits that add XYZ reasoning functionality seems at least possible.
I don’t actually share this intuition.
I don’t think you can get the information that computing the gradient updates to particular weights would give you without actually running that computation (or something equivalent to it).
And presumably one would need empirical feedback (i.e. the value of the objective function we’re optimising the network for on particular inputs) to compute the desired gradient updates.
The idea of the system just predicting the desired gradient updates without any ground truth supervisory signal seems fanciful.
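To make that concrete, here is a minimal sketch (PyTorch; the model and data are toy placeholders, not anything from the discussion) of what a single gradient update actually consumes: you have to run the forward computation on real inputs and evaluate the objective against a ground-truth signal before the update to any particular weight is even defined.

```python
# Minimal sketch: a single SGD update to particular weights requires
# (1) actually running the forward computation and
# (2) a ground-truth target to evaluate the objective against.
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                      # the "particular weights" in question
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(32, 8)                       # empirical inputs
y = torch.randn(32, 1)                       # supervisory signal / ground truth

loss = nn.functional.mse_loss(model(x), y)   # must run the computation on real data
loss.backward()                              # gradients only exist w.r.t. this loss
opt.step()                                   # the update is a function of both
```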
Ehh, protein folding feels equally fanciful to me: figuring out how the protein will fold without actually simulating the physical interactions.
Meanwhile we have humans already editing model weights to change model behavior in desired ways: https://www.lesswrong.com/posts/gRp6FAWcQiCWkouN5/maze-solving-agents-add-a-top-right-vector-make-the-agent-go
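Roughly, that post adds a vector to a layer's activations to steer the agent's behavior. As a toy illustration of the general shape of that kind of intervention (a PyTorch forward hook; the model, layer, and steering vector below are made up, not the post's actual setup):

```python
# Toy sketch of an "add a steering vector to a layer's activations" edit.
# Everything here is hypothetical; the linked post uses a trained maze-solving
# agent and a vector derived from its own activations.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

steering_vector = torch.randn(16)            # stand-in for a "top-right"-style direction

def steer(module, inputs, output):
    # Shift this layer's activations in the chosen direction at inference time.
    return output + 3.0 * steering_vector

handle = model[0].register_forward_hook(steer)   # patch the first layer's output
print(model(torch.randn(1, 4)))                  # behavior now reflects the edit
handle.remove()                                  # undo the intervention
```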
I agree that it seems possible. I have doubts, though, that predicting the results of editing weights is a more compute-efficient way of causing a model to exhibit the desired behavior than giving it the obvious tools and using fine-tuning / RL to make it able to use those tools, or alternatively just doing the RL/fine-tuning directly. That’s basically the heart of how I interpret the bitter lesson: it’s not that you can’t find more efficient ways to do what DL can do, it’s that when you have a task that humans can do and computers can’t, the approach of “introspect and think really hard about how to approach the task the right way” is outperformed by the approach of “lol more layers go brrrrr”.