Overall, strong upvote, I like this post a lot, these seem like good updates you’ve made.

> we think that mesa-optimizers will primarily use a complicated stack of heuristics that takes elements from different clean optimization procedures. In the future, these internal heuristics might be combined with external optimization procedures like calculators or physics engines. This is similar to how humans that play chess don’t actually run a tree-search of depth n with alpha-beta pruning in their heads.

I agree. Heuristic-free search seems very inefficient and inappropriate for real-world intelligence.

> we think it will be much harder to learn something about search in a toy model and transfer that to a larger model because the kind of mesa-optimization is much more messy and diverse than this hypothesis assumes.

I agree, but mainly as an argument against direct insight transfer from toy models to real-world models. If you don’t yet know how to do this kind of analysis for anything, much less for how e.g. an adult would plan real-world takeover, start simple IMO.

> Second, we expect that when general-purpose models like GPT-3 are playing chess, they do not call an internal optimizer. Instead, they might apply heuristics that either have small components of optimization procedures or are approximations of aspects of explicit optimization. We expect that most of the decisions will come from highly refined heuristics learned from the training data.

First, thanks for making falsifiable predictions; strong upvote for that. Second, I agree with this point. See also my made-up account of what might happen in a kid’s brain when he decides to wander away from his distracting friends. (It isn’t explicit search.)

However, I expect there to be something like… generally useful predictive- and behavior-modifying circuits (aliased to “general-purpose problem-solving module”, perhaps), such that they get subroutine-called by many different value shards, even though I think those subroutines are not going to be MCTS.

> On a more personal note, thinking about this post made us more hopeful that mesa-optimization increases gradually and we thus get a bit of time to study it before it is too powerful, but it made us more pessimistic about finding general tools that can tell us whether the model is currently doing mesa-optimization.

I feel only somewhat interested in “how much mesa-optimization is happening?”, and more interested in “what kinds of cognitive work is being done, and how, and towards what ends?” (i.e. what are the agent’s values, and how well are they being worked towards?)
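(For concreteness, a minimal sketch of the “tree-search of depth n with alpha-beta pruning” that the quoted post contrasts with human chess play. The toy game tree and the `children`/`value` callbacks are hypothetical stand-ins, not anything from the post.)

```python
def alphabeta(state, depth, alpha, beta, maximizing, children, value):
    """Depth-limited minimax with alpha-beta pruning.

    `children(state)` returns successor states; `value(state)` scores a
    position. Both are caller-supplied stand-ins for a real game engine.
    """
    kids = children(state)
    if depth == 0 or not kids:
        return value(state)
    if maximizing:
        best = float("-inf")
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                       False, children, value))
            alpha = max(alpha, best)
            if alpha >= beta:  # opponent would never allow this line: prune
                break
        return best
    best = float("inf")
    for child in kids:
        best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                   True, children, value))
        beta = min(beta, best)
        if beta <= alpha:  # we would never choose this line: prune
            break
    return best


# Toy usage: a 2-ply game tree given as dicts (hypothetical example).
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
leaf_values = {"D": 3, "E": 5, "F": 6, "G": 9}

best = alphabeta("A", depth=2, alpha=float("-inf"), beta=float("inf"),
                 maximizing=True,
                 children=lambda s: tree.get(s, []),
                 value=lambda s: leaf_values[s])
# best == 6: the maximizer picks the C branch, whose worst case is 6.
```

The claim in the post is that neither human players nor GPT-3 literally execute a recursion like this internally; learned heuristics approximate aspects of it instead.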
Thank you!
I also agree that toy models are better than nothing and that we should start with them, but I have moved away from “if we understand how toy models do optimization, we understand much more about how GPT-4 does optimization”.
I have a bunch of project ideas about how small models do optimization. I even trained the networks already; I just haven’t found the time to interpret them yet. I’m happy for someone to take over the project if they want to. I’m mainly looking for evidence against the outlined hypothesis, i.e. maybe small toy models actually do fairly general optimization. That would definitely update my beliefs.
I’d be super interested in falsifiable predictions about what these general-purpose modules look like. Or maybe even just more concrete intuitions, e.g. what kind of general-purpose modules you would expect GPT-3 to have. I’m currently very uncertain about this.
I agree with your final framing.