I agree, I just think that virtually all of the ‘big’ issues talked about are probably not possible with current models, including mesa optimizers. Architecturally, they may not be achievable in the search space of “find the function parameters that minimize error on <this enormous amount of text, or this enormous amount of robotics problems>”.
Deception theoretically has a cost, and the direction of optimization would push against it: you’re asking for the smallest representation that correctly predicts the output. So at least with these forms of training + architectures (transformer variants, both for LLMs and robotics), this particular flaw May. Not. Happen.
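To make that concrete, here is a minimal sketch of the outer objective being described (generic PyTorch-style toy code with made-up sizes, not any real training setup): the only signal the optimizer ever receives is next-token prediction error on the text.

```python
# Toy sketch of the outer objective: the loss only scores next-token
# prediction error on the training text. Nothing in it represents or rewards
# internal goals, agents, or deception; anything like that would have to
# appear inside the weights "for free", which is what's being doubted above.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch = 100, 64, 32, 8

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
opt = torch.optim.AdamW(params, lr=1e-3)

tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))  # stand-in for the enormous corpus
inputs, targets = tokens[:, :-1], tokens[:, 1:]

causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
logits = head(encoder(embed(inputs), mask=causal_mask))      # (batch, seq_len, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()  # the update direction is purely "predict the corpus better"
```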
It’s precisely what you were saying with your example: the actual compiler flaws are both different and, as it turns out, way worse. (“Sydney” wasn’t a mesa optimizer; it was channeling a character that exists somewhere in the training corpus. The model was Working As Intended.)
Didn’t they demonstrate that transformers could be mesa-optimizers? (I never properly understood the paper, so it’s a genuine question.) “Uncovering Mesaoptimization Algorithms in Transformers”
From the paper:

“Motivated by our findings that attention layers are attempting to implicitly optimize internal objective functions, we introduce the mesa-layer, a novel attention layer that efficiently solves a least-squares optimization problem, instead of taking just a single gradient step towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering more interpretability.”
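As far as I can tell (this is just my reading of the abstract, with my own toy dimensions and variable names, not the paper’s implementation), the distinction is roughly this: plain linear self-attention amounts to a single gradient step on an internal least-squares problem over the context, while the mesa-layer solves that inner problem in closed form.

```python
# Toy illustration of the quoted distinction, not the paper's actual code.
# Inner problem: fit W so that W @ key_i ≈ value_i for the tokens in context,
# then apply the fitted W to the current query.
import numpy as np

rng = np.random.default_rng(0)
d, t = 16, 32                    # feature dim, number of context tokens
K = rng.normal(size=(t, d))      # keys   = inputs of the inner regression
V = rng.normal(size=(t, d))      # values = targets of the inner regression
q = rng.normal(size=(d,))        # current query
lam, lr = 1e-2, 0.1              # ridge strength, inner learning rate

# (a) one gradient step on ||K @ W.T - V||^2 starting from W = 0:
#     W_1 is proportional to V.T @ K, so W_1 @ q = sum_i v_i * (k_i . q),
#     i.e. ordinary (unnormalized) linear attention.
W_one_step = lr * V.T @ K
out_linear_attention = W_one_step @ q

# (b) the mesa-layer idea: solve the ridge least-squares problem exactly,
#     W* = V.T @ K @ (K.T @ K + lam * I)^-1, and apply W* to the query.
W_star = V.T @ K @ np.linalg.inv(K.T @ K + lam * np.eye(d))
out_mesa = W_star @ q

print(out_linear_attention[:3])
print(out_mesa[:3])
```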
It looks like you can analyze transformers, discover the internal patterns that form emergently, work out which ones work best, and then redesign your network architecture to start with an extra layer that has this pattern already present, as in the sketch below.
Not only is this closer to the human brain, but yes, it’s adding a type of internal mesa optimizer. Doing it deliberately, instead of letting one form emergently from the data, probably prevents the failure mode AI doomers are worried about: this layer allowing the machine to defect against humans.
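For what it’s worth, here is a rough, purely illustrative sketch of what “start with an extra layer that has this pattern already present” could look like: a hand-designed least-squares layer at the bottom of an otherwise ordinary transformer stack. The layer design and names are mine, not the paper’s architecture, and causality is ignored for brevity.

```python
# Hypothetical sketch: build the discovered "solve a least-squares problem"
# pattern into the first layer of the stack, rather than waiting for it to
# emerge from training. Purely illustrative, not the paper's architecture.
import torch
import torch.nn as nn

class LeastSquaresLayer(nn.Module):
    """Fits a ridge regression from keys to values over the whole sequence,
    then maps each position's query through the fitted solution."""
    def __init__(self, d_model: int, lam: float = 1e-2):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.lam = lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        d = x.shape[-1]
        eye = self.lam * torch.eye(d, device=x.device)
        # W* = V^T K (K^T K + lam I)^-1, solved per batch element
        w = v.transpose(1, 2) @ k @ torch.linalg.inv(k.transpose(1, 2) @ k + eye)
        return x + q @ w.transpose(1, 2)   # residual connection

class HybridModel(nn.Module):
    def __init__(self, vocab: int = 100, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.mesa = LeastSquaresLayer(d_model)          # pattern built in up front
        self.rest = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.rest(self.mesa(self.embed(tokens))))

logits = HybridModel()(torch.randint(0, 100, (2, 16)))  # -> (2, 16, 100)
```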