Yes, alignment researchers don’t have access to the specific weights OpenAI is using right now, which would be the ideal setting for demonstrating a real-world security failure. But we have plenty of posited failure conditions that we should be able to demonstrate on our own with standard deep learning tools, or with open-source models like the ones from EleutherAI. Figuring out under what conditions Keras lets you create mesa-optimizers, or better yet, figuring out a mesa-objective for a publicly released Facebook model, would do a lot of good.
It’s a little like saying “how are we supposed to prove RCE buffer overflows can happen if we don’t have access to fingerd?” We can at least try to write some sample code first; and if someone skeptical asked us to do that, to build a system exhibiting the flaw before trying to come up with solutions, I don’t think I could blame them too much.
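In that spirit, here is a rough sketch of what a first Keras probe might look like: train a tiny attention model on synthetic in-context linear-regression tasks and check whether its predictions track the closed-form least-squares answer, which is the usual operational signature people point to for “this layer is doing optimization internally.” Everything here (the task setup, model size, hyperparameters, and the correlation check) is my own assumption, not code from any paper:

```python
# Sketch only: probe whether a small Keras attention model trained on synthetic
# in-context regression tasks ends up behaving like an in-context least-squares
# solver. All hyperparameters and the task design are illustrative assumptions.
import numpy as np
import tensorflow as tf

D, K = 4, 16          # input dimension, context points per task
N_TASKS = 20000       # number of training tasks

def make_tasks(n, d=D, k=K, seed=0):
    """Each task: k context pairs (x, w.x) plus one query x whose target is hidden."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(n, d))
    x = rng.normal(size=(n, k + 1, d))
    y = np.einsum("nd,nkd->nk", w, x)                    # targets for all tokens
    tokens = np.concatenate([x, y[..., None]], axis=-1)  # pack (x, y) into one token
    tokens[:, -1, -1] = 0.0                              # hide the query's target
    return tokens.astype("float32"), y[:, -1:].astype("float32")

tokens, targets = make_tasks(N_TASKS)

# Tiny model: embed tokens, one self-attention layer, read out the query position.
inp = tf.keras.Input(shape=(K + 1, D + 1))
h = tf.keras.layers.Dense(64)(inp)
h = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=64)(h, h)
pred = tf.keras.layers.Dense(1)(h[:, -1])
model = tf.keras.Model(inp, pred)
model.compile(optimizer="adam", loss="mse")
model.fit(tokens, targets, batch_size=256, epochs=5, verbose=0)

# Compare against the closed-form least-squares answer on fresh tasks: if the
# trained model tracks it closely, the layer looks more like an in-context
# optimizer than a lookup table. (How closely it tracks is an empirical question.)
test_tokens, _ = make_tasks(1000, seed=1)
ctx_x, ctx_y = test_tokens[:, :-1, :D], test_tokens[:, :-1, -1]
w_hat = np.stack([np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(ctx_x, ctx_y)])
ols_pred = np.einsum("nd,nd->n", w_hat, test_tokens[:, -1, :D])
model_pred = model.predict(test_tokens, verbose=0).squeeze()
print("model vs least-squares correlation:", np.corrcoef(model_pred, ols_pred)[0, 1])
```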
I agree; I just think that virtually all of the ‘big’ issues being discussed are probably not possible with current models, including mesa-optimizers. Architecturally, they may not be achievable within the search space of “find the function parameters that minimize error on <this enormous amount of text, or this enormous amount of robotics problems>”.
Deception theoretically has a cost, and the direction of optimization pushes against it: you’re asking for the smallest representation that correctly predicts the output. So at least with these forms of training and these architectures (transformer variants, for both LLMs and robotics), this particular flaw may simply not happen.
It’s precisely what you were saying with your example: the actual compiler flaws turn out to be both different and far worse. (“Sydney” wasn’t a mesa-optimizer; it was channeling a character that exists somewhere in the training corpus. The model was Working As Intended.)
Didn’t they demonstrate that transformers could be mesa-optimizers? (I never properly understood the paper, so it’s a genuine question.) Uncovering mesa-optimization algorithms in Transformers
From the paper:
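Motivated by our findings that attention layers are attempting to implicitly optimize internal objective functions, we introduce the mesa-layer, a novel attention layer that efficiently solves a least-squares optimization problem, instead of taking just a single gradient step towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering more interpretability.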
It looks like you can analyze transformers, discover the internal patterns that form emergently, figure out which ones work best, and then redesign your network architecture to start with an extra layer that has this pattern built in from the beginning.
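To make the abstract’s contrast concrete, here is a toy numpy sketch (my own, not code from the paper) of the difference between taking a single gradient step on the in-context least-squares objective and solving that objective outright, which is what the mesa-layer is described as doing. The shapes and the learning rate are arbitrary assumptions:

```python
# Toy illustration: one gradient step on an in-context least-squares objective
# versus solving it in closed form. Not the paper's code; sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
K, D = 32, 8
X = rng.normal(size=(K, D))          # in-context inputs seen so far
w_true = rng.normal(size=D)
y = X @ w_true                        # in-context targets (noise-free)
x_query = rng.normal(size=D)

# One gradient step on L(w) = 0.5 * ||X w - y||^2, starting from w = 0
# (roughly the kind of update a linear attention layer can implement implicitly).
eta = 1.0 / K
w_one_step = -eta * X.T @ (X @ np.zeros(D) - y)   # equals eta * X.T @ y

# Full least-squares solve, the computation the mesa-layer performs per query.
w_full = np.linalg.lstsq(X, y, rcond=None)[0]

print("one gradient step  :", x_query @ w_one_step)
print("least-squares solve:", x_query @ w_full)
print("true value         :", x_query @ w_true)
```

With enough noise-free context points the closed-form solve recovers the true value exactly, while the single step generally doesn’t; the quoted abstract’s point, as I read it, is that you can build that solve into the layer rather than hoping training discovers it.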
Not only is this closer to the human brain, but yes, it’s adding a type of internal mesa-optimizer. Doing it deliberately, instead of letting one form emergently from the data, probably prevents the failure mode AI doomers are worried about: this layer allowing the machine to defect against humans.