Reading the GPT-4 data, playing with it myself, and looking at the RBRM rubric (which is RSI!), I’m struck by the thought that there are extreme limits right now on who can even begin to “POC || GTFO”. That’s kind of a major issue.
Without the equipment and infrastructure/support pipeline, you can do very little. It’s essentially meaningless to play with models small enough to run locally if you can’t train them. In fact, given just how much more capable the new model is, it’s meaningless to try many things without a model sophisticated enough to reason about how to complete a task by “breaking out”, etc.
Only AI company staff have the tools, the eval data (things like the full set of query|answer pairs for ChatGPT, or the question|candidate-answer pairs if you were trying to improve skill on LeetCode, are very valuable), the equipment, and so on.
Even worse, it seems like it’s all or nothing: either someone is at an elite lab, or they don’t have a useful system to play with at all. 2048 A100s were used for LLaMA training.
That’s fewer than maybe 1,000 people worldwide? 10k? Not many.
I mean, looking at the RBRM rubric, I’m struck by the fact that even manual POCs don’t scale. You need to be able to task an unrestricted version of GPT-4, one you have training access to so it can become more specialized for the task, with discovering security vulnerabilities in other systems. You as a human would be telling it what to look for and which strategies to use, while the system is what iterates over millions of permutations.
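To make the shape of that loop concrete, here is a minimal sketch of the division of labor being described; `query_model` and `crashes_target` are hypothetical stand-ins, not any real API, precisely because that access is what most people lack.

```python
# Sketch only: the human picks the strategies, a task-tuned model proposes candidate
# inputs, and a harness checks them against the target. Both helpers below are
# hypothetical stubs standing in for access that outsiders generally do not have.
import itertools

def query_model(strategy: str, seed_input: str) -> list[str]:
    """Hypothetical call to an unrestricted, task-tuned model: given a human-chosen
    strategy and a seed input, propose mutated inputs likely to break the target."""
    return [seed_input + suffix for suffix in ("A" * 64, "%n%n%n", "\x00\xff")]

def crashes_target(candidate: str) -> bool:
    """Hypothetical harness that runs a candidate input against the system under test."""
    return False  # stub

strategies = ["overflow length fields", "format-string abuse", "NUL/truncation handling"]
seeds = ["GET /index.html", "user=admin"]

findings = []
for strategy, seed in itertools.product(strategies, seeds):  # human supplies the search space
    for candidate in query_model(strategy, seed):            # model does the enumeration
        if crashes_target(candidate):
            findings.append((strategy, candidate))
print(f"{len(findings)} candidate vulnerabilities found")
```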
Yes, alignment researchers don’t have access to the specific weights OpenAI is using right now, which would be the ideal real-world security failure to demonstrate. But we have plenty of posited failure conditions that we should be able to demonstrate on our own with standard deep learning tools, or with public, open-source models like the ones from EleutherAI. Figuring out under what conditions Keras lets you create mesa-optimizers, or better yet, figuring out a mesa-objective for a publicly released Facebook model, would do a lot of good.
It’s a little like saying “how are we supposed to prove RCE buffer overflows can happen if we don’t have access to fingerd”? We can at least try to write some sample code first, and if someone skeptical asked us to do that—to design a system with the flaw before trying to come up with solutions—I don’t think I could blame them too much.
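As one example of what that sample code could look like: a toy experiment in stock Keras (the data, model, and proxy setup below are invented for illustration, not taken from any paper) that trains on data where the intended rule and a simpler proxy agree, then tests off-distribution to see which rule the network actually learned. It probes a learned proxy objective rather than a full mesa-optimizer, but it’s the flavor of POC that needs nothing beyond commodity hardware.

```python
# Toy sketch: during training, feature 0 (the intended signal) and feature 1 (a proxy)
# are perfectly correlated; off-distribution they are decorrelated, which reveals
# which rule the network actually internalized.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

n = 5000
signal = rng.integers(0, 2, size=n)
x_train = np.stack([signal, signal], axis=1).astype("float32")  # proxy == signal in training
y_train = signal.astype("float32")                              # label follows the intended signal

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, verbose=0)

# Off-distribution test: the proxy no longer tracks the signal. If accuracy collapses
# toward chance, the model latched onto the proxy rather than the intended objective.
signal_test = rng.integers(0, 2, size=1000)
proxy_test = rng.integers(0, 2, size=1000)
x_test = np.stack([signal_test, proxy_test], axis=1).astype("float32")
_, acc = model.evaluate(x_test, signal_test.astype("float32"), verbose=0)
print(f"off-distribution accuracy: {acc:.2f}")
```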
I agree, I just think that probably virtually all of the ‘big’ issues being talked about are not possible with current models, including mesa-optimizers. Architecturally they may not be achievable in the search space of “find the function parameters that minimize error on <this enormous amount of text, or this enormous amount of robotics problems>”.
Deception theoretically has a cost, and the direction of optimization would push against it: you’re asking for the smallest representation that correctly predicts the output. So at least with these forms of training + architectures (transformer variants, both for LLMs and robotics), this particular flaw May. Not. Happen.
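For what it’s worth, the “cost” intuition is easiest to see when the penalty is explicit: with weight decay, any extra machinery that doesn’t reduce prediction error strictly raises the training loss. Real LLM training relies mostly on implicit simplicity biases rather than a penalty like the one below, so treat this as the cartoon version of the argument.

```python
# Cartoon version of "extra machinery has a cost": with L2 weight decay the objective
# is prediction_loss + 1e-3 * sum(w^2), so parameters that do not pay for themselves
# in reduced error make the total loss strictly worse.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
    tf.keras.layers.Dense(1,
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
])
model.compile(optimizer="adam", loss="mse")
model.summary()  # every weight shown here is charged by the regularizer
```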
It’s precisely what you were saying with your example: the actual compiler flaws are both different and, as it turns out, way worse. (“Sydney” wasn’t a mesa-optimizer; it’s channeling a character that exists somewhere in the training corpus. The model was Working As Intended.)
Didn’t they demonstrate that transformers could be mesa-optimizers? (I never properly understood the paper, so it’s a genuine question.) Uncovering mesa-optimization algorithms in Transformers
From the paper:
Motivated by our findings that attention layers are attempting to implicitly optimize internal objective functions, we introduce the mesa-layer, a novel attention layer that efficiently solves a least-squares optimization problem, instead of taking just a single gradient step towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering more interpretability.
It looks like you can analyze transformers, discover the internal patterns that form emergently, analyze which ones work best, and then redesign your network architecture to start with an extra layer that has this pattern already present.
Not only is this closer to the human brain, but yes, it’s adding a type of internal mesa-optimizer. Doing it deliberately, instead of letting one form emergently from the data, probably prevents the failure mode AI doomers are worried about: this layer allowing the machine to defect against humans.
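To make the abstract’s contrast concrete, here is a toy numpy sketch (mine, not the paper’s code) of in-context linear regression done two ways: a single gradient step from zero, roughly what linear self-attention is argued to implement, versus the closed-form least-squares solve that a mesa-layer-style layer would compute outright.

```python
# Toy contrast: one gradient step on the least-squares loss vs. solving it in closed form.
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=3)
X = rng.normal(size=(32, 3))          # in-context examples
y = X @ w_true + 0.01 * rng.normal(size=32)
x_query = rng.normal(size=3)          # token we want to predict for

# One gradient step from w = 0 (roughly what linear self-attention is argued to implement)
lr = 0.01
w_step = lr * X.T @ (y - X @ np.zeros(3)) / len(X)

# Closed-form ridge solution (what a mesa-layer-style layer would compute in one shot)
lam = 1e-3
w_solve = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print("one-step prediction :", x_query @ w_step)
print("solved prediction   :", x_query @ w_solve)
print("true value          :", x_query @ w_true)
```

In this setup the solved prediction lands essentially on the true value while the single-step prediction does not, which is the gap the mesa-layer is built to close.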