Keeping all this in mind, it’s important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization, a generalization strategy being how you are going to handle novelty in the inputs. … If it’s possible, even easy, to fingerprint the shared initialization that models were trained from using an out-of-distribution token, just by observing the responses, then we should update towards JDP’s plan for mitigating mesaoptimizers working.
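As a rough illustration of what such a fingerprinting check could look like, here is a minimal sketch in Python. The model identifiers, the choice of out-of-distribution prompt, and the KL-divergence comparison are all placeholder assumptions for illustration, not a validated procedure, and the comparison only makes sense if the two checkpoints share a tokenizer/vocabulary:

```python
# Hypothetical sketch: check whether two checkpoints plausibly share an
# initialization by comparing their next-token distributions on an
# out-of-distribution prompt. Model names and the interpretation of the
# resulting KL value are made up for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_distribution(model_name: str, prompt: str) -> torch.Tensor:
    """Return the model's next-token probability distribution for `prompt`."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    return torch.softmax(logits, dim=-1)

# A known glitch-like, out-of-distribution token for some GPT tokenizers.
ood_prompt = " SolidGoldMagikarp"

# Hypothetical model identifiers; substitute whatever checkpoints you want to compare.
p = next_token_distribution("model-finetune-a", ood_prompt)
q = next_token_distribution("model-finetune-b", ood_prompt)

# Low divergence on OOD inputs (relative to a baseline pair of unrelated models)
# would be weak evidence that the two checkpoints share an initialization.
kl = torch.sum(p * (torch.log(p) - torch.log(q))).item()
print(f"KL(p || q) on OOD prompt: {kl:.4f}")
```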
Nitpick about terminology that applies not just to you but to lots of people:
People seem to sometimes use “mesaoptimizers” as shorthand for “misaligned mesaoptimizers.” They sometimes say things like “We haven’t yet got hard empirical evidence that mesaoptimizers are a problem in practice” and “mesaoptimizers are a hypothetical problem that can occur with advanced AI systems.” All of this is misleading, IMO. If you go back to the original paper and look at the definition of a mesaoptimizer, it’s pretty clear that pretty much any AGI built using deep learning will be a mesaoptimizer and moreover ChatGPT using chain of thought is plausibly already a mesaoptimizer. The question is whether they’ll be aligned or not, i.e. whether their mesa-objectives will be the same as the ‘intended’ or ‘natural’ base objective inherent in the reward signal.
Strong agree with the main point; it confused me for a long time why people were saying we had no evidence of mesa-optimizers existing, and it made me think I was getting something very wrong. I disagree with this line though:
ChatGPT using chain of thought is plausibly already a mesaoptimizer.
I think simulacra are better thought of as sub-agents, in the original paper’s terminology, than as mesa-optimizers, and ChatGPT doesn’t seem to be doing anything qualitatively different on this front. The Assistant simulacrum can be seen as doing optimization (depending on your definition of the term), but the fact that jailbreak methods can get the underlying model to adopt different simulacra seems to me to show that it’s still using the simulator mechanism. Moreover, if we get GPT-3-level models that are optimizers at the simulator level, I expect things would look very different.
The critical issue is whether consequentialist mesaoptimizers will arise. If consequentialist mesaoptimizers don’t arise, as in the thread linked here, then much of the safety concern is gone:
https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent#pbEciBKsk86xmcgqb
Any agentic AGI built via deep learning will almost by definition be a consequentialist mesaoptimizer (in the broad sense of consequentialism you are talking about, I think). It’ll be performing some sort of internal search to choose actions, while SGD, or whatever the outer training loop is, performs ‘search’ to update its parameters. So, boom, base optimizer and mesa optimizer.
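To make that base/mesa split concrete, here is a toy sketch in PyTorch. It is purely illustrative, with made-up names and a made-up reward: the outer SGD loop is the base optimizer searching over parameters, and the trained model then picks actions by internally searching over candidate actions against its own learned value estimate, which is the mesa-level search.

```python
# Toy illustration: base optimizer = the SGD training loop over parameters;
# mesa-level search = the trained model choosing actions by scoring candidates
# with its own learned value estimate. Everything here is a minimal made-up example.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Learned scoring function over (state, action) pairs."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))

def choose_action(value_net: ValueNet, state: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
    """Mesa-level 'search': evaluate candidate actions and pick the best-scoring one."""
    scores = torch.stack([value_net(state, a) for a in candidates])
    return candidates[scores.argmax()]

state_dim, action_dim = 4, 2
value_net = ValueNet(state_dim, action_dim)

# Base optimizer: SGD searching over the network's parameters.
opt = torch.optim.SGD(value_net.parameters(), lr=1e-2)
for _ in range(100):
    state = torch.randn(state_dim)
    action = torch.randn(action_dim)
    reward = -torch.sum((state[:2] - action) ** 2)    # stand-in reward signal
    loss = (value_net(state, action) - reward) ** 2   # fit the value estimate
    opt.zero_grad()
    loss.backward()
    opt.step()

# Mesa-level search: at run time the model itself searches over candidate actions.
state = torch.randn(state_dim)
candidates = torch.randn(16, action_dim)
print("chosen action:", choose_action(value_net, state, candidates))
```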
Quoting LawrenceC from that very thread:
<begin quote>
Well, no, that’s not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:
A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system
And the following definition of a mesa-optimizer:
Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.
<end quote>
The “mesa” part is pretty trivial. Humans are mesaoptimizers relative to the base optimizer of evolution. If an AGI is an optimizer at all, it’s a mesaoptimizer relative to the process that built it—human R&D industry if nothing else, though given deep learning it’ll probably be gradient descent or RL.
I think the critical comment that I wanted to highlight was nostalgebraist’s comment in that thread.