Strong agree with the main point. It confused me for a long time why people were saying we had no evidence of mesa-optimizers existing, and it made me think I was getting something very wrong. I disagree with this line, though:
> ChatGPT using chain of thought is plausibly already a mesaoptimizer.
I think simulacra are better thought of as sub-agents (in the original paper's terminology) than as mesa-optimizers, and ChatGPT doesn't seem to be doing anything qualitatively different on this front. The Assistant simulacrum can be seen as doing optimization (depending on your definition of the term), but the fact that jailbreaks can get the underlying model to adopt different simulacra suggests to me that it's still running on the simulator mechanism. Moreover, if we get GPT-3-level models that are optimizers at the simulator level, I expect things would look very different.