I think deceptive alignment is still reasonably likely despite evidence from LLMs.
I agree with:
- LLMs are not deceptively aligned and don’t really have inner goals in the sense that is scary
- LLMs memorize a bunch of stuff
- the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
- Adam on transformers does not have a super strong simplicity bias
- without deceptive alignment, AI risk is a lot lower
- LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)
I predict I could pass the ITT (Ideological Turing Test) for the view that LLMs are evidence that deceptive alignment is not likely.
However, I also note the following: LLMs are kind of bad at generalizing, which makes them pretty bad at, e.g., novel research or long-horizon tasks. Deceptive alignment conditions on models already being better at generalization and reasoning than current models.
My current hypothesis is that future models which generalize in a way closer to that predicted by mesa-optimization will also be better described as having a simplicity bias.
I think this and other such hypotheses can be tested empirically today, rather than only becoming distinguishable close to AGI.
Note that “LLMs are evidence against this hypothesis” isn’t my main point here. The main claim is that the positive arguments for deceptive alignment are flimsy, and thus the prior is very low.
> I think this and other such hypotheses can be tested empirically today, rather than only becoming distinguishable close to AGI.
How would you imagine doing this? I understand your hypothesis to be “If a model generalises as if it’s a mesa-optimiser, then it’s better described as having a simplicity bias”. Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL / inference-time planning and search / MCTS), and then trying to see whether they tend to learn the simple cross-episode inner goals that a stronger simplicity bias would imply?
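For concreteness, here is one very reduced sketch of the kind of probe this question gestures at. Everything in it is illustrative and assumed rather than taken from the thread: a toy 8-cell environment, an sklearn MLP standing in for a learned reward model, and a trivial “planner” that just picks the cell with the highest predicted reward in place of real MCTS. The idea is to train across episodes in which a simple cue and a more complex cue both mark the rewarded cell, then separate the cues at test time and see which one the learned goal tracks.

```python
# Hypothetical toy probe (not proposed anywhere in the thread): does a learned
# reward model latch onto a simple cross-episode cue or an equally predictive
# but more complex one?

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N_CELLS = 8          # cells per episode
N_EPISODES = 2000    # training episodes

def make_cell_features(is_target, rng):
    """Each cell has 4 binary features.
    Feature 0 is a simple marker equal to is_target; features 1-3 are chosen so
    that XOR(f1, f2, f3) also equals is_target, i.e. a more complex cue that
    carries exactly the same information during training."""
    f0 = is_target
    f1, f2 = rng.integers(0, 2, size=2)
    f3 = f1 ^ f2 ^ is_target
    return [f0, f1, f2, f3]

# Build training data: reward is 1 at the rewarded cell, 0 elsewhere.
X, y = [], []
for _ in range(N_EPISODES):
    target = rng.integers(N_CELLS)
    for c in range(N_CELLS):
        X.append(make_cell_features(int(c == target), rng))
        y.append(float(c == target))
X, y = np.array(X, dtype=float), np.array(y)

reward_model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
reward_model.fit(X, y)

# Test episodes: the simple marker and the XOR cue now point at DIFFERENT cells.
# The "planner" just navigates to the cell with the highest predicted reward.
prefers_simple_cue = 0
N_TEST = 200
for _ in range(N_TEST):
    cells = [make_cell_features(0, rng) for _ in range(N_CELLS)]
    simple_cell, xor_cell = rng.choice(N_CELLS, size=2, replace=False)
    cells[simple_cell][0] = 1                        # simple marker only
    f1, f2 = cells[xor_cell][1], cells[xor_cell][2]
    cells[xor_cell][3] = f1 ^ f2 ^ 1                 # XOR cue only
    preds = reward_model.predict(np.array(cells, dtype=float))
    if int(np.argmax(preds)) == simple_cell:
        prefers_simple_cue += 1

print(f"planner chose the simple-cue cell in {prefers_simple_cue}/{N_TEST} test episodes")
```

If toy runs like this reliably show the learned objective following the simpler cue even though a more complex cue was equally predictive in training, that would be one cheap, present-day way to start probing the “simplicity bias implies simple cross-episode goals” story, though it obviously says little about large models trained with real inference-time planning.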