I think deceptive alignment is still reasonably likely despite evidence from LLMs.
I agree with:
- LLMs are not deceptively aligned and don’t really have inner goals in the sense that is scary
- LLMs memorize a bunch of stuff
- the kinds of reasoning that feed into deceptive alignment do not predict LLM behavior well
- Adam on transformers does not have a super strong simplicity bias
- without deceptive alignment, AI risk is a lot lower
- LLMs not being deceptively aligned provides nonzero evidence against deceptive alignment (by conservation of evidence)
I predict I could pass the ITT (Ideological Turing Test) for the view that LLMs are evidence that deceptive alignment is not likely.
However, I also note the following: LLMs are kind of bad at generalizing, which makes them pretty bad at, e.g., novel research or long-horizon tasks. Deceptive alignment conditions on models already being better at generalization and reasoning than current models.
My current hypothesis is that future models which generalize in a way closer to that predicted by mesa-optimization will also be better described as having a simplicity bias.
I think this and other such hypotheses can be tested empirically today, rather than only becoming distinguishable close to AGI.
Note that “LLMs are evidence against this hypothesis” isn’t my main point here. The main claim is that the positive arguments for deceptive alignment are flimsy, and thus the prior is very low.
> I think this and other such hypotheses can be tested empirically today, rather than only becoming distinguishable close to AGI.
How would you imagine doing this? I understand your hypothesis to be “If a model generalises as if it’s a mesa-optimiser, then it’s better described as having a simplicity bias”. Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL / inference-time planning and search / MCTS), and then trying to see whether they tend to learn the simple cross-episode inner goals that a stronger simplicity bias would imply?
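For concreteness, here is one very reduced sketch of the kind of probe this question gestures at. Everything in it is illustrative and assumed rather than taken from the thread: a toy 8-cell environment, an sklearn MLP standing in for a learned reward model, and a trivial “planner” that just picks the cell with the highest predicted reward in place of real MCTS. The idea is to train across episodes in which a simple cue and a more complex cue both mark the rewarded cell, then separate the cues at test time and see which one the learned goal tracks.

```python
# Hypothetical toy probe (not proposed anywhere in the thread): does a learned
# reward model latch onto a simple cross-episode cue or an equally predictive
# but more complex one?

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N_CELLS = 8          # cells per episode
N_EPISODES = 2000    # training episodes

def make_cell_features(is_target, rng):
    """Each cell has 4 binary features.
    Feature 0 is a simple marker equal to is_target; features 1-3 are chosen so
    that XOR(f1, f2, f3) also equals is_target, i.e. a more complex cue that
    carries exactly the same information during training."""
    f0 = is_target
    f1, f2 = rng.integers(0, 2, size=2)
    f3 = f1 ^ f2 ^ is_target
    return [f0, f1, f2, f3]

# Build training data: reward is 1 at the rewarded cell, 0 elsewhere.
X, y = [], []
for _ in range(N_EPISODES):
    target = rng.integers(N_CELLS)
    for c in range(N_CELLS):
        X.append(make_cell_features(int(c == target), rng))
        y.append(float(c == target))
X, y = np.array(X, dtype=float), np.array(y)

reward_model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
reward_model.fit(X, y)

# Test episodes: the simple marker and the XOR cue now point at DIFFERENT cells.
# The "planner" just navigates to the cell with the highest predicted reward.
prefers_simple_cue = 0
N_TEST = 200
for _ in range(N_TEST):
    cells = [make_cell_features(0, rng) for _ in range(N_CELLS)]
    simple_cell, xor_cell = rng.choice(N_CELLS, size=2, replace=False)
    cells[simple_cell][0] = 1                        # simple marker only
    f1, f2 = cells[xor_cell][1], cells[xor_cell][2]
    cells[xor_cell][3] = f1 ^ f2 ^ 1                 # XOR cue only
    preds = reward_model.predict(np.array(cells, dtype=float))
    if int(np.argmax(preds)) == simple_cell:
        prefers_simple_cue += 1

print(f"planner chose the simple-cue cell in {prefers_simple_cue}/{N_TEST} test episodes")
```

If toy runs like this reliably show the learned objective following the simpler cue even though a more complex cue was equally predictive in training, that would be one cheap, present-day way to start probing the “simplicity bias implies simple cross-episode goals” story, though it obviously says little about large models trained with real inference-time planning.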