ojorgensen comments on How likely is deceptive alignment?

ojorgensen 12 Sep 2022 13:12 UTC
3 points
0
I found this post really interesting, thanks for sharing it!
It isn’t obvious to me that the methods for understanding a model in a high path-dependence world become significantly less useful in a low path-dependence world. I think I see why low path-dependence would give us the opportunity to use different methods of analysis, but I don’t see why the high path-dependence ones would no longer be useful.
For example, here is the reasoning behind “how likely is deceptive alignment” in a high path-dependence world (quoted from the slide).
1. We start with a proxy-aligned model
2. In early training, SGD jointly focuses on improving the model’s understanding of the world along with improving its proxies
3. The model learns about the training process from its input data
4. SGD makes the model’s proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purposes of staying around
5. The model’s proxies “crystallize”, as they are no longer relevant to performance, and we reach an equilibrium
Let’s suppose that this reasoning, and the associated justification of why this is likely to arise due to SGD seeking the largest possible marginal performance improvements, are sound for a high path-dependence world. Why does it no longer hold in a low path-dependence world?
- lberglund 14 Sep 2022 13:08 UTC
  1 point
  0
  “Why does it no longer hold in a low path-dependence world?”
  Not sure, but here’s how I understand it:
  If we are in a low path-dependence world, the fact that SGD takes a certain path doesn’t say much about what type of model it will eventually converge to.
  In a low path-dependence world, if these steps occurred to produce a deceptive model, SGD could still “find a path” to the corrigibly aligned version. The question of whether it would find these other models depends on things like “how big is the space of models with this property?”, which corresponds to a complexity bias.
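  To make the distinction concrete, here is a rough toy sketch in Python (my own illustration, not something from the post). It assumes a cartoon picture in which each model class occupies some fraction of parameter space; the `BASIN_VOLUME` numbers and all function names are hypothetical placeholders and say nothing about which class is actually larger. In the high path-dependence case the final model is whatever the early path led to; in the low path-dependence case the early path is ignored and the outcome is drawn in proportion to volume.

```python
import random

# Hypothetical placeholder numbers: the fraction of parameter space assumed to be
# occupied by each model class. Purely illustrative, not estimates from the post.
BASIN_VOLUME = {"deceptive": 0.6, "corrigible": 0.4}


def final_model_high_path_dependence(early_path: str) -> str:
    """High path dependence: wherever early training led is where we end up."""
    return early_path


def final_model_low_path_dependence(early_path: str, rng: random.Random) -> str:
    """Low path dependence: the early path is ignored; the final class is
    sampled with probability proportional to its assumed volume."""
    classes = list(BASIN_VOLUME)
    weights = [BASIN_VOLUME[c] for c in classes]
    return rng.choices(classes, weights=weights, k=1)[0]


def outcome_frequencies(early_path: str, low_path_dependence: bool,
                        trials: int = 10_000, seed: int = 0) -> dict:
    """Run many toy 'training runs' and return how often each class results."""
    rng = random.Random(seed)
    counts = {c: 0 for c in BASIN_VOLUME}
    for _ in range(trials):
        if low_path_dependence:
            final = final_model_low_path_dependence(early_path, rng)
        else:
            final = final_model_high_path_dependence(early_path)
        counts[final] += 1
    return {c: n / trials for c, n in counts.items()}


if __name__ == "__main__":
    for early in ("deceptive", "corrigible"):
        print(early, "high:", outcome_frequencies(early, low_path_dependence=False))
        print(early, "low: ", outcome_frequencies(early, low_path_dependence=True))
```

  In this toy, the high path-dependence frequencies are fully determined by the early path, while the low path-dependence frequencies are the same whichever early path you feed in and roughly track the assumed volumes. That is the sense in which knowing the path "doesn’t say much" in the low path-dependence case; real SGD dynamics are of course far messier than this.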