I like this post and generally agree with the decomposition into the two different types of deception. I prefer to call them outer and inner deception, since one happens when deception is incentivized by the base objective and the other when it is incentivized by the mesa-objective. I thought about this a while back, and after talking to Evan about it I came away convinced that Goodhart/outer deception is a lot easier to tackle if you already have consequentialist/inner deception nailed down (not sure to what extent Evan endorses this, and I don’t want to misattribute my misunderstanding to him). I should probably revisit some of this stuff and make a post about it.
As for the “where is the deception” question, I don’t really know if the distinction is that sharp. My intuition is that lots of systems in practice will be some weird mix of both, and I don’t really expect it to be interpretable either way anyway (even in the pure “in the weights” case, you can have versions of it that are arbitrarily obfuscated and computationally intractable to detect).
Yeah, I think this is Evan’s view. This is from his research agenda (I’m guessing you might have already seen it given your comment, but I’ll add it here for reference in case others are interested):
“I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them.”
And I think his view on deception through inner optimisation pressure is that it’s something we’ll basically be powerless to deal with once it happens, so the only way to make sure it doesn’t happen is to chart a safe path through model space that never enters the deceptive region in the first place.