“Just Retarget the Search” directly eliminates the inner alignment problem.
I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you’re willing to assume that our interpretability tools are so good they can’t ever be tricked, you have to deal with that.
It’s not necessarily a huge issue—hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it’s not just “bada-bing bada-boom” exactly.
Yea, I agree that if you give a deceptive model the chance to emerge then a lot more risks arise for interpretability and it could become much more difficult. Circumventing interpretability: How to defeat mind-readers kind of goes through the gauntlet, but I think one workaround/solution Lee lays out there which I haven’t seen anyone shoot down yet (aside from it seeming terribly expensive) is to run the interpretability tools continuously or near continuously from the beginning of training. This would give us the opportunity to examine the mesa-optimizer’s goals as soon as they emerge, before it has a chance to do any kind of obfuscation.
I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you’re willing to assume that our interpretability tools are so good they can’t ever be tricked, you have to deal with that.
It’s not necessarily a huge issue—hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it’s not just “bada-bing bada-boom” exactly.
Yea, I agree that if you give a deceptive model the chance to emerge then a lot more risks arise for interpretability and it could become much more difficult. Circumventing interpretability: How to defeat mind-readers kind of goes through the gauntlet, but I think one workaround/solution Lee lays out there which I haven’t seen anyone shoot down yet (aside from it seeming terribly expensive) is to run the interpretability tools continuously or near continuously from the beginning of training. This would give us the opportunity to examine the mesa-optimizer’s goals as soon as they emerge, before it has a chance to do any kind of obfuscation.