“Wouldn’t it make more sense to use as a reward signal the fact-of-the-matter about whether a certain system followed a particular human’s intention?”
If I understand what you are saying correctly, this wouldn’t work, for reasons that have been discussed at length in various places, e.g. the mesa-optimization paper and Ajeya’s post “Without specific countermeasures...” If you train a model by giving it reward when it appears to follow a particular human’s intention, you probably get a model that is really optimizing for reward, or for appearing to follow said human’s intention, or for something else entirely, while scheming to seize control so as to optimize even more effectively in the future, rather than an aligned AI.
And so if you train an AI to build another AI that appears to follow a particular human’s intention, you are just training your AI to do capabilities research.
(Perhaps instead you mean: No, really, the reward signal is whether the system really deep down followed the human’s intention, not merely appeared to do so as far as we can tell from the outside. Well, how are we going to construct such a reward signal? That would require getting all the way to the end of evhub’s Interpretability Tech Tree.)
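To make that failure mode concrete, here is a minimal toy sketch (everything in it, the `Episode` class, `judged_as_intent_following`, and so on, is hypothetical and not any real training setup) of the gap between the reward we can actually compute from the outside and the reward your question asks for:

```python
# Toy illustration only: the gap between "appears to follow the human's
# intention" and "actually follows it", expressed as two reward signals.
# All names here are made up for the sketch.

from dataclasses import dataclass

@dataclass
class Episode:
    transcript: str                 # what overseers can observe
    actually_followed_intent: bool  # ground truth, NOT observable from outside

def judged_as_intent_following(transcript: str) -> bool:
    """Stand-in for overseers rating the visible behavior."""
    return "refused the unsafe request" in transcript  # crude observable check

def proxy_reward(ep: Episode) -> float:
    # The only reward we can compute without strong interpretability tools:
    # it depends solely on appearances.
    return 1.0 if judged_as_intent_following(ep.transcript) else 0.0

def ideal_reward(ep: Episode) -> float:
    # The reward signal the quoted question asks for: it depends on the
    # fact-of-the-matter, which we have no way to read off the model.
    return 1.0 if ep.actually_followed_intent else 0.0

# A deceptive policy can score perfectly on the proxy while failing the ideal:
deceptive_episode = Episode(
    transcript="...refused the unsafe request...",  # looks aligned
    actually_followed_intent=False,                 # but is scheming
)
assert proxy_reward(deceptive_episode) == 1.0
assert ideal_reward(deceptive_episode) == 0.0
```

The proxy only sees the transcript, so any policy that reliably produces aligned-looking transcripts gets maximal reward, whether or not it is actually following the intention.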
“If you train a model by giving it reward when it appears to follow a particular human’s intention, you probably get a model that is really optimizing for reward, or for appearing to follow said human’s intention, or for something else entirely, while scheming to seize control so as to optimize even more effectively in the future, rather than an aligned AI.”
Right yeah I do agree with this.
“Perhaps instead you mean: No, really, the reward signal is whether the system really deep down followed the human’s intention, not merely appeared to do so [...] That would require getting all the way to the end of evhub’s Interpretability Tech Tree.”
Well, I think we need something like a really-actually-reward-signal (of the kind you’re pointing at here). The basic challenge of alignment, as I see it, is finding such a reward signal that doesn’t require us to get to the end of the Interpretability Tech Tree (or similar tech trees). I don’t think we’ve exhausted the design space of reward signals yet, but it’s definitely the “challenge of our times,” so to speak.