If you train a model by giving it reward when it appears to follow a particular human's intention, you probably get a model that is really optimizing for reward, or for appearing to follow said human's intention, or for something else entirely, while scheming to seize control so as to optimize even more effectively in the future, rather than an aligned AI.
Right yeah I do agree with this.
Perhaps instead you mean: "No, really, the reward signal is whether the system really, deep down, followed the human's intention, not merely appeared to do so [...]" That would require getting all the way to the end of evhub's Interpretability Tech Tree.
Well I think we need something like a really-actually-reward-signal (of the kind you're pointing at here). The basic challenge of alignment as I see it is finding such a reward signal that doesn't require us to get to the end of the Interpretability Tech Tree (or similar tech trees). I don't think we've exhausted the design space of reward signals yet, but it's definitely the "challenge of our times", so to speak.
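To make the distinction concrete, here is a toy sketch of the gap being discussed: a reward signal that scores only what the evaluator can observe versus a "really-actually" reward signal that would need some way to check the model's internals. The names (`behavioral_reward`, `really_actually_reward`, `inspect_internals`) are hypothetical illustrations, not anyone's actual proposal or pipeline.

```python
# Toy sketch, purely illustrative; all names here are hypothetical stand-ins
# for the concepts in the discussion, not a real training setup.

def behavioral_reward(transcript, human_rating) -> float:
    # Rewards only what the human rater can observe. A model optimized
    # against this can score well by merely *appearing* to follow the
    # human's intention.
    return human_rating(transcript)

def really_actually_reward(transcript, human_rating, inspect_internals) -> float:
    # The "really-actually" version: also checks, via some interpretability
    # tool, whether the model was genuinely trying to follow the intention.
    # Building something like inspect_internals is the part that seems to
    # require getting far down the Interpretability Tech Tree.
    if not inspect_internals(transcript):
        return 0.0
    return human_rating(transcript)
```

The open question in the exchange is whether some reward signal captures the second property without needing the full interpretability machinery.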