So I do think you can get feedback on the related question of “can you write a critique of this action that makes us think we wouldn’t be happy with the outcomes” as you can give a reward of 1 if you’re unhappy with the outcomes after seeing the critique, 0 otherwise.
And this alone isn’t sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn’t be happy with the outcome, which is then where you’d need to get into recursive evaluation or debate or something. But this feels like “hard but potentially tractable problem” and not “100% doomed”. Or at least the failure story needs to involve more steps like “sure critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it’s bad” or “the consequences are so complicated the system can’t explain them to us in the critique and get high reward for it”
ETA: So I’m assuming the story for feedback on reliably doing things in the world you’re referring to is something like “we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power it generates” or something like that, and I agree this is easier than “are we actually happy with the outcome”
So I do think you can get feedback on the related question of “can you write a critique of this action that makes us think we wouldn’t be happy with the outcomes” as you can give a reward of 1 if you’re unhappy with the outcomes after seeing the critique, 0 otherwise.
And this alone isn’t sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn’t be happy with the outcome, which is then where you’d need to get into recursive evaluation or debate or something. But this feels like “hard but potentially tractable problem” and not “100% doomed”. Or at least the failure story needs to involve more steps like “sure critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because it can write a critique of any action that makes us believe it’s bad” or “the consequences are so complicated the system can’t explain them to us in the critique and get high reward for it”
ETA: So I’m assuming the story for feedback on reliably doing things in the world you’re referring to is something like “we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power it generates” or something like that, and I agree this is easier than “are we actually happy with the outcome”