Sorry for the off-topicness. I will not consider it rude if you stop reading here and reply with “just shut up”—but I do think that it is important:
A) I do agree that the first problem to address should probably be misalignment of the rewards with our values, and that some of the proposed problems are unlikely in practice, including some versions of the planning-inside-worldmodel example.
B) I do not think that planning inside the critic or evaluating inside the actor are examples of that, because the functions that those two models are optimized to approximate reference each other explicitly in their definitions. That doesn’t mean the critic is likely to one day kill us, just that we should take this mutual dependence into account when we try to understand what is going on.
C) Specifically, it implies 2 additional non-exotic alignment failures:
The critic itself did not converge to be a good approximation of the value function.
The actor did not converge to something that maximizes the output of the critic, and it maximizes something else instead.
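To make the mutual reference in B concrete, here is a minimal tabular actor-critic sketch on a two-armed bandit. This is my own toy example, not anything from the post: the critic is fit to rewards gathered under the actor's current policy, while the actor is fit to increase the critic's output rather than the true reward. Each of the two failure modes above corresponds to one of these coupled updates going wrong.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_REWARDS = np.array([0.2, 0.8])  # arm 1 is the better arm

q = np.zeros(2)        # critic: estimated value of each arm
logits = np.zeros(2)   # actor: softmax policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(5000):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)  # actor picks an arm
    r = TRUE_REWARDS[a]         # environment reward

    # Critic objective references the actor: it is trained on the rewards
    # of whatever actions the current policy actually takes.
    q[a] += 0.1 * (r - q[a])

    # Actor objective references the critic: the policy-gradient signal
    # is the critic's estimate q[a], not the true reward r.
    grad = -probs
    grad[a] += 1.0
    logits += 0.1 * q[a] * grad

probs = softmax(logits)
```

Failure mode 1 would show up as `q` drifting away from `TRUE_REWARDS`; failure mode 2 as `probs` failing to concentrate on `argmax(q)`. In this toy run both happen to converge, but the two checks are logically independent, which is why they are separate failure modes.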