See this post for discussion of some of these things.
Other points beyond those made in that post:
The easiest way to think about performance is in terms of marginal impact.
There will be non-convexities—e.g. if you need to get 3 things right to get a prize, and you currently get 0 things right, then the marginal effect of getting an additional thing right is 0 and you can be stuck at a local optimum. My schemes tend to punt these issues to the overseer, e.g. the overseer can choose to penalize the first mistake based on their beliefs about the value function of the trained system rather than the current system.
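As a rough illustration of the prize example, here is a minimal Python sketch. The function names and the particular shaped reward are hypothetical, chosen only for illustration. Linear per-item credit is just one assumed way an overseer might penalize each mistake, not a description of any specific scheme.

```python
def prize_reward(num_correct: int, needed: int = 3) -> float:
    """Threshold reward: 1.0 only if at least `needed` things are right."""
    return 1.0 if num_correct >= needed else 0.0


def shaped_reward(num_correct: int, needed: int = 3) -> float:
    """Hypothetical overseer-shaped reward (an assumption for illustration):
    each correct thing earns equal credit, reflecting a belief that it will
    matter once the trained system gets the other things right too."""
    return min(num_correct, needed) / needed


def marginal_gain(reward_fn, num_correct: int) -> float:
    """Change in reward from getting exactly one more thing right."""
    return reward_fn(num_correct + 1) - reward_fn(num_correct)


# Marginal gain of one more correct thing, starting from 0, 1, and 2 correct.
gains_threshold = [marginal_gain(prize_reward, k) for k in range(3)]
gains_shaped = [marginal_gain(shaped_reward, k) for k in range(3)]

print(gains_threshold)  # [0.0, 0.0, 1.0]
print(gains_shaped)     # every entry is positive
```

Under the threshold reward, a system that currently gets 0 things right sees no marginal signal, which is the local optimum described above. Under the shaped reward, every step toward the prize is credited.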
To the extent that any decision-maker has to deal with similar difficulties, your criticism only makes sense in the context of some alternative unaligned AI that might outcompete the current AI. One alternative is the not-feedback-optimizing cognition of a system produced by gradient descent on some arbitrary goal (let’s call it an alien). In this case, I suspect my proposal would be able to compete iff informed oversight worked well enough to reflect the knowledge that the aliens use for long-term planning.
Note that catastrophe avoidance isn’t intended to overcome the linear approximation. It’s intended to prevent the importance weights from blowing up too much. (Though as we’ve discussed, it can’t do that in full generality—I’m going to shovel some stuff under “an AI that is trying to do the right thing” and grant that we aren’t going to actually get the optimal policy according to the overseer’s values. Instead I’m focused on avoiding some class of failures that I think of as alignment failures.)
I’m not including issues like “you want your AI to be predictable.” I’d say that “be very predictable” is a separate problem, just like “be really good at chess” is a separate problem. I agree that our preferences are better satisfied by AIs that solve these additional problems. And I agree that if our alignment techniques are fundamentally incompatible with other techniques that help with these desiderata, then that should be considered an open problem for alignment (though we may end up disagreeing about its importance or about whether this happens).