My intuition is that the combination of these guarantees is insufficient for good performance and safety.
Say you’re training an agent; then the AI’s policy is π:O→ΔA for some set O of observations and A of actions (i.e. it takes in an observation and returns an action distribution). In general, your utility function will be a nonlinear function of the policy (where we can consider the policy to be a vector of probabilities for each (observation, action) pair). For example, if it is really important for the AI to output the same thing given observation “a” and given observation “b”, then this is a nonlinearity. If the AI is doing something like programming, then your utility is going to be highly nonlinear in the policy, since getting even a single character wrong in the program can result in a crash.
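To make that nonlinearity concrete, here is a minimal sketch (the toy observations, actions, and `consistency_utility` below are illustrative assumptions, not part of the original argument): a utility that rewards answering "a" and "b" the same way is quadratic, not linear, in the policy's probabilities.

```python
# Toy setup (all names here are illustrative assumptions): two observations, two actions.
# The policy is represented as pi[o][a] = P(action a | observation o).
actions = ["x", "y"]

def make_policy(p_x_given_a, p_x_given_b):
    """Build a policy as a dict of per-observation action distributions."""
    return {
        "a": {"x": p_x_given_a, "y": 1.0 - p_x_given_a},
        "b": {"x": p_x_given_b, "y": 1.0 - p_x_given_b},
    }

def consistency_utility(pi):
    """Nonlinear utility: probability that the policy answers "a" and "b" the same way."""
    return sum(pi["a"][act] * pi["b"][act] for act in actions)

always_x = make_policy(1.0, 1.0)
always_y = make_policy(0.0, 0.0)
mixture  = make_policy(0.5, 0.5)   # 50/50 mixture of the two policies above

print(consistency_utility(always_x))  # 1.0
print(consistency_utility(always_y))  # 1.0
print(consistency_utility(mixture))   # 0.5 -- a linear functional would give 1.0
```

Two deterministic policies each score 1, but their 50/50 mixture scores 0.5, which no linear functional of the policy vector can reproduce.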
Say your actual utility function on the AI’s policy is U. If you approximate this utility using average performance, you get this approximation:
$$V_{p,f}(\pi) := \mathbb{E}_{o \sim p,\, a \sim \pi(o)}[f(o,a)]$$
where p is some distribution over observations and f is some bounded performance function. Note that $V_{p,f}$ is linear in the policy (viewed, as above, as a vector of probabilities).
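To spell out the linearity claim: expanding the expectation gives a fixed weighted sum of the policy's entries, with weights $p(o)f(o,a)$ that do not depend on the policy, so mixtures of policies map to mixtures of values:

$$V_{p,f}(\pi) = \sum_{o \in O} p(o) \sum_{a \in A} \pi(a \mid o)\, f(o,a) = \sum_{(o,a)} p(o)\, f(o,a)\, \pi(a \mid o),$$

$$V_{p,f}\big(\lambda \pi_1 + (1-\lambda)\pi_2\big) = \lambda\, V_{p,f}(\pi_1) + (1-\lambda)\, V_{p,f}(\pi_2).$$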
Catastrophe avoidance can handle some nonlinearities. Including catastrophe avoidance, we get this approximation:
$$V_{p,f,c}(\pi) := \mathbb{E}_{o \sim p,\, a \sim \pi(o)}[f(o,a)] - \max_{o \in O}\, c(o, \pi(o))$$
where c is some bounded catastrophe function.
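As a concrete illustration (the toy p, f, c, and the helper `V` below are assumptions made up for the sketch, not part of the original), here is how $V_{p,f,c}$ would be computed for a small policy; the second term is the only place any dependence beyond an average enters, and it only does so through a worst-case penalty.

```python
# A minimal sketch: V_{p,f,c}(pi) = E_{o~p, a~pi(o)}[f(o,a)] - max_o c(o, pi(o)).
observations = ["a", "b"]
actions = ["x", "y"]
p = {"a": 0.5, "b": 0.5}                      # distribution over observations

def f(o, a):
    """Toy bounded performance function."""
    return 1.0 if a == "x" else 0.0

def c(o, action_dist):
    """Toy bounded catastrophe function: weight on "y" given "b" is risky."""
    return action_dist["y"] if o == "b" else 0.0

def V(pi):
    avg_perf = sum(p[o] * sum(pi[o][a] * f(o, a) for a in actions) for o in observations)
    worst_catastrophe = max(c(o, pi[o]) for o in observations)
    return avg_perf - worst_catastrophe

pi = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.6, "y": 0.4}}
print(V(pi))  # 0.75 - 0.4 = 0.35
```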
I don’t see a good argument that, for any U you might have over the policy, there are some easy-to-find p, f, c such that approximately maximizing $V_{p,f,c}$ yields a policy that is nearly as good as if you had approximately maximized U.
Some examples of cases I expect not to work with the linear + catastrophe approximation:
Some decisions are much more important than others, and it’s predictable which ones. (This might be easy to handle with importance sampling but that is an extension of the framework, and you have to handle things like “which observations the AI gets depends on the AI’s policy”)
The importance of a decision depends on the observations and actions of previous rounds (e.g. in programming, typing a bad character is important if no bad characters have been typed yet, and unimportant if the program already contains a syntax error; see the sketch after this list).
The AI has to be predictable; it has to do the same thing given similar-enough observations (this is relevant if you want different AIs to coordinate with each other)
The AI consists of multiple copies that must meet at the same point; or the AI consists of multiple copies that must meet at different points.
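A minimal sketch of the second case above (the toy `program_utility` and `impact_of_mistake_at` helpers are assumptions): the marginal impact of a mistake at a given step depends entirely on whether an earlier step already failed, which a fixed per-step performance function cannot express.

```python
def program_utility(mistakes):
    """1.0 if the program has no mistakes at all, else 0.0."""
    return 0.0 if any(mistakes) else 1.0

def impact_of_mistake_at(history, i):
    """How much utility is lost by adding a mistake at position i."""
    with_mistake = list(history)
    with_mistake[i] = True
    return program_utility(history) - program_utility(with_mistake)

clean_history = [False] * 10
broken_history = [True] + [False] * 9   # a syntax error was already made at step 0

print(impact_of_mistake_at(clean_history, 5))   # 1.0 -- this mistake matters
print(impact_of_mistake_at(broken_history, 5))  # 0.0 -- the program was already broken
```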
You could argue that we should move to an episodic RL setting to handle these; however, I think my arguments continue to apply if you replace “AI takes an action” with “AI performs a single episode”. Episodes have to be short enough that they can be judged efficiently on an individual basis, and the operator’s utility function will be nonlinear in the performance on each of these short episodes.
My criticism here points at a general criticism of feedback-optimization systems. One interpretation of this criticism is that it implies that feedback-optimization systems are too dumb to do relevant long-term reasoning, even with substantial work in reward engineering.
Evolution provides some evidence that feedback-optimization systems can, with an extremely high amount of compute, eventually produce things that do long-term reasoning (though I’m not that confident in the analogy between evolution and feedback-optimization systems). But then these agents’ long-term reasoning is not explained by their optimization of feedback. So understanding the resulting agents as feedback-optimizers is understanding them at the wrong level of abstraction (see this post for more on what “understanding at the wrong level of abstraction” means), and providing feedback based on an overseer’s values would be insufficient to get something the overseer wants.
See this post for discussion of some of these things.
Other points beyond those made in that post:
The easy way to think about performance is in terms of marginal impact.
There will be non-convexities—e.g. if you need to get 3 things right to get a prize, and you currently get 0 things right, then the marginal effect of getting an additional thing right is 0 and you can be stuck at a local optimum. My schemes tend to punt these issues to the overseer, e.g. the overseer can choose to penalize the first mistake based on their beliefs about the value function of the trained system rather than the current system.
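A minimal sketch of that non-convexity (the toy all-or-nothing `reward` is an assumption): the prize only arrives once all 3 things are right, so early marginal improvements look worthless.

```python
def reward(n_things_right):
    return 1.0 if n_things_right == 3 else 0.0

for n in range(3):
    # marginal value of getting one more thing right, starting from n
    print(n, "->", reward(n + 1) - reward(n))
# 0 -> 0.0
# 1 -> 0.0
# 2 -> 1.0   <- all of the signal arrives at the last step; it is flat before that
```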
To the extent that any decision-maker has to deal with similar difficulties, your criticism only makes sense in the context of some alternative unaligned AI that might outcompete the current AI. One alternative is the not-feedback-optimizing cognition of a system produced by gradient descent on some arbitrary goal (let’s call it an alien). In this case, I suspect my proposal would be able to compete iff informed oversight worked well enough to reflect the knowledge that the aliens use for long-term planning.
Note that catastrophe avoidance isn’t intended to overcome the linear approximation. It’s intended to prevent the importance weights from blowing up too much. (Though as we’ve discussed, it can’t do that in full generality—I’m going to shovel some stuff under “an AI that is trying to do the right thing” and grant that we aren’t going to actually get the optimal policy according to the overseer’s values. Instead I’m focused on avoiding some class of failures that I think of as alignment failures.)
I’m not including issues like “you want your AI to be predictable”; I’d say that “be very predictable” is a separate problem, just like “be really good at chess” is a separate problem. I agree that our preferences are better satisfied by AIs that solve these additional problems. And I agree that if our alignment techniques are fundamentally incompatible with other techniques that help with these desiderata, then that should be considered an open problem for alignment (though we may end up disagreeing about the importance / about whether this happens).
One interpretation of this criticism is that it implies that feedback-optimization systems are too dumb to do relevant long-term reasoning, even with substantial work in reward engineering.
If this is true, it seems like a really important point that I need to understand better. Any chance you can surface this argument into a top-level post, so more people can see it and chime in with their thoughts? In particular, I’d like to understand whether the problem is caused by current ML approaches not offering good/useful enough performance guarantees, which might change in the future, or if this is a fundamental problem with ML and feedback-optimization that can’t be overcome. Also, can you suggest ways to test this empirically?
(I also can’t quite tell to what extent Paul’s response has addressed your criticism. If you decide to write a post maybe you can explain that as well?)