I’m curious what the DeepMind safety team’s (or any team member’s personal) view is on the problem that it’s not safe to assume that the human user is a generally safe agent. For example, we only know the user to be safe for a narrow range of inputs/environments, and they seem very likely to become unsafe if quickly shifted far outside of their “training distribution”, as may happen if the AI becomes very intelligent and starts to heavily modify the user’s environment in pursuit of its rewards. In principle this could perhaps be considered a form of “reward hacking”, but section 4.3 on reward hacking makes no specific mention of this problem, and I don’t see it mentioned anywhere else. (In contrast, Paul’s agenda at least tries to address a subset of this problem.)
Is this problem discussed somewhere else in the blog post or paper, that I missed? Would you consider solving this problem to be part of this research agenda, or part of DeepMind’s safety responsibility in general?
This is an obviously important problem! When we put a human in the loop, we have to be confident that the human is actually aligned—or at least that they realize when their judgement is not reliable in the current situation, and defer to some other fallback process or ask for additional assistance. We are definitely thinking about this problem at DeepMind, but it’s outside the scope of this paper and the technical research direction that we are proposing to pursue here. Instead, we zoom in on one particular aspect: how to solve the agent alignment problem in the context of aligning a single agent to a single user, because we think it is the hardest technical aspect of the alignment problem.
I’m glad to hear that the DeepMind safety team is thinking about this problem and look forward to reading more about your thoughts on it. However, I don’t think putting human safety problems outside of “aligning a single agent to a single user” is a natural way to divide up the problem space, because there are likely ways to address human safety problems within that context. (See these two posts which I wrote after posting my comment here.)
I would go further and say that without an understanding of the ways in which humans are unsafe, and how that should be addressed, it’s hard to even define the problem of “aligning a single agent to a single user” in a way that makes sense. To illustrate this, consider an analogy where the user is an AI that has learned a partial utility function, which gives reasonable outputs to a narrow region of inputs and a mix of “I don’t know” and random extrapolations outside of that. If another agent tries to help this user by optimizing over this partial utility function and ends up outside the region where it gives sensible answers, is that agent aligned to the user? A naive definition of alignment that doesn’t take the user’s own lack of safety into account would answer yes, but I think that would be intuitively unacceptable to many people.
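To make the analogy concrete, here is a minimal sketch in Python. The one-dimensional utility function, the “trusted region”, and all numbers are made up purely for illustration; it only shows the structure of the scenario, not any actual proposal.

```python
import numpy as np

# Region of inputs where the user's learned utility is actually meaningful
# (their "training distribution"). Outside it, the values are unreliable
# extrapolations. Both the region and the functions are purely illustrative.
TRUSTED_REGION = (-1.0, 1.0)

def partial_utility(x):
    """The user's partial utility: sensible inside the trusted region,
    an arbitrary extrapolation outside it."""
    if TRUSTED_REGION[0] <= x <= TRUSTED_REGION[1]:
        return 1.0 - x ** 2       # well-calibrated preference: likes x near 0
    return 5.0 * abs(x)           # unreliable extrapolation that grows without bound

# A "helpful" agent that naively optimizes the user's stated utility over a
# much wider range of actions than the user has ever evaluated.
candidate_actions = np.linspace(-10, 10, 2001)
best_action = max(candidate_actions, key=partial_utility)

print(best_action)  # ~ +/-10: far outside the trusted region
print(TRUSTED_REGION[0] <= best_action <= TRUSTED_REGION[1])  # False
```

The optimizer faithfully maximizes the function it was handed, but the maximizer sits far outside the region where that function reflects the user’s actual judgement, which is exactly the situation where calling the agent “aligned to the user” seems wrong.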
To steelman your position a bit, I think what might make sense is to say something like: “Today we don’t even know how to align an agent to a single user which is itself assumed to be safe. Solving this easier problem might build up our intellectual tools and frameworks which will help us solve the full alignment problem, or otherwise be good practice for solving the full problem.” If this is a reasonable restatement of your position, I think it’s important to be clear about what you’re trying to do (and what problems remain even if you succeed), so as to not give the impression that AI alignment is easier than it actually is.