This is an obviously important problem! When we put a human in the loop, we have to be confident that the human is actually aligned—or at least that they realize when their judgement is not reliable to the current situation and defer to some other fallback process or ask for additional assistance. We are definitely thinking about this problem at DeepMind, but it’s out of the scope of this paper and the technical research direction that we are proposing to pursue here. Instead, we zoom into one particular aspect, how to solve the agent alignment problem in the context of aligning a single agent to a single user, because we think it is the hardest technical aspect of the alignment problem.
I’m glad to hear that the DeepMind safety team is thinking about this problem and look forward to reading more about your thoughts on it. However, I don’t think putting human safety problems outside of “aligning a single agent to a single user” is a natural way to divide up the problem space, because there are likely ways to address human safety problems within that context. (See these two posts, which I wrote after posting my comment here.)
I would go further and say that without an understanding of the ways in which humans are unsafe, and how that should be addressed, it’s hard to even define the problem of “aligning a single agent to a single user” in a way that makes sense. To illustrate this, consider an analogy where the user is an AI that has learned a partial utility function, which gives reasonable outputs for a narrow region of inputs and a mix of “I don’t know” and random extrapolations outside of that region. If another agent tries to help this user by optimizing over this partial utility function and ends up outside the region where it gives sensible answers, is that agent aligned to the user? A naive definition of alignment that doesn’t take the user’s own lack of safety into account would answer yes, but I think that would be intuitively unacceptable to many people.
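To make the analogy concrete, here is a minimal toy sketch. Everything in it is an assumption made up for illustration: the trusted region [0, 1], the quadratic utility inside it, the 50/50 split between “I don’t know” and a random extrapolation outside it, and all function names. It is not a model of any real system, only a way to see how a helper that naively maximizes the user’s reported utility tends to land outside the region where those reports mean anything.

```python
import random
from typing import List, Optional

# Hypothetical region of inputs where the user's learned utility is sensible.
TRUSTED_LOW, TRUSTED_HIGH = 0.0, 1.0


def in_trusted_region(x: float) -> bool:
    """True if x lies where the partial utility function gives sensible answers."""
    return TRUSTED_LOW <= x <= TRUSTED_HIGH


def partial_utility(x: float, rng: random.Random) -> Optional[float]:
    """Toy stand-in for the user's partial utility function (an assumption)."""
    if in_trusted_region(x):
        # Sensible inside the region: peaks at x = 0.5 with utility 0.
        return -(x - 0.5) ** 2
    # Outside the region: a mix of "I don't know" (None) and random extrapolation.
    if rng.random() < 0.5:
        return None
    return rng.uniform(-10.0, 10.0)


def naive_optimize(candidates: List[float], rng: random.Random) -> float:
    """Pick the candidate with the highest reported utility, skipping only the
    inputs where the user explicitly answers "I don't know"."""
    scored = []
    for x in candidates:
        u = partial_utility(x, rng)
        if u is not None:
            scored.append((u, x))
    return max(scored, key=lambda pair: pair[0])[1]


# Candidate outcomes from -100.0 to 100.0; only 11 of them fall in [0, 1].
CANDIDATES = [i / 10 for i in range(-1000, 1001)]

if __name__ == "__main__":
    rng = random.Random(0)
    choice = naive_optimize(CANDIDATES, rng)
    print("naive choice:", choice, "| inside trusted region:", in_trusted_region(choice))
```

The naive optimizer faithfully maximizes what the user reports, so a definition of alignment that ignores the user’s own unreliability would count it as aligned, even though its choice is driven by noise rather than by anything the user can be said to want.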
To steelman your position a bit, I think what might make sense is to say something like: “Today we don’t even know how to align an agent to a single user which is itself assumed to be safe. Solving this easier problem might build up our intellectual tools and frameworks which will help us solve the full alignment problem, or otherwise be good practice for solving the full problem.” If this is a reasonable restatement of your position, I think it’s important to be clear about what you’re trying to do (and what problems remain even if you succeed), so as to not give the impression that AI alignment is easier than it actually is.