This works as a subtle argument for security mindset in AI control (while not being framed as such). One issue is that it might de-emphasize the AI control problems that are not analogous to practical security problems, such as detailed value elicitation (whereas in security you formulate a few general principles and then give up on further detail). That is, the concept of {AI control problems that are analogous to security problems} might be close enough to the concept of {all AI control problems} to replace it in some people’s minds.
It seems to me that failures of value learning can also be a security problem: if some gap between the AI’s values and the human values is going to cause trouble, the trouble is most likely to show up in some adversarially crafted setting.
I do agree that this is not closely analogous to security problems that cause trouble today.
I also agree that sorting out how to do value elicitation in the long-run is not really a short-term security problem, but I am also somewhat skeptical that it is a critical control problem. I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior, and a failure of this property (e.g. because the AI has a bad conception of “effective control”) is likely to be a security problem.
I think that the main important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior
This does seem sufficient to solve the immediate problem of AI risk, without compromising the potential for optimizing the world with our detailed values, provided that:
The line between the “us” that maintains control and the AI design is sufficiently blurred (via learning, uploading, prediction, etc., so as to remove the overhead of dealing with physical humans);
“Behave effectively” includes the capability to disable potentially misaligned AIs in the wild;
“Effective control” allows replacing whatever the AI is doing with something else, at any level of detail.
The advantage of introducing the concept of the AI’s detailed values in the initial design is that it protects the setup from manipulation by the AI. If we don’t do that, the control problem becomes much more complicated. In the approach you are talking about, there are initially no explicitly formulated detailed values, only instrumental skills and humans.
So it’s a tradeoff: solving the value elicitation/use problem makes AIs easier to control, but if it’s possible to control an AI anyway, the problem could initially remain unsolved. Still, I’m skeptical that an AI capable enough to prevent AI risk from other AIs can be controlled other than by giving it completely defined values (so that it learns further details by examining the fixed definition).