I think the most important thing is that our AI systems learn to behave effectively in the world while allowing us to maintain effective control over their future behavior.
This does seem sufficient to solve the immediate problem of AI risk, without compromising the potential for optimizing the world with our detailed values, provided:

- The line between “us” who maintain control and the AI design is sufficiently blurred (via learning, uploading, prediction, etc., to remove the overhead of dealing with physical humans);
- “Behave effectively” includes the capability to disable potentially misaligned AIs in the wild;
- “Effective control” allows replacing whatever the AI is doing with something else, at any level of detail.
The advantage of building the AI’s detailed values into the initial design is that it protects the setup from manipulation by the AI; if we don’t do that, the control problem becomes much more complicated. In the approach you are describing, there are initially no explicitly formulated detailed values, only instrumental skills and humans.
So it’s a tradeoff: solving the value elicitation/use problem makes AIs easier to control, but if an AI can be controlled anyway, that problem could initially remain unsolved. I’m skeptical that an AI capable enough to prevent AI risk from other AIs can be controlled by any means other than giving it completely defined values (so that it learns further details by examining the fixed definition).