Apologies if I’m being naive, but it doesn’t seem like an oracle AI[1] is logically or practically impossible, and a good oracle should be able to perform well at long-horizon tasks[2] without “wanting things” in the behaviorist sense, or bending the world in consequentialist ways.
The most obvious exception is if the oracle’s own answers are causing people to bend the world in the service of hidden behaviorist goals that the oracle has (e.g. making the world more predictable to reduce future loss), but I don’t have strong reasons to believe that this is very likely.
This is especially the case since at training time, the oracle doesn’t have any ability to bend the training dataset to fit its future goals, so I don’t see why gradient descent would find cognitive algorithms for “wanting things in the behaviorist sense.”
[1] in the sense of being superhuman at prediction for most tasks, not in the sense of being a perfect or near-perfect predictor.
[2] e.g. “Here’s the design for a fusion power plant, here’s how you acquire the relevant raw materials, here’s how you do project management, etc.” or “I predict your polio eradication strategy to have the following effects at probability p, and the following unintended side effects that you should be aware of at probability q.”
I’d be pretty scared of an oracle AI that could do novel science, and it might still want things internally. If the oracle can truly do well at designing a fusion power plant, it can anticipate obstacles and revise plans just as well as an agent can, if not better, because it’s not allowed to observe and adapt. I’d be worried that it performs cognition similar to the agent’s, but with all interactions with the environment done in some kind of efficient simulation, or something more loosely equivalent.
It’s not clear to me that this is as dangerous as having a generalized skill of routing around obstacles as an agent, but I feel like “wants in the behaviorist sense” is not quite the right property to be thinking about, because it depends on the exact interface between your AI and the world rather than on the underlying cognition.
An oracle doesn’t have to have hidden goals. But when you ask it what actions would be needed to do the long-term task, it chooses the actions that would lead to that task being completed. If you phrase the question carefully enough, maybe you can get away with it. But maybe it calculates that the best output to achieve result X is one that tricks you into rewriting it into an agent, etc.
In general, asking an oracle AI any question whose answer depends on the future real-world effects of that answer would be very dangerous.
On the other hand, I don’t think answering important questions about solving AI alignment is a task whose output necessarily needs to depend on that output’s future effects on the real world. So, in my view, an oracle could be used to solve AI alignment without killing everyone, as long as there are appropriate precautions against asking it careless questions.