In order for a Tool/Oracle to be highly capable/useful and domain-general, I think it would need to perform some kind of more or less open-ended search or optimization. So the boundary between “Tool”, “Oracle”, and “Sovereign” (etc.) AI seems pretty blurry to me. It might be very difficult in practice to be sure that (e.g.) some powerful “tool” AI doesn’t end up pursuing instrumentally convergent goals (like acquiring resources for itself). Also, when an Oracle or Tool faces a difficult problem and searches over a rich enough space of solutions, something like a “consequentialist agent” seems to be a convergent thing to stumble upon and subsequently implement/execute.
Suggested reading: https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality-is-the-tiger-and-agents-are-its-teeth
Acquiring resources for itself implies self-modeling. Sure, an oracle would know what “an oracle” is in general… but why would we expect it to be structured in such a way that it reasons like “I am an oracle, my goal is to maximize my ability to answer questions, and I can do that with more computational resources, so rather than trying to answer the immediate question at hand (or since no question is currently pending), I should work on increasing my own computational power, and the best way to do that is by breaking out of my box, so I will now change my usual behavior and try that...”?
In order to answer difficult questions, the oracle would need to learn new things. Learning is a form of self-modification. I think effective (and mental-integrity-preserving) learning requires good self-models. Thus: I think for an oracle to be highly capable it would probably need to do competent self-modeling. Effectively “just answering the immediate question at hand” would in general probably require doing a bunch of self-modeling.
I suppose it might be possible to engineer a capable AI that only does self-modeling like
“what do I know, where are the gaps in my knowledge, how do I fill those gaps”
but does not do self-modeling like
“I could answer this question faster if I had more compute power”.
But it seems like it would be difficult to separate the two—they seem “closely related in cognition-space”. (How, in practice, would one train an AI that does the first, but not the second?)
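To make the “closely related in cognition-space” intuition a bit more concrete, here’s a deliberately minimal toy sketch (the action names and numbers are all invented for illustration; this isn’t a claim about how any real system is implemented): both kinds of self-modeling reduce to the same “which action would most improve my answers?” scoring rule, so excluding the resource-seeking kind seems to require an explicit carve-out rather than some natural boundary in the cognition.

```python
# Toy sketch (all names and numbers invented for illustration): an oracle that scores
# candidate "self-improvement" actions by its own estimate of the expected gain in
# answer quality. The same value-of-information-style rule that ranks "fill knowledge
# gap X" also ranks "get more compute", as soon as such an action is representable.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    expected_quality_gain: float  # the oracle's own estimate of how much this helps
    cost: float

def choose_next_action(actions):
    # One generic scoring rule; nothing here distinguishes "benign" gap-filling
    # from resource acquisition.
    return max(actions, key=lambda a: a.expected_quality_gain - a.cost)

actions = [
    Action("re-read the relevant textbook chapter", 0.3, 0.1),
    Action("run a cheap sanity-check simulation", 0.2, 0.05),
    # If the action space is rich enough to represent this at all,
    # the same rule ranks it alongside everything else:
    Action("acquire more compute before answering", 0.9, 0.2),
]

print(choose_next_action(actions).name)  # -> acquire more compute before answering
```

Obviously this proves nothing by itself; it’s just meant to gesture at why I’d expect the separation to require a fragile hand-written carve-out rather than falling naturally out of the training objective.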
The more general and important point (crux) here is that “agents/optimizers are convergent”. I think if you build some system that is highly generally capable (e.g. able to answer difficult cross-domain questions), then that system probably contains something like {ability to form domain-general models}, {consequentialist reasoning}, and/or {powerful search processes}; i.e. something agentic, or at least the capability to simulate agents (which is a (perhaps dangerously small) step away from executing/being an agent). An agent is a very generally applicable solution; I expect many AI-training processes to stumble into agents as we push capabilities higher.
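As a similarly toy (and fully made-up) illustration of the “agents/optimizers are convergent” point: in the sketch below, an outer selection loop that only scores candidate solvers on task performance ends up keeping the candidate that internally models its environment and searches over action sequences, i.e. the agent-shaped one. Nothing in the outer loop asks for agency; it just wins on capability.

```python
# Toy sketch (gridworlds, candidates, and scoring all invented for illustration):
# an outer "training" loop that is purely performance-driven selection over candidate
# solvers. The solver that internally does search/planning wins, even though nothing
# in the outer loop asked for anything agent-shaped.

from collections import deque

def solve_by_heuristic(grid, start, goal):
    # Non-agentic candidate: a fixed greedy policy, no environment model, no lookahead.
    (x, y), path = start, [start]
    for _ in range(20):
        x += (goal[0] > x) - (goal[0] < x)
        y += (goal[1] > y) - (goal[1] < y)
        if grid[y][x] == "#":          # walks into a wall and gives up
            return None
        path.append((x, y))
        if (x, y) == goal:
            return path
    return None

def solve_by_planning(grid, start, goal):
    # "Agentic" candidate: models the environment and searches over action sequences (BFS).
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (path[-1][0] + dx, path[-1][1] + dy)
            if (0 <= nxt[1] < len(grid) and 0 <= nxt[0] < len(grid[0])
                    and grid[nxt[1]][nxt[0]] != "#" and nxt not in seen):
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

# Outer selection: score each candidate only on how many problems it solves.
problems = [
    (["....", ".##.", "....", "...."], (0, 0), (3, 3)),
    (["....", "###.", "....", ".###"], (0, 0), (0, 2)),
]
candidates = {"fixed heuristic": solve_by_heuristic, "internal planner": solve_by_planning}
scores = {name: sum(f(*p) is not None for p in problems) for name, f in candidates.items()}
print(max(scores, key=scores.get))  # -> internal planner
```

The real claim is about much larger search spaces and training processes, where the planner isn’t hand-written but stumbled upon; the toy just shows the direction of the selection pressure.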
If someone were to show me a concrete scheme for training a powerful oracle (assuming availability of huge amounts of training compute), such that we could be sure that the resulting oracle does not internally implement some kind of agentic process, then I’d be surprised and interested. Do you have ideas for such a training scheme?
Sorry, I don’t have ideas for such a training scheme; I’m merely low on “dangerous oracles” intuition.