I view this as a capability control technique, highly analogous to running a supervised learning algorithm in a setting where a reinforcement learning algorithm would be expected to perform better. Intuitively, it seems like there should be a spectrum of options between (e.g.) supervised learning and reinforcement learning that would allow one to make more fine-grained safety-performance trade-offs.
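To make the "spectrum" intuition a bit more concrete, here is a minimal sketch (not from the original post) of one way such an interpolation could look: a single mixing weight `lam` blends a supervised imitation loss against a demonstrator with a REINFORCE-style reward-maximization term, so `lam = 1` recovers pure supervised learning and `lam = 0` recovers pure RL. The toy environment, the demonstrator, and all names here (`mixed_gradient`, `lam`, etc.) are hypothetical illustrations, not anything proposed in the post.

```python
# Illustrative sketch only: interpolating between supervised (imitation)
# learning and reinforcement learning with a single mixing weight `lam`.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))  # linear softmax policy parameters


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def demonstrator_action(x):
    # A (possibly suboptimal but "safe") demonstrator the supervised term imitates.
    return int(x[0] > 0)


def reward(x, a):
    # Toy reward: the action matching the sign of the first feature pays off.
    return 1.0 if a == int(x[0] > 0) else 0.0


def mixed_gradient(x, lam):
    """Gradient estimate of lam * L_imitation + (1 - lam) * L_reinforce at state x."""
    probs = softmax(x @ theta)

    # Supervised term: cross-entropy toward the demonstrator's action.
    a_demo = demonstrator_action(x)
    grad_sl = np.outer(x, probs - np.eye(n_actions)[a_demo])

    # RL term: REINFORCE estimate of the (negated) expected-reward gradient.
    a = rng.choice(n_actions, p=probs)
    grad_rl = -reward(x, a) * np.outer(x, np.eye(n_actions)[a] - probs)

    return lam * grad_sl + (1.0 - lam) * grad_rl


lam = 0.5   # 1.0 = pure supervised/imitation, 0.0 = pure RL
lr = 0.1
for _ in range(2000):
    x = rng.normal(size=n_features)
    theta -= lr * mixed_gradient(x, lam)
```

Sweeping `lam` from 1 toward 0 would then trace out one candidate safety-performance curve: more imitation of the demonstrator at one end, more open-ended reward optimization at the other.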
I’m very optimistic about this approach of doing “capability control” by making less agent-y AI systems. If done properly, I think it could allow us to build systems that have no instrumental incentives to create subagents (although we’d still need to worry about “accidental” creation of subagents and about optimization pressures, e.g. evolutionary ones, toward their creation).
I would like to see this fleshed out as much as possible. This idea is somewhat intuitive, but it’s hard to tell if it is coherent, or how to formalize it.
P.S. Is this the same as “platonic goals”? Could you include references to previous thought on the topic?
I haven’t heard the term “platonic goals” before. There’s been plenty written on capability control, but I don’t know of anything previously written on the strategy I described in this post (although it’s entirely possible that there’s been prior writing on the topic that I’m not aware of).