One example, to add a little concreteness: suppose that the path to AGI is to scale up o1-style inference-time compute, but it requires multiple OOMs of scaling. So it no longer has a relatively-short stream of “internal” thought, it’s more like the natural-language record of an entire simulated society.
Then:
There is no hope of a human reviewing the whole thing, or any significant fraction of the whole thing. Even spot checks don’t help much, because it’s all so context-dependent.
Accurate summarization would itself be a big difficult research problem.
There’s likely some part of the simulated society explicitly thinking about intentional deception, even if the system as a whole is well aligned.
… but that’s largely irrelevant, because in the context of a big complex system like a whole society, the effects of words are very decoupled from their content. Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans’ words).
… and that’s how the proposal breaks down, for this example.
I think some background context here is that I am not proposing shoggoth+face+paraphraser as a complete alignment solution, as something which we should just build a sovereign with and then let rip. I agree that would go horribly wrong in a bunch of ways.
Instead, I’m saying that this is something we should do, now, that will significantly advance alignment science, governance, etc. Also, I do think we’ll be able to get some useful work out of S+F+P systems that we otherwise wouldn’t be able to get.
To get on the object level and engage with your example:
--It’s like the difference between a company whose internal slack messages, comms, etc. are all kept secret from the public, and a company whose messages are all visible to the public. Or replace ‘public’ with ‘regulator.’ It’s not a panacea but it helps a ton.

--My impression from being at OpenAI is that you can tell a lot about an organization’s priorities by looking at their internal messaging and such dumb metrics as ‘what do they spend their time thinking about.’ For example, ‘have they bothered to make a writeup of the costs and benefits, from an altruistic perspective, of major decision X at all?’ and ‘Now there is a writeup—was it made before, or after, the decision to do X?’

I think a similar thing would be true for this hypothetical giant bureaucracy/civilization of S+F+P CoT.
Again, not a final solution. But it buys us time and teaches us things.