One example, to add a little concreteness: suppose that the path to AGI is to scale up o1-style inference-time compute, but it requires multiple OOMs of scaling. So it no longer has a relatively-short stream of “internal” thought, it’s more like the natural-language record of an entire simulated society.
Then:
There is no hope of a human reviewing the whole thing, or any significant fraction of the whole thing. Even spot checks don’t help much, because it’s all so context-dependent.
Accurate summarization would itself be a big difficult research problem.
There’s likely some part of the simulated society explicitly thinking about intentional deception, even if the system as a whole is well aligned.
… but that’s largely irrelevant, because in the context of a big complex system like a whole society, the effects of words are very decoupled from their content. Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans’ words).
… and that’s how the proposal breaks down, for this example.
I think some background context here is that I am not proposing shoggoth+face+paraphraser as a complete alignment solution, as something which we should just build a sovereign with and then let rip. I agree that would go horribly wrong in a bunch of ways.
Instead, I’m saying that this is something we should do, now, that will significantly advance alignment science & governance etc. Also, I do think we’ll be able to get some useful work out of S+F+P systems that we otherwise wouldn’t be able to get.
To get on the object level and engage with your example:
--It’s like the difference between a company whose internal Slack messages, comms, etc. are all kept secret from the public, and a company whose messages are all visible to the public. Or replace ‘public’ with ‘regulator.’ It’s not a panacea but it helps a ton.
--My impression from being at OpenAI is that you can tell a lot about an organization’s priorities by looking at their internal messaging and such dumb metrics as ‘what do they spend their time thinking about.’ For example, ‘have they bothered to make a writeup of the costs and benefits, from an altruistic perspective, of major decision X at all?’ and ‘Now there is a writeup: was it made before, or after, the decision to do X?’
I think a similar thing would be true for this hypothetical giant bureaucracy/civilization of S+F+P CoT.
Again, not a final solution. But it buys us time and teaches us things.
Think of e.g. a charity which produces lots of internal discussion about reducing poverty, but frequently has effects entirely different from reducing poverty. The simulated society as a whole might be superintelligent, but its constituent simulated subagents are still pretty stupid (like humans), so their words decouple from effects (like humans’ words).
I think you’re implying (deliberately or not) something overly pessimistic (in this narrow point).
Your example is of an intention for something complex and a-priori-implausible to happen (intervention to reduce poverty), but the intention doesn’t actualize. But then your second sentence suggests the reverse: something complex and a-priori-implausible but superficially non-random does happen without a related intention.
If something complex and a-priori-implausible but superficially non-random happens, then I think there must have been some kind of search or optimization process leading to it. It might be at learning time or it might be at inference time. It might be searching for that exact thing, or it might be searching for something downstream of it or correlated with it. But something. And thus there’s some hope to notice whatever that process is. If it’s at learning time, then we can try to avoid the bad incentives. If it’s at inference time, then there would be an in-principle-recognizable “intention” somewhere in the system, contrary to what you wrote.
(It’s also true that dangerous things can happen that are not a-priori-implausible and thus don’t require any search or optimization process, like killing everyone by producing pollution. That still seems like a more tractable problem than if we’re up against adversarial planning, e.g. treacherous turns.)