Typical humans in typical bureaucracies do not seem at all aligned with the goals that the bureaucracy is meant to pursue.
Why would this be any different for simulated humans or for human-mimicry-based AI (which is what ~all of the problem-factorization-based alignment strategies I’ve seen are built on)?
Since you reuse one AI model for each element of the bureaucracy, the prework of establishing sophisticated coordination protocols for the bureaucracy takes a constant amount of effort, whereas in human bureaucracies it would scale linearly with the number of people. As a result, with the same budget you can establish a much more sophisticated protocol with AI than with humans.
This one I buy. Though if it’s going to be the key load-bearing piece that makes, e.g., something HCH-like work better than the corresponding existing institutions, then it really ought to play a more central role in proposals, and testing it on humans now should be a high priority. (Some of Ought’s work roughly fits that, so kudos to them, but I don’t know of anyone else doing that sort of thing.)
After a mere 100 iterations of iterated distillation and amplification, where each agent can ask 2 subquestions, you are approximating a bureaucracy of 2^100 agents, which is wildly larger than any human bureaucracy and has qualitatively different strategies available to it. It will probably be a relatively bad approximation, but the exponential scaling with linear iterations still seems pretty majorly different from human bureaucracies.
Empirically, bureaucracies’ problems do not seem to get better as they get bigger; they seem to get worse. And sure, maybe there’s a phase change if you go to exponentially bigger sizes, but “maybe there’s a phase change, and it scales totally differently than we’re used to, and this happens to be a good thing rather than a bad thing” is the sort of argument you could make about anything; we really need some other reason to think that hypothesis is worth distinguishing at all.
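(As a purely illustrative aside: here is a minimal sketch of the branching structure under discussion, where each agent may split its question into two subquestions, so depth-d amplification consults on the order of 2^d agents. The base_answer / decompose / combine functions below are hypothetical placeholders, not part of any concrete proposal.)

```python
def base_answer(question: str) -> str:
    # What a single agent (e.g. Alice-with-15-minutes) would answer unaided.
    return f"best unaided guess at: {question}"

def decompose(question: str) -> list[str]:
    # Split a question into (at most) two subquestions.
    return [f"first half of ({question})", f"second half of ({question})"]

def combine(question: str, subanswers: list[str]) -> str:
    # Assemble the subanswers into an answer to the original question.
    return f"answer to ({question}) built from {len(subanswers)} subanswers"

def amplify(question: str, depth: int) -> tuple[str, int]:
    """Answer `question` with a depth-`depth` tree of agents; also count agents consulted."""
    if depth == 0:
        return base_answer(question), 1
    agents = 1  # this node's own agent
    subanswers = []
    for sub in decompose(question):
        answer, n = amplify(sub, depth - 1)
        subanswers.append(answer)
        agents += n
    return combine(question, subanswers), agents

_, n_agents = amplify("How do we make AI go well?", depth=10)
print(n_agents)  # 2^11 - 1 = 2047 agents at depth 10; depth 100 would be on the order of 2^100
```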
I think these disanalogies are driving most of the disagreement, rather than things like “not knowing about real-world evidence” or even “failing to anticipate results in simple cases we can test today”. For example, for the relay experiment you mention, at least I personally (and probably others) did in fact anticipate these results in advance.
Kudos for correct prediction!
Continuing in the spirit of expressing my highly uncharitable intuitions, my intuitive reaction to this is “hmm, Rohin’s inner simulator seems to be working fine; maybe he’s just not actually applying it to picture what would happen in an actual bureaucracy when making changes corresponding to the proposed disanalogies”. On reflection I think there’s a strong chance you have tried picturing that, but I’m not confident, so I mention it just in case you haven’t. (In particular, disanalogy 3 seems unlikely to work in our favor when actually picturing it, and my inner sim is also moderately skeptical about disanalogy 2.)
One more disanalogy:

4. The rest of the world pays attention to large or powerful real-world bureaucracies and forces rules on them that small teams / individuals can ignore (e.g. Secret Congress, the Copenhagen interpretation of ethics, startups being able to do illegal stuff), but this presumably won’t apply to alignment approaches.

One other thing I should have mentioned: I do think the “unconscious economics” point is relevant and could end up being a major problem for problem factorization, but I don’t think we have great real-world evidence suggesting that unconscious economics by itself is enough to make teams of agents not worthwhile.
Re disanalogy 1: I’m not entirely sure I understand what your objection is here but I’ll try responding anyway.
I’m imagining that the base agent is an AI system that pursues a desired task with roughly human-level competence, not something that acts the way a whole-brain emulation in a realistic environment would act. This base agent can be trained by imitation learning, where the AI system mimics human demonstrations of the task, or by reinforcement learning on a reward model trained from human preferences, but (we hope) it is just trying to do the task and doesn’t have all the other human wants and desires. (Yes, this leaves the question of how you get that in the first place; personally I think this distillation is the “hard part”, but that seems separate from the bureaucracy point.)
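(For concreteness, here is a minimal toy sketch of the two training routes mentioned above: imitation of demonstrations versus reinforcement learning against a reward model fit to pairwise preferences. The tabular setup, the synthetic demonstration and preference data, and the greedy “RL” step are all assumptions made purely for illustration, not a description of any actual system.)

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 4

# --- Route 1: imitation learning from human demonstrations -------------------
# Hypothetical demonstrations: (state, action) pairs for the task.
demos = [(s, (s * 2) % N_ACTIONS) for s in range(N_STATES)]

# The "policy" is a table of action counts; the agent imitates the most-demonstrated action.
imitation_counts = np.zeros((N_STATES, N_ACTIONS))
for state, action in demos:
    imitation_counts[state, action] += 1

def act_imitation(state: int) -> int:
    return int(np.argmax(imitation_counts[state]))

# --- Route 2: RL against a reward model trained on human preferences ---------
# Hypothetical preference data: in each state, `preferred` was chosen over `rejected`.
preferences = [(s, (s * 2) % N_ACTIONS, (s * 2 + 1) % N_ACTIONS) for s in range(N_STATES)]

# Reward model as a table of scores, fit with a Bradley-Terry-style objective.
reward_model = np.zeros((N_STATES, N_ACTIONS))
for _ in range(200):
    for state, preferred, rejected in preferences:
        margin = reward_model[state, preferred] - reward_model[state, rejected]
        grad = 1.0 / (1.0 + np.exp(margin))  # gradient of -log(sigmoid(margin))
        reward_model[state, preferred] += 0.1 * grad
        reward_model[state, rejected] -= 0.1 * grad

def act_rl(state: int) -> int:
    # The "RL" step is collapsed to greedy optimization against the learned reward model.
    return int(np.argmax(reward_model[state]))

# Either route is hoped to yield a base agent that just does the task,
# without the rest of a human's wants and desires.
print([act_imitation(s) for s in range(N_STATES)])
print([act_rl(s) for s in range(N_STATES)])
```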
Even if you did get a bureaucracy made out of agents with human desires, it still seems like you get a lot of benefit from the fact that the agents are identical to each other, and so have less politics.
Re disanalogy 3: I agree that you have to think a small / medium / large bureaucracy of Alices-with-15-minutes will at least slightly outperform an individual / small / medium bureaucracy of Alices-with-15-minutes (respectively) before this disanalogy is actually a reason for optimism. I think that ends up coming from disanalogies 1, 2, and 4, plus some difference of opinion about real-world bureaucracies; e.g. I feel pretty good about small real-world teams beating individuals.

I mostly mention this disanalogy as a reason not to update too hard on intuitions like “Can HCH epistemically dominate Ramanujan?” and this SlateStarCodex post.
On reflection I think there’s a strong chance you have tried picturing that, but I’m not confident, so I mention it just in case you haven’t.
Yeah I have. Personally my inner sim feels pretty great about the combination of disanalogy 1 and disanalogy 2 -- it feels like a coalition of Rohins would do so much better than an individual Rohin, as long as the Rohins had time to get familiar with a protocol and evolve it to suit their needs. (Picturing some giant number of Rohins a la disanalogy 3 is a lot harder to do but when I try it mostly feels like it probably goes fine.)
I think a lot of alignment tax-imposing interventions (like requiring local work to be transparent for process-based feedback) could be analogous?

Hmm, maybe? There are a few ways this could go:

1. We give feedback to the model on its reasoning, and that feedback is bad in the same way that “the rest of the world pays attention and forces dumb rules on them” is bad.
2. “Keep your reasoning transparent” is itself a dumb rule that we force on the AI system, and it leads to terrible bureaucracy problems.
I’m unsure about (2) and mostly disagree with (1) (and I think you were mostly saying (2)).
Disagreement with (1): the disanalogy seems to rely pretty heavily on the rest of the world not paying much attention when it forces bureaucracies to follow dumb rules, whereas we will presumably pay a lot of attention to how we give process-based feedback.
Re disanalogy 1: I’m not entirely sure I understand what your objection is here but I’ll try responding anyway.
I was mostly thinking of the unconscious economics stuff.
Personally my inner sim feels pretty great about the combination of disanalogy 1 and disanalogy 2 -- it feels like a coalition of Rohins would do so much better than an individual Rohin, as long as the Rohins had time to get familiar with a protocol and evolve it to suit their needs. (Picturing some giant number of Rohins a la disanalogy 3 is a lot harder to do but when I try it mostly feels like it probably goes fine.)
I should have asked for a mental picture sooner; this is very useful to know. Thanks.
If I imagine a bunch of Johns, I think that they basically do fine, though mainly because they just don’t end up using very many Johns. I do think a small team of Johns would do way better than I do.