(Eli’s personal notes, mostly for his own understanding. Feel free to respond if you want.)
The argument for alignment isn’t that “a system made of aligned neurons is aligned.” Misalignment isn’t something that magically happens; it’s the result of specific optimization pressures in the system that create trouble.
Definitely agree that even if the agents are aligned, they can implement unaligned optimization, and then we’re back to square one. Amplification only works if we can improve capability without doing unaligned optimization.
It’s important that my argument for alignment-of-amplification goes through not doing problematic optimization. So if we combine that with a good enough solution to informed oversight and reliability (and amplification, and the induction working so far...), then we can continue to train imperfect imitations that definitely don’t do problematic optimization. They’ll mess up all over the place, and so might not be competent (another problem amplification needs to handle), but the goal is to set things up so that being a lot dumber doesn’t break alignment.
It seems like Paul thinks: “Sure, my aggregate of little agents could implement an (unaligned) algorithm that they don’t understand, but that would only happen as the result of some unaligned optimization, which shouldn’t be present at any step.”
It seems like a linchpin of Paul’s thinking is that he’s trying to…
1) initially set up the situation such that there is no component that is doing unaligned optimization (Benignity, Approval-directed agents), and
2) ensure that, at every step, there are various checks that unaligned optimization hasn’t been introduced (Informed Oversight, Techniques for Optimizing Worst-Case Performance). (I sketch this two-part structure in code below.)
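To make that two-part structure concrete, here’s a minimal Python sketch of the induction as I understand it. Every name in it (Agent, amplify, distill, oversight_approves) is a hypothetical stand-in for illustration, not Paul’s actual scheme: amplification composes calls to an agent that (by induction) does no problematic optimization, distillation trains an imperfect imitation, and an oversight check gates every step before the induction continues.

```python
# Illustrative sketch only: all names here are hypothetical stand-ins,
# not an implementation of Paul's actual proposals.

class Agent:
    """A policy assumed (inductively) to do no problematic optimization."""

    def answer(self, question: str) -> str:
        return f"answer({question})"


def amplify(agent: Agent) -> Agent:
    """Compose many calls to `agent` into a more capable composite.

    The alignment claim: composing aligned calls adds capability without
    introducing any new, unaligned optimization.
    """

    class Amplified(Agent):
        def answer(self, question: str) -> str:
            # Decompose the question, delegate to the base agent, recombine.
            parts = [agent.answer(f"{question} [part {i}]") for i in range(3)]
            return " + ".join(parts)

    return Amplified()


def distill(amplified: Agent) -> Agent:
    """Train a fast, imperfect imitation of the amplified agent.

    The imitation will mess up all over the place; the hope is that its
    mistakes look like incompetence, not like unaligned optimization.
    """

    class Imitation(Agent):
        def answer(self, question: str) -> str:
            return amplified.answer(question)  # stand-in for a learned model

    return Imitation()


def oversight_approves(agent: Agent) -> bool:
    """Stand-in for informed oversight / worst-case checks.

    This is where the hard open problems live; returning True is a placeholder.
    """
    return True


def iterate(agent: Agent, steps: int) -> Agent:
    """The induction: every step must preserve 'no problematic optimization'."""
    for _ in range(steps):
        candidate = distill(amplify(agent))
        if not oversight_approves(candidate):
            raise RuntimeError("oversight check failed at this step")
        agent = candidate
    return agent


print(iterate(Agent(), steps=2).answer("Q"))
```

The point of the structure is visible in `iterate`: capability comes only from `amplify`/`distill`, and the per-step `oversight_approves` gate is what licenses the claim that unaligned optimization is never introduced.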