I think I agree with this post? Certainly for a superintelligence that is vastly smarter than humans, I buy this argument (and in general am not optimistic about solving alignment). However, humans seem to be fairly good at keeping each other in check, without a deep understanding of what makes humans tick, even though humans often do optimize against each other. Perhaps we can maintain this situation inductively as our AI systems get more powerful, without requiring a deep understanding of what’s going on? Overall I’m pretty confused on this point.
I read Optimization Amplifies as Scott’s attempt to more explicitly articulate the core claim of Eliezer’s Security Mindset dialogues (1, 2). On this view, making software robust/secure to ordinary human optimization does demand the same kind of approach as making it robust/secure to superhuman optimization. The central disanalogy isn’t “robustness-to-humans requires X while robustness-to-superintelligence requires Y”, but rather “the costs of robustness/security failures tend to be much smaller in the human case than the superintelligence case”.
Okay, that makes sense; I agree with that. As I mentioned in the opinion, I definitely agree in the case of a superintelligence optimizing a utility function. My mindset when writing that opinion was probably that the AI systems we actually deploy seem unlikely to look like a single agent optimizing for particular preferences, for reasons I couldn't really articulate at the time. I still have this intuition, and I think I'm closer to being able to explain it now, but not in a comment.