On the one hand, I also wish Shulman would go into more detail on the “Supposing we’ve solved alignment and interpretability” part. (I still balk a bit at “in democracies” talk, but less so than I did a couple years ago.) On the other hand, I also wish you would go into more detail on the “Humans don’t benefit even if you ‘solve alignment’” part. Maybe there’s a way to meet in the middle??
My own answer to this is that humans aren’t secure, and AI will exacerbate the problem greatly by helping the offense (i.e., exploiting human vulnerabilities) a lot more than the defense. I’ve focused on philosophy (thinking that offense merely requires training AI to be persuasive, while defense seems to require solving metaphilosophy, i.e., understanding what correct philosophical reasoning consists of), but more recently realized that the attack surface is much bigger than that. For example, humans can fall in love (resulting in massive changes to one’s utility function, if you were to model a human as a utility maximizer). It will be straightforward to use RL to train AI to cause humans to fall in love with it (or its characters), but how do you train AI to help defend against that? Would most humans even want to defend against that, or care enough about it?
So even with alignment, the default outcome seems to be a civilization with massively warped or corrupted human values. My inner Carl wants to reply that aligned and honest AI advisors will warn us of this and help us fix it before it’s too late, maybe by convincing policymakers to pass regulations to prevent such “misuse” of AI? And then my reply to that would be that I don’t see how such regulations can work: policymakers won’t care enough; it seems easier to train AI to attack humans than to make such predictions in a way that is both honest/unbiased and persuasive to policymakers; and AI might not have the necessary long-horizon causal understanding to craft good regulations before it’s too late.
Another possibility is that the alignment tax is just too high, so competitive pressures erode alignment even if it’s “solved” in some sense.