After hitting “submit” I realized that “alignment failure” is upstream of this divergence of analyses.
By “alignment failure”, I mean “the thing they are optimizing for isn’t aligned with the thing they claim to be optimizing for”. It’s a bit “agnostic” on the cause of this, because the cause isn’t so clearly separable into “evil vs incompetent”. Alignment failure happens by default, and it takes active work to avoid.
Goodharting is an example. Maybe you think “Well, COVID kills people, so we want people to not get COVID, so… let’s fine people for positive COVID tests!”. Okay, sure, that might work if you have mandatory testing. If you have voluntary testing, though, that just incentivizes people to not get tested, which will probably make things worse. At this point, someone could complain that you’re aiming to make COVID *look* like it’s not a problem, not actually aiming to solve the problem. They will be right in that this is the direction your interventions are pointing, *even if you didn’t mean to and don’t like it*. In order to actually help keep people healthy and COVID-free, you have to keep your eyes on the prize and adjust your aim point as necessary. In order to aim at aiming to keep people healthy and COVID-free, you have to keep your eyes on the prize of alignment, and act to correct things when you see that your method of aiming is no longer converging on the target.
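To make the Goodhart dynamic concrete, here’s a toy sketch (entirely my own illustration; the fine amounts, testing rates, and spread multiplier are made-up assumptions, not anything from the example above) of how punishing positive tests under voluntary testing can make the *measured* number go down while the *actual* problem gets worse:

```python
# Toy model (assumed numbers): the policy target is "few positive tests",
# but the thing we actually care about is "few infections".

def outcomes(fine_for_positive_test, testing_mandatory):
    """Return (observed positive tests, actual infections) under a crude model."""
    population = 1000
    infections = int(population * 0.10)      # assumed: 10% infected to start

    if testing_mandatory:
        test_rate = 1.0                      # everyone gets tested
    else:
        # assumed: the bigger the fine, the fewer people volunteer for a test
        test_rate = max(0.0, 0.8 - 0.15 * fine_for_positive_test)

    detected = int(infections * test_rate)
    undetected = infections - detected
    # assumed: undetected cases keep circulating and seed further infections
    actual_infections = infections + undetected // 2
    return detected, actual_infections

for fine in (0, 2, 5):
    detected, actual = outcomes(fine, testing_mandatory=False)
    print(f"fine={fine}: observed positives={detected}, actual infections={actual}")
# Observed positives fall as the fine grows, while actual infections rise.
```

The particular numbers don’t matter; the point is that the intervention aims the optimization pressure at the proxy (observed positives) rather than at the goal (fewer infections).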
When it comes to things like pro-mask advertisements, it’s oversimplifying to say “It’s an honest mistake” and it’s *also* oversimplifying to say “They WANT to exercise power, not save lives” (hopefully). The question is where, *exactly*, the alignment between stated goals and effects breaks. And the way to tell is to try different interventions and see what happens.
What happens if you say “All I got from your ad was ‘eat shit’! Go to hell, you evil condescending jerk!”? Do they look genuinely surprised and say “Shoot, I’m so sorry. I definitely care about your opinion and I have no idea how I came off that way. Can you please explain so that I can see where I went wrong and make it more clear that my respect for your opinion and autonomy is genuine?”?
Do they think “Hm. This person seems to think that I’m condescending to them, and I don’t want them to think that, yet I notice that I’m not surprised. Is it true? Do I have to check my inner alignment to the goal of saving lives, and maybe humble myself somewhat?”
What if you state the case more politely? What if you go out of your way to explain it in a way that makes it easy for them to continue to see themselves as good people, while also making it unmistakable that remaining a “good person who cares about saving lives” requires running ads which don’t leak contempt? Do they change the ad, pay closer attention to how they’re coming across and how they’re feeling, and thank you for helping them out? Or do they try making up nonsense to justify things before finally admitting “Okay, I don’t actually care about people; I just like being a jerk”?
My own answer is that the contempt is likely real. It’s likely something they aren’t very aware of, but that they would be if they were motivated to look for these things. It’s likely that they are not so virtuous and committed to alignment with their stated goal of being a good person that you can rudely shove this in their face and have them fix their mistakes. If you play the part of someone being stomped on, and cast them as a stomper, they will play into the role you’ve cast them in while dismissing the idea that they’re doing it. How evil!
However, it’s also overwhelmingly likely that if you sit down with them and see them for where they’re at, and explain things in a way that makes it feel okay to be who they are and shows them *how* they can be more of who they want to see themselves as being, they’ll choose to better align themselves and be grateful for the help. If you play the part of someone who recognizes their good intent and who recognizes that there are causal reasons which are beyond them for all of their failures, and cast them in the role of someone who is virtuous enough to choose good… they’ll probably still choose to play the part you cast them in.
That’s why it’s not “Simple mistake, nothing to see here” and also not “They’re doing it on purpose, those irredeemable bastards!”. It’s kinda “accidentally on purpose”. You can’t just point at what they did on purpose and expect them to change because they did in fact “do it on purpose” (in a sense). You *can*, however, point out the accident of how they allowed their purpose to become misaligned (if you know how to do so), and expect that to work.
Aligning ourselves (and others) with good takes active work, and active re-aiming, both of object-level goals and of the meta-goals of what we’re aiming for. Framing things as either “innocent mistakes” or “purposeful actions of coherent agents” misses the important opportunity to realign and to teach alignment.
Some optimizer computed by a human brain is doing it on purpose. I agree that it seems desirable to be able to coherently and nonvacuously say that this is generally not something the person wants. I tried to lay out a principled model that distinguishes between perverse and humane optimization in Civil Law and Political Drama.