Evan Hubinger recently wrote a great FAQ on inner alignment terminology. We won’t be talking about inner/outer alignment today, but I intend for my usage of “impact alignment” to map onto his “alignment”.
This doesn’t seem true. From Evan’s post:
Alignment: An agent is aligned (with humans) if it doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic.
From your post:
Impact alignment: the AI’s actual impact is aligned with what we want. Deploying the AI actually makes good things happen.
“Bad things don’t happen” and “good things happen” seem quite different, e.g. a rock is Evan-aligned but not Alex-impact-aligned. (Personally, I prefer “aligned” to be about “good things” rather than “not bad things”, so I prefer your definition.)
Hmmm… this is a subtle distinction, and both definitions seem pretty reasonable to me. I guess I want “good things happen” to be part of capabilities (i.e. whether the model is capable of doing the things we want it to do) rather than alignment, making (impact) alignment more about not doing stuff we don’t want.
Wouldn’t outcome-based “not doing bad things” impact alignment still run into that capabilities issue? “Not doing bad things” requires serious capabilities for some goals (e.g. sparse but initially achievable goals).
In any case, you can say “I think that implementing strong capabilities + strong intent alignment is a good instrumental strategy for impact alignment”, which seems compatible with the distinction you seek?
“Bad things don’t happen” and “good things happen” seem quite different, e.g. a rock is Evan-aligned but not Alex-impact-aligned.
To rephrase: Alex(/Critch)-impact-alignment is about (strictly) increasing value, non-obstruction is about non-strict value increase, and Evan-alignment is about not taking actions we would judge to significantly decrease value (making it more similar to non-obstruction, except that it’s evaluated via our expectations about the consequences of actions).
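To make that rephrase concrete, here’s a rough schematic in my own ad-hoc notation (not anyone’s official formalization), where V(·) is how much we actually value what happens and ΔV is the change in that value:

```latex
% Rough schematic of the three notions (my notation, not official definitions; needs amsmath)
\begin{align*}
\text{Impact alignment:} \quad & V(\text{deploy the AI}) > V(\text{don't deploy}) \\
\text{Non-obstruction:} \quad & V(\text{deploy the AI}) \geq V(\text{don't deploy}) \\
\text{Evan-alignment:} \quad & \text{for every action } a \text{ the AI takes, we wouldn't judge } \mathbb{E}[\Delta V \mid a] \text{ to be significantly negative}
\end{align*}
```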
I’d also like to flag that Evan’s definition involves (hypothetical) humans evaluating the actions, while my definition involves evaluating the outcomes. Whenever we’re reasoning about non-trivial scenarios using my definition, though, it probably doesn’t matter. That’s because we would have to reason using our beliefs about the consequences of different kinds of actions.
However, the different perspectives might admit different kinds of theorems, which we could then reason with, so perhaps the difference matters after all.