I feel like I’m pretty much off of outer vs. inner alignment.
People have had a go at inner alignment, but they keep trying to influence it by putting terms into the loss function (for interpretability, modeled human feedback, or properties of the AI’s self-model), diluting the very notion that inner alignment isn’t about what’s in the loss function.
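To be concrete about the pattern I mean, here’s a minimal sketch (the names and weights are mine, purely illustrative, not anyone’s actual training setup): the auxiliary terms end up sitting right inside the training objective, which is exactly what makes calling the result “inner alignment” feel diluted.

```python
def augmented_loss(task_loss, interp_penalty, feedback_loss,
                   lam_interp=0.1, lam_feedback=1.0):
    # Base task loss plus auxiliary terms: an interpretability penalty
    # and a modeled-human-feedback term. All names and coefficients here
    # are illustrative placeholders.
    return task_loss + lam_interp * interp_penalty + lam_feedback * feedback_loss
```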
People have had a go at outer alignment too, but (if they’re named Charlie) they keep trying to point to what we want by saying that the AI should be trying to learn good moral reasoning, which means it should be modeling its own reasoning procedures and changing them to conform to human meta-preferences, diluting the notion that outer alignment is just about what we want the AI to do, not about how the AI works.