This isn’t about “inner alignment” (as catchy as the name might be); it’s just regular old alignment.
But I think you’re right. As long as the learning step “erases” the model edit in a sensible way, there won’t be an incentive for the learned model to compensate for the editing process, and I was wrong to think there would be.
So if you can identify a “customer gender detector” node in some complicated learned world-model, you can artificially set it to a middle value as a legitimate way to make the RL agent less sensitive to customer gender.
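To make that concrete, here’s a minimal sketch of what such a clamp could look like in PyTorch. Everything here is an assumption for illustration: the toy world-model, the premise that the detector is a single unit in one layer, and the names `UNIT_INDEX` and `NEUTRAL_VALUE` are stand-ins, not a validated method.

```python
import torch
import torch.nn as nn

# Hypothetical: suppose interpretability work located a "customer gender
# detector" at one unit of one hidden layer of a learned world-model.
UNIT_INDEX = 7       # assumed index of the detector unit (illustrative)
NEUTRAL_VALUE = 0.0  # the "middle value" to pin it to (illustrative)

# Toy stand-in for the learned world-model.
world_model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 16),  # pretend unit 7 of this layer tracks customer gender
)

def clamp_unit(module, inputs, output):
    """Forward hook: pin one unit of this layer's output to a fixed value."""
    output = output.clone()               # don't mutate the original in place
    output[..., UNIT_INDEX] = NEUTRAL_VALUE
    return output                         # returned tensor replaces the output

# Attach the hook to the layer holding the hypothesized detector.
handle = world_model[2].register_forward_hook(clamp_unit)

obs = torch.randn(4, 32)                  # fake batch of observations
features = world_model(obs)               # every pass now sees the clamped unit
assert torch.all(features[:, UNIT_INDEX] == NEUTRAL_VALUE)

handle.remove()                           # detach the edit to restore the model
```

The hook replaces the layer’s output on every forward pass, so anything downstream of the world-model never sees the unit’s real value, and removing the handle undoes the edit.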
I’m not sure how well this approach satisfies modern regulatory guidelines, or how useful it is for AGI alignment, for basically the same reason: models that are learned from scratch sometimes encode knowledge in a distributed way that’s hard to edit out, so you can’t be confident that clamping a single node actually removed the concept. Ultimately, your proof that a certain edit reduces gender bias is going to have to come from testing the system and checking that it behaves well, which is a kind of evidence a lot of regulators / businesspeople are skittish about.
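As a sketch of what that behavioral check might look like: flip only the gender field of otherwise-identical customer records and measure how much the agent’s decisions move. The toy policy and the record layout (a binary gender flag in feature 0) are assumptions for illustration, not anyone’s actual evaluation protocol.

```python
import torch
import torch.nn as nn

GENDER_DIM = 0  # assumed position of a binary gender flag (illustrative)

def gender_sensitivity(policy, records):
    """Mean absolute change in the policy's output under a gender swap."""
    swapped = records.clone()
    swapped[:, GENDER_DIM] = 1.0 - swapped[:, GENDER_DIM]  # flip the flag only
    with torch.no_grad():
        return (policy(records) - policy(swapped)).abs().mean().item()

policy = nn.Linear(8, 1)                  # stand-in for the edited agent
records = torch.rand(100, 8)              # fake customer records
records[:, GENDER_DIM] = (records[:, GENDER_DIM] > 0.5).float()

print(gender_sensitivity(policy, records))
```

A sensitivity near zero is the behavioral evidence you’d be asking regulators to trust; the node edit by itself proves nothing about the deployed system.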