I think this approach can be combined with self-other overlap fine-tuning (SOO FT; see Self-Other Overlap: A Neglected Approach to AI Alignment, now an ICLR submission). The difficult part of SOO is precisely isolating the representation of self and other, and I think it should be possible to use ERA to get a tighter bound on it. Note: I’m part of the SOO team.