It strikes me that the kind of self-supervision you describe is suspiciously similar to trying to incorporate meta-preferences into the outer objective via self-modeling. When the model's understanding of humans changes, its notion of what it means to be misaligned or deceptive changes too; that notion gets used to form a loss term against that kind of behavior, which in turn shapes how the model understands humans.
This analogy:
- makes me significantly more optimistic about interpretability inside the training loop.
- makes me slightly more pessimistic about meta-preferences incorporated through fine-tuning based on self-modeling.
- clarifies preference-based arguments for why getting the former to work requires satisfying more desiderata than current basic schemes do.
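
To make the circularity in the first paragraph concrete, here is a minimal toy sketch in PyTorch. Everything in it is my own illustrative assumption, not anything from the post: `human_model` stands in for the model's model of humans / of what counts as deceptive, the sigmoid "deception penalty" stands in for the loss term derived from it, and the 0.1 weighting is arbitrary.

```python
import torch
import torch.nn as nn

# Toy illustration of the loop: the penalty on "deceptive" behavior is computed
# from a trainable stand-in for the model's own model of humans, so the standard
# being enforced moves with the thing being trained. All names and numbers here
# are made up for illustration.

policy = nn.Linear(8, 2)        # stand-in for the model's behavior
human_model = nn.Linear(8, 1)   # stand-in for its model of humans / of what
                                # "misaligned or deceptive" looks like
params = list(policy.parameters()) + list(human_model.parameters())
optimizer = torch.optim.SGD(params, lr=1e-2)

for step in range(1000):
    x = torch.randn(32, 8)
    y = torch.randint(0, 2, (32,))

    task_loss = nn.functional.cross_entropy(policy(x), y)

    # "Notion of deception" read off the current human_model and used as a
    # loss term against that kind of behavior.
    deception_penalty = torch.sigmoid(human_model(x)).mean()

    loss = task_loss + 0.1 * deception_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Note the circularity: gradients from the penalty also flow into
    # human_model, so one easy way to reduce the penalty is to change what
    # the model "thinks" deception is, rather than the behavior itself.
```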