(Sorry, this comment is only tangentially relevant to MELBO / DCT per se.)
It seems like the central idea of MELBO / DCT is that ‘we can find effective steering vectors by optimizing for changes in downstream layers’.
I’m pretty interested in whether this also applies to SAEs.
i.e. instead of training SAEs to minimize current-layer reconstruction loss, train the decoder to maximize changes in downstream layers, and then train an encoder on top of that (with the decoder fixed) to minimize the change the reconstruction causes in downstream layers.
i.e. “sparse dictionary learning with MELBO / DCT vectors as (part of) the decoder”.
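To make the two-stage proposal concrete, here's a minimal sketch. Everything in it is an assumption of mine rather than anything from the MELBO / DCT posts: `downstream` is a toy stand-in for the later transformer layers, the activations are random, and the losses (downstream-change norm, downstream reconstruction MSE, L1 sparsity) are just one plausible reading of "maximize / minimize changes in downstream layers".

```python
# Hypothetical sketch of "MELBO/DCT-style decoder, then sparse encoder on top".
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_dirs, batch = 64, 32, 128

downstream = nn.Sequential(               # toy stand-in for layers l+1..L
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, d_model),
)
for p in downstream.parameters():
    p.requires_grad_(False)

acts = torch.randn(batch, d_model)        # stand-in for layer-l activations
base = downstream(acts)                   # unsteered downstream activations

# Stage 1: learn decoder directions that *maximize* downstream change
# (MELBO/DCT-flavored objective; their orthogonality constraint is omitted here).
decoder = nn.Parameter(torch.randn(n_dirs, d_model))
opt = torch.optim.Adam([decoder], lr=1e-2)
R = 4.0                                    # fixed steering-vector norm
for _ in range(200):
    dirs = R * F.normalize(decoder, dim=-1)
    steered = downstream(acts.unsqueeze(1) + dirs)          # add each direction
    change = (steered - base.unsqueeze(1)).norm(dim=-1).mean()
    opt.zero_grad()
    (-change).backward()                   # maximize downstream change
    opt.step()

# Stage 2: freeze the decoder, train a sparse encoder so the reconstruction
# (through the fixed decoder) minimizes the change it causes downstream.
decoder.requires_grad_(False)
encoder = nn.Linear(d_model, n_dirs)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
l1_coef = 1e-3
for _ in range(200):
    codes = F.relu(encoder(acts))                            # sparse coefficients
    recon = codes @ F.normalize(decoder, dim=-1)             # fixed dictionary
    downstream_loss = (downstream(recon) - base).pow(2).mean()
    loss = downstream_loss + l1_coef * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With a real model, stage 1 would patch the steered activation in at the source layer and measure activations at a chosen later layer, and stage 2 would use whatever downstream faithfulness loss you prefer (e.g. KL on the output logits).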
E2E SAEs, which train the SAE against a downstream (KL) loss rather than current-layer reconstruction, are close cousins of this idea.
Problem: MELBO / DCT force the learned steering vectors to be orthogonal. Can we relax this constraint somehow?
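One possible relaxation (my own sketch, not from the MELBO / DCT work): swap the hard orthogonality constraint for a soft penalty on the pairwise overlaps of the learned directions, e.g. the off-diagonal entries of their Gram matrix, so directions are merely encouraged to be decorrelated.

```python
# Hypothetical soft-orthogonality penalty instead of a hard constraint.
import torch
import torch.nn.functional as F

def soft_orthogonality_penalty(decoder: torch.Tensor) -> torch.Tensor:
    dirs = F.normalize(decoder, dim=-1)           # (n_dirs, d_model), unit norm
    gram = dirs @ dirs.T                          # pairwise cosine similarities
    off_diag = gram - torch.eye(gram.shape[0], device=gram.device)
    return off_diag.pow(2).mean()

# added to the stage-1 objective with some weight, e.g.
# loss = -downstream_change + ortho_coef * soft_orthogonality_penalty(decoder)
```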