On a call, I was discussing my idea for doing activation-level learning to (hopefully) provide models feedback based on their internal computations and choices:
I may have slipped into a word game… are we “training against the [interpretability] detection method” or are we “providing feedback away from one kind of algorithm and towards another”? They seem to suggest very different generalizations, even though they describe the same finetuning process. How could that be?
Some of my mentees are working on something related to this right now! (One of them brought this comment to my attention.) It's definitely very preliminary work (observing what happens in different contexts when you tune a model using extracted representations of its internals as a training signal), but as you say we don't have much empirical evidence on this yet, so I'm pretty excited by it.
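To make the setup concrete, here is a minimal sketch of what "tuning a model using extracted representations of its internals as a signal" could look like. Everything here is illustrative and assumed rather than taken from the actual experiments: the toy model, the frozen linear `probe` standing in for an interpretability-derived detector, and the weighting `lam` are all hypothetical choices.

```python
import torch
import torch.nn as nn

# Toy model with an inspectable hidden layer (hypothetical stand-in).
class TinyMLP(nn.Module):
    def __init__(self, d_in=16, d_hidden=32, d_out=2):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.head = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        h = torch.relu(self.encoder(x))  # the "internal computation" we extract
        return self.head(h), h

model = TinyMLP()

# Frozen probe standing in for a detector of some unwanted internal feature;
# any activation-level signal could be substituted here.
probe = nn.Linear(32, 1)
for p in probe.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
lam = 0.1  # weight on the activation-level feedback term (assumed hyperparameter)

x = torch.randn(64, 16)             # placeholder data
y = torch.randint(0, 2, (64,))

for step in range(100):
    logits, h = model(x)
    task_loss = task_loss_fn(logits, y)
    # Activation-level feedback: penalize hidden states the probe flags,
    # i.e. "feedback away from one kind of algorithm and towards another".
    probe_loss = torch.sigmoid(probe(h)).mean()
    loss = task_loss + lam * probe_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Whether this reads as "training against the detection method" or "providing feedback away from one kind of algorithm" is exactly the framing question above; the code is identical either way, and the open empirical question is how the resulting model generalizes.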