Thanks for writing this! I think there are a number of interesting directions here.
I think in (very roughly) increasing order of excitement:
Connections to mechanistic interpretability
I think it would be nice to have connections to mechanistic interpretability. My main concern here is just that this seems quite hard to me in general. But I could imagine some particular sub-questions here being more tractable, such as connections to ROME/MEMIT in particular.
Improving the loss function + using other consistency constraints
In general I’m interested in work that makes CCS more reliable/robust; it’s currently more of a prototype than something ready for practice. But I think some types of practical improvements seem more conceptually deep than others.
I particularly agree that L_confidence doesn’t seem like quite what we want, so I’d love to see improvements there.
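For readers less familiar with CCS: as I recall, the objective from the paper sums a consistency term and a confidence term over each contrast pair $(x_i^+, x_i^-)$, with the confidence term there mainly to rule out the degenerate solution $p \equiv 0.5$:

$$
L_{\text{CCS}}(\theta, b) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big[\underbrace{\big(p_{\theta,b}(x_i^+) - (1 - p_{\theta,b}(x_i^-))\big)^2}_{L_{\text{consistency}}} \;+\; \underbrace{\min\big(p_{\theta,b}(x_i^+),\, p_{\theta,b}(x_i^-)\big)^2}_{L_{\text{confidence}}}\Big]
$$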
I’m definitely interested in extensions to more consistency properties, though I’m not sure if conjunctions/disjunctions alone lets you avoid degenerate solutions without L_confidence. (EDIT: never mind, I now think this has a reasonable chance of working.)
Perhaps more importantly, I worry that it might be a bit too difficult in practice right now to make effective use of conjunctions and disjunctions in current models – I think they might be too bad at conjunctions/disjunctions, in the sense that a linear probe wouldn’t get high accuracy (at least with current open source models). But I think someone should definitely try this.
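To make the kind of thing I have in mind a bit more concrete, here is a minimal PyTorch-style sketch of one possible extra consistency term for conjunctions. The variable names (p_a, p_b, p_ab) and the specific penalties are my own illustration, not the post's proposal; it just shows one way a conjunction constraint could be operationalized, e.g. penalizing deviations from p(A∧B) ≈ p(A)·p(B) plus violations of the hard bound p(A∧B) ≤ min(p(A), p(B)):

```python
import torch

def conjunction_consistency_loss(p_a: torch.Tensor,
                                 p_b: torch.Tensor,
                                 p_ab: torch.Tensor) -> torch.Tensor:
    """One possible consistency penalty for conjunctions (illustrative only).

    p_a, p_b : probe outputs in [0, 1] for statements A and B
    p_ab     : probe output in [0, 1] for the statement "A and B"
    """
    # Soft target: treat p(A)*p(B) as a rough (independence-like) target for p(A and B).
    soft_target = (p_ab - p_a * p_b) ** 2
    # Hard logical bound: p(A and B) should never exceed min(p(A), p(B)).
    bound_violation = torch.relu(p_ab - torch.minimum(p_a, p_b)) ** 2
    return (soft_target + bound_violation).mean()
```

Whether terms like this alone rule out degenerate solutions without L_confidence is exactly the open question above.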
Understanding simulated agents
I’m very excited to see work on understanding how this type of method works when applied to models that are simulating other agents/perspectives.
Generalizing to other concepts
I found the connection to ramsification + natural abstractions interesting, and I’m very interested in the idea of thinking about how you can generalize this to searching for other concepts (other than truth) in an unsupervised + non-mechanistic way.
I’m excited to see where this work goes!