Less tongue-in-cheek: certainly it’s unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.
I mean, it’s mostly semantics, but I think of mechanistic interpretability as “inner” but not alignment, and I think it’s clearer that way, personally, so that we don’t call everything alignment. Observing properties doesn’t automatically get you good properties. I’ll read your link, but it’s a bit too much to wade into for me at the moment.
Either way, it’s easy to restate my question: Is mechanistic interpretability work the only inner alignment work Anthropic is doing?
Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
Developing model organisms of deceptive alignment, which we may use to study the risk factors for deceptive alignment
Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
Studying the extent to which models exhibit likely prerequisites of deceptive alignment, such as situational awareness (a very preliminary exploration is in Sec. 5 of our paper on model-written evaluations)
Investigating the extent to which externalized reasoning (e.g. chain of thought) is a way to gain transparency into a model’s process for solving a task (a rough sketch of the basic idea follows this list)
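To make the externalized-reasoning item above concrete, here is a minimal, purely illustrative sketch of chain-of-thought prompting, where the model is asked to write out intermediate steps that can then be inspected. `query_model` and `answer_with_externalized_reasoning` are hypothetical names I'm using for illustration, not any particular API, and this is not a description of Anthropic's actual experiments.

```python
# Illustrative sketch only: `query_model` is a hypothetical stand-in for any
# language-model API call; nothing here reflects a specific lab's setup.

def query_model(prompt: str) -> str:
    """Placeholder for a call to a language model; returns the model's text output."""
    raise NotImplementedError("Wire this up to whichever model API you use.")

def answer_with_externalized_reasoning(question: str) -> tuple[str, str]:
    """Ask the model to write out its reasoning before its final answer,
    so the intermediate steps are available for inspection."""
    prompt = (
        f"Question: {question}\n"
        "Think step by step, writing out your reasoning, "
        "then give your final answer on a line starting with 'Answer:'."
    )
    completion = query_model(prompt)
    # Split the visible reasoning trace from the final answer (best-effort parse).
    reasoning, _, answer = completion.partition("Answer:")
    return reasoning.strip(), answer.strip()
```

The returned `reasoning` string is the externalized trace one could then audit, e.g. checking whether the stated steps actually support the final answer; the open research question is how faithfully such traces reflect the model's actual process.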
There’s also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant/helpful for inner alignment. It’s pretty unclear to me how useful any of these directions will turn out to be for inner alignment in the end, but we’ve chosen these directions in large part because we’re very concerned about inner alignment, and we’re actively looking for new directions that seem useful for mitigating inner misalignment risks.
Thanks for the links and explanation, Ethan.