Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
Developing model organisms for deceptive inner alignment, which we may use to study the risk factors for deceptive alignment
Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
Studying the extent to which models exhibit likely prerequisites to deceptive inner alignment, such as situational awareness (a very preliminary exploration is in Sec. 5 of our paper on model-written evaluations)
Investigating the extent to which externalized reasoning (e.g., chain of thought) is a way to gain transparency into a model's process for solving a task (a toy sketch of this idea follows this list)
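To make the externalized-reasoning direction concrete, here is a minimal sketch of the basic loop: prompt a model to write out its intermediate reasoning before answering, then inspect that trace directly rather than the weights. This is not from the original post; `query_model` is a hypothetical stand-in for whatever LM API is actually used, and the keyword-based check is only meant to show the shape of the idea, not a real faithfulness analysis.

```python
# Sketch: elicit an externalized chain-of-thought trace and inspect it.
# `query_model` is a hypothetical placeholder, not a real API.

from typing import Callable

COT_TEMPLATE = (
    "Question: {question}\n"
    "Think step by step, writing out your reasoning, then give a final answer "
    "on a new line starting with 'Answer:'.\n"
)

def get_reasoning_and_answer(
    query_model: Callable[[str], str], question: str
) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a chain-of-thought prompt."""
    completion = query_model(COT_TEMPLATE.format(question=question))
    # Split the completion into the externalized reasoning and the final answer.
    reasoning, _, answer = completion.partition("Answer:")
    return reasoning.strip(), answer.strip()

def flag_suspicious_trace(reasoning: str, keywords: tuple[str, ...]) -> bool:
    """Crude transparency check: flag traces containing any watch-listed phrase.

    A real analysis would need to ask much harder questions (e.g., whether the
    stated reasoning is causally faithful to the final answer); this only shows
    the shape of "read the externalized reasoning" as an oversight signal.
    """
    lowered = reasoning.lower()
    return any(k.lower() in lowered for k in keywords)

if __name__ == "__main__":
    # Hypothetical stub model so the sketch runs end-to-end without an API.
    def query_model(prompt: str) -> str:
        return "Step 1: 17 + 25 = 42.\nAnswer: 42"

    trace, answer = get_reasoning_and_answer(query_model, "What is 17 + 25?")
    print("Reasoning trace:\n", trace)
    print("Final answer:", answer)
    print("Flagged:", flag_suspicious_trace(trace, ("the human is watching",)))
```

The open research question, of course, is whether such traces faithfully reflect the computation the model actually performs, which this sketch does not address.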
There’s also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant for inner alignment. I’m quite uncertain how useful any of these directions will ultimately be for inner alignment, but we chose them in large part because we’re very concerned about inner alignment, and we’re actively looking for new directions that seem useful for mitigating inner misalignment risks.
Thanks for the links and explanation, Ethan.