Acceptability Verification: A Research Agenda

Link post

This Google doc is a halted, formerly work-in-progress writeup of Evan Hubinger’s AI alignment research agenda, written by Evan himself. It dates back to around 2020, so Evan’s views on alignment have shifted since then.
Nevertheless, we thought it would be valuable to get this posted and available to everyone working in alignment!
In it, Evan outlines the following alignment scheme:
We should bake transparency tools into the loss function we’re training powerful models on, grading the model on its internal cognitive processes as well as on its external behavior. We start by initializing a relatively dumb but non-deceptive model, then scale it up, selecting against any model that isn’t demonstrably acceptable to a transparency-tool-assisted overseer.
While Evan doesn’t expect this approach to be robust against deceptively aligned models, the hope is that we can define a notion of an ‘acceptability predicate’ such that, if we start with a dumb aligned model and scale up from there, grading on cognitive processes as well as behavior, no model on that trajectory through model space will ever become deceptive in the first place. That is, before a model can be updated into a deceptive model by this training process, it hopefully first has to be updated into a model that is unacceptable but not yet deceptive. We can therefore update away from all merely unacceptable models as they appear, and thereby never instantiate a deceptive model at all.
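As a very rough illustration of the shape of this scheme (a hypothetical sketch, not anything from Evan’s doc: `transparency_tools`, `overseer`, and the other names here are all made up for the example), the training loop might look something like:

```python
# Hypothetical sketch: grade the model on its internal cognition as well as
# its external behavior, and select against any model that isn't demonstrably
# acceptable to a transparency-tool-assisted overseer.

def acceptability_penalty(model, transparency_tools, overseer):
    """Return 0.0 if the model's cognition is demonstrably acceptable to the
    overseer, and a positive penalty otherwise."""
    report = transparency_tools.inspect(model)   # surface the model's internals
    return overseer.unacceptability(report)      # overseer grades the internals

def train(model, batches, transparency_tools, overseer, optimizer,
          penalty_weight=10.0):
    """Train on the task while also grading internal cognitive processes."""
    for batch in batches:
        behavioral_loss = model.loss(batch)                    # external behavior
        internal_penalty = acceptability_penalty(
            model, transparency_tools, overseer)               # internal cognition
        # Schematic update on the combined objective (hypothetical API).
        optimizer.step(model, behavioral_loss + penalty_weight * internal_penalty)

        # Select against any model that isn't demonstrably acceptable: never
        # keep a checkpoint the overseer can't verify as acceptable.
        if internal_penalty > 0.0:
            model.revert_to_last_acceptable_checkpoint()
    return model
```

The point of the sketch is just that the overseer’s verdict enters training directly, so selection pressure is applied to the model’s cognition and not only to its outputs.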
At the time of this doc’s writing, the leading candidate for an adequate acceptability predicate was ‘demonstrably myopic.’ One plausible account of ‘myopia’ here is “return the action that your model of HCH would return, if it received your inputs.”
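Concretely (again a hypothetical sketch rather than anything from the doc), a policy that is myopic in this sense does nothing beyond consulting its internal model of HCH:

```python
def myopic_policy(inputs, hch_model):
    """Return the action the model's internal model of HCH would return on
    these inputs, with no planning over downstream consequences of its own."""
    return hch_model.predicted_action(inputs)
```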
Since writing up this agenda, some things that Evan has updated on include:
Understanding myopia carefully is less important than simply improving our transparency and interpretability capabilities.
Scaling interpretability via training transparency will require us to go through a bunch of independent transparency tech levels along the way.
Charting paths through model space is not necessarily a good way to think about ML inductive biases.
Training stories are the right way to think about alignment generally.
The best way to think about argmax HCH is to think of it as a human imitator in the ELK sense, but with the human replaced by HCH.
Large language models currently have the correct notion of myopia.