I have been thinking along very similar lines, with the idea of creating a synthetic dataset showing many examples of a corrigible agent (see Max Harms’ CAST series). Seth Herd also mentions this, but I wanted to redundantly chime in.
Corrigibility seems particularly promising because I think it likely forms an attractor basin: once a model is close enough, using the imperfectly corrigible agent to produce better training data consistently moves it toward greater corrigibility.
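To make the attractor-basin dynamic concrete, here is a toy simulation. This is a minimal sketch under stated assumptions, not a real training proposal, and every name and number in it is illustrative. The key assumption is that the oversight filter catches most non-corrigible outputs, so the filtered dataset is purer than the model's raw behavior, and each round of fine-tuning on that dataset moves the model upward.

```python
import random

def bootstrap(p: float, filter_fpr: float = 0.05, rounds: int = 8,
              n: int = 20_000, lr: float = 0.5) -> float:
    """Toy model of the corrigibility attractor basin.

    p          : probability the current model emits a corrigible transcript
    filter_fpr : fraction of non-corrigible transcripts that the
                 (imperfect) human-oversight filter wrongly lets through
    lr         : how far fine-tuning moves the model toward its dataset

    Each round: sample transcripts from the model, filter them, measure
    the kept set's purity, and move p part way toward that purity.
    Provided the filter is better than chance, purity exceeds p, so the
    fixed point of the update is p = 1 (full corrigibility).
    """
    for r in range(rounds):
        good = sum(random.random() < p for _ in range(n))      # corrigible samples
        slipped = sum(random.random() < filter_fpr
                      for _ in range(n - good))                # bad ones the filter missed
        purity = good / max(good + slipped, 1)                 # fraction of kept set that is corrigible
        p += lr * (purity - p)                                 # "fine-tune" on the filtered set
        print(f"round {r}: dataset purity {purity:.3f} -> corrigibility {p:.3f}")
    return p

bootstrap(p=0.6)  # starting 'close enough', p climbs toward ~1.0
```

Note the toy is optimistic: it assumes the filter's error rate stays fixed no matter how far the model is from corrigible, and that assumption is exactly where "getting close enough" matters in practice.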
In fact, I specifically would like Anthropic to hire me to tackle this project of creating a corrigibility dataset and then training or at least fine-tuning on it. The part I’m trying to figure out now is how to sell Anthropic on the idea.
I hadn’t yet got around to reading the CAST series: now I have to! :-)
Some of the authors of the Pretraining Language Models with Human Preferences paper now work at Anthropic. I would also love for Anthropic to hire me to work on this stuff!
In some sense, the human input and oversight in AI-assisted alignment are the same thing as corrigibility.