This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone working on conceptual alignment, to read this and think about it deeply.
What this gives us is a way of combining the outputs of many disparate epistemic strategies into well-structured and directly relevant knowledge about alignment and how our proposals would fare. This is great, because we can now take many different methods of investigation (theoretical arguments, philosophical approaches, empirical studies of analogous systems and problems) and tie them to a common narrative (pun intended) about alignment.
Of course, we should expect that some things we want to learn about won't fit neatly into it, but training stories are still surprisingly inclusive. For example, we might expect that reasoning about potential problems with AGI, in the very conceptual/philosophical/theoretical way we favor on the AF, wouldn't fit a framework focused on justifying a given approach. Yet training stories also include the probing of their own rationale, and finding a new problem or issue allows new probing and refinement, much like the very theoretical computer-science model presented by Paul in his research methodology post.
There is one thing, though, that this post doesn't get into: exactly which epistemic strategies we can and should use to argue for each part of a training story, and to break and falsify each one. Still, I find that having a framing for combining and linking the outputs of existing and new epistemic strategies is already quite an accomplishment. Plus, it leaves me some work to do on clarifying and distilling the epistemic strategies of alignment.
Last but not least, I really like the name “story” for two reasons:
First, it actually captures what most of this reasoning feels like. These are not so much theories as narratives, and using the word “story” makes that clear and explicit.
But more importantly, “story” makes technical people feel uncomfortable. We immediately fear dubious justifications and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that most of our knowledge will take that form. So the word reminds us daily not to get too comfortable with our ideas and intuitions, as we always risk falling for our own inventions.
This is probably one of the most important posts on alignment on this forum. Seriously. I want everyone thinking about conceptual alignment, and everyone working on conceptual alignment, to read this and think about it deeply.
Glad you think so! I definitely agree and am planning on using this framework in my own research going forward.
“story” makes technical people feel uncomfortable. We immediately fear dubious justifications and biases towards believing interesting stories. And we should be wary of this when working on alignment, while acknowledging that most of our knowledge will take that form. So the word reminds us daily not to get too comfortable with our ideas and intuitions, as we always risk falling for our own inventions.
Yep, this is definitely intentional. I think in many ways just thinking about inner alignment as avoiding proxy-aligned mesa-optimizers can give you false confidence in your training story, because you reason “of course I won't get that specific failure mode.” But the problem is that you need to couple some reason that you won't get the wrong thing with a strong reason that you actually will get the right thing in order to really be confident in your training process's safety.