Building weak AI systems that help improve alignment seems extremely important to me and is a significant part of my optimism about AI alignment. I also think it’s a major reason that my work may turn out not to be relevant in the long term.
I still think there are tons of ways that delegating alignment can fail, such that it matters that we do alignment research in advance:
AI systems could have a comparative disadvantage at alignment relative to causing trouble, so that they become catastrophically risky before they can solve alignment. Or more realistically, they may be worse at some parts of alignment, such that human efforts continue to be relevant even if most of the work is being done by AI.
It could be very hard to oversee work on alignment because it’s hard to tell what constitutes progress. Being really good at alignment ourselves increases the probability that we know what to look for, so that we can train AI systems to be good alignment researchers. More broadly, doing a bunch of alignment ourselves increases the chances that we know how to automate it.
There may be long serial dependencies, where it’s hard to scale up work in alignment too rapidly and early work can facilitate larger investments down the line. Or more generally there could be prep work that’s important to do in advance that we can recognize and then do. This is closely related to the last two points.
It may be that alignment is just not solvable, and by recognizing that further in advance (and understanding the nature of the difficulty) we can have a better response (like actually avoiding dangerous forms of AI altogether, or starting to invest more seriously in various hail-mary plans).
Overall I think that “make sure we are able to get good alignment research out of early AI systems” is comparably important to “do alignment ourselves.” Realistically I think the best case for “do alignment ourselves” is that if “do alignment” is the most important task to automate, then just working a ton on alignment is a great way to automate it. But that still means you should be investing quite a significant fraction of your time in automating alignment.
I also basically buy that language models are now good enough that “use them to help with alignment” can be taken seriously and it’s good to be attacking it directly.
What do you (or others) think is the most promising, soon-possible way to use language models to help with alignment? A few possible ideas:
Using LMs to help with alignment theory (e.g., alignment forum posts, ELK proposals, etc.)
Using LMs to run experiments (e.g., writing code, launching experiments, analyzing experiments, and repeat)
Using LMs as research assistants (what Ought is doing with Elicit)
Something else?