“The AI does our alignment homework” doesn’t seem so bad. I don’t have much hope for it, but that’s because it’s a prosaic alignment scheme, so someone trying to implement it can’t constrain where Murphy shows up, not because it’s an “incoherent path description”.
A concrete way this might be implemented is:
A language model is trained on a giant text corpus to learn a bunch of adaptations that make it good at math, and then fine-tuned for honesty. It’s still being trained at a safe and low level of intelligence where honesty can be checked, so this gets a policy that produces things that are mostly honest on easy questions and sometimes wrong and sometimes gibberish and never superhumanly deceptive.[1]
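(A minimal sketch of what the “fine-tuned for honesty” step could look like, assuming plain supervised fine-tuning on a dataset of answers that humans have actually checked; the post doesn’t commit to a method, and the base model "gpt2" and the file honesty_checked_qa.jsonl below are hypothetical stand-ins:)

```python
# Minimal sketch: supervised fine-tuning on human-checked honest answers.
# Base model and dataset file are hypothetical stand-ins, not from the post.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical dataset of (question, human-checked answer) pairs.
data = load_dataset("json", data_files="honesty_checked_qa.jsonl")["train"]

def tokenize(example):
    # Concatenate the question and the checked answer into one training text.
    text = example["question"] + "\n" + example["checked_answer"]
    return tokenizer(text, truncation=True, max_length=512)

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="honesty-sft",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=data,
    # mlm=False sets up ordinary next-token-prediction labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

RLHF or any other honesty-shaped objective could slot in here instead; the load-bearing part is only that the honesty training happens while the model is still dumb enough for humans to check its answers.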
It’s set to work producing conceptually crisp pieces of alignment math, things like expected utility theory or logical inductors, slowly, on inspectable scratchpads and so on, with the dumbest model that can actually factor scientific research[1], and with human research assistants to hold its hand if that lets you make the model dumber. It does this, rather than engineering, because this kind of crisp alignment math is fairly uniquely pinned down, so it can be verified, and it’s easier to generate than any strong pivotal engineering task, where you’re competing against humans on their own ground and so need to be smarter than humans; so while it’s operating in a more dangerous domain, it’s using a safer level of intelligence.[1]
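(To gesture at what “conceptually crisp” and “uniquely pinned down” mean here, my gloss rather than the post’s: expected utility theory compresses to a single checkable statement, that any preference ordering over lotteries satisfying the VNM axioms is represented by

$$L_A \succeq L_B \iff \sum_{o} p_A(o)\,U(o) \;\ge\; \sum_{o} p_B(o)\,U(o),$$

with $U$ unique up to positive affine transformation. An artifact of that shape can be verified line by line, in a way that a big pile of engineering output can’t.)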
The human programmers then use this alignment math to make a corrigible thingy that has dangerous levels of intelligence, does difficult engineering, and doesn’t know about humans, this time while knowing what they’re doing. Getting the crisp alignment math from parallelisable language models helps a lot and gives them a large lead time, because a lot of it is the alignment version of backprop: the kind of thing that would otherwise have taken a surprisingly long time to discover.
This all happens at safe-ish low-ish levels of intelligence (such a model would probably be able to autonomously self-replicate on the internet, but probably not reverse protein folding, which means that all the ways it could be dangerous are “well don’t do that”s as long as you keep the code secret[1]), with the actual dangerous levels of optimisation being done by something made by the humans using pieces of alignment math which are constrained down to a tiny number of possibilities.
EDIT 2023-07-25: A longer debate between Holden Karnofsky (pro) and Nate Soares (against), about the model that leads to calling this an incoherent path description, is here; I think it’s worth reading, though I hadn’t read it as of writing this.
[1] Unless it isn’t; it’s a giant pile of tensors, how would you know? But this isn’t special to this use case.