make the AI produce AI safety ideas that not only solve alignment, but also yield some aspect of capabilities growth along an axis that the big players care about, and in a way where the capabilities are not easily separable from the alignment.
So firstly, in this world capability is bottlenecked by chips. There isn’t a runaway process of software improvements happening yet, which means there probably aren’t large, easy software-side capability improvements lying around.
Now “making capability improvements that are actively tied to alignment somehow” sounds harder than making any capability improvement at all. And you don’t have as much compute as the big players. So you probably don’t find much.
What kind of AI research would make it hard to create a misaligned AI anyway?
A new, more efficient matrix multiplication algorithm that only works when it’s part of a CEV-maximizing AI?
The big players do care about having instruction-following AIs,
Likely somewhat true.
and if the way to do that is to use the AI safety book, they will use it.
Perhaps. Don’t underestimate sheer incompetence. Someone pressing the run button to test that the code works so far, before they’ve programmed the alignment bit. Someone copying and pasting in an alignment function but forgetting to actually call it anywhere. A misspelled variable name that happens to be another real variable. Nothing is idiot-proof.
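To make those last two slips concrete, here is a minimal hypothetical sketch (every name in it is invented for illustration, not taken from any real codebase): in the first function the pasted-in alignment penalty is defined but never called, and in the second a misspelled variable name happens to resolve to another real variable, so the code runs without error and silently drops the safety term.

    def alignment_penalty(outputs):
        """The 'alignment function' pasted in from the safety book."""
        return sum(abs(o) for o in outputs)  # stand-in for a real penalty

    def training_step_slip_1(outputs, base_loss):
        # Slip 1: alignment_penalty is defined above but never called.
        # This line was meant to be base_loss + alignment_penalty(outputs).
        return base_loss

    def training_step_slip_2(outputs, base_loss):
        safety_weight = 1.0   # the intended coefficient on the penalty
        safety_wieght = 0.0   # a stale debug variable with a similar name
        # Slip 2: the misspelled name resolves to the *other* real
        # variable, so nothing crashes; the penalty is multiplied by zero.
        return base_loss + safety_wieght * alignment_penalty(outputs)

    # Both run without error and both return 3.0: the safety term
    # contributed nothing, and nothing complained.
    print(training_step_slip_1([0.2, -0.5, 1.1], 3.0))
    print(training_step_slip_2([0.2, -0.5, 1.1], 3.0))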
I mean, presumably alignment is fairly complicated, and it could all go badly wrong because of the equivalent of one malfunctioning O-ring. Or what if someone finds a much more efficient approach that’s harder to align?