The hope discussed in this post is that you could have a system that is aligned but not superintelligent (more like human-level-ish, and aligned in the sense that it is imitation-ish), doing the kind of alignment work humans are doing today, which could hopefully lead to a more scalable alignment approach that works on more capable systems.
But then would a less intelligent being (i.e. the collective of human alignment researchers and the less powerful AI systems they use as tools in their research) be capable of validly examining a more intelligent being, without being deceived by it?
It seems like the same question would apply to humans trying to solve the alignment problem—does that seem right? My answer to your question is “maybe”, but it seems good to get on the same page about whether “humans trying to solve alignment” and “specialized human-ish safe AIs trying to solve alignment” are basically the same challenge.