For any confidence that an AI system A will do a good job of its assigned duty of maximizing alignment in AI system B, wouldn’t you need to be convinced that AI system A is itself well aligned with that assignment? In other words, doesn’t that presuppose you have already solved the problem you are trying to solve?
And if you have not—aren’t you just priming yourself for manipulation by smarter beings?
There might be good reasons why we don’t ask the fox about the best ways to keep the fox out of the henhouse, even though the fox is very smart, and might well actually know what those would be, if it cared to tell us.
The hope discussed in this post is that you could have a system that is aligned but not superintelligent (more like human-level-ish, and aligned in the sense that it is imitation-ish), doing the kind of alignment work humans are doing today, which could hopefully lead to a more scalable alignment approach that works on more capable systems.
But then would a less intelligent being (i.e., the collective of human alignment researchers and the less powerful AI systems they use as tools in their research) be capable of validly examining a more intelligent being without being deceived by it?
It seems like the same question would apply to humans trying to solve the alignment problem—does that seem right? My answer to your question is “maybe”, but it seems good to get on the same page about whether “humans trying to solve alignment” and “specialized human-ish safe AIs trying to solve alignment” are basically the same challenge.