The other problem with checking the code is that an FAI’s Friendliness content is also going to consist significantly or mostly of things the FAI has learned, stored in its own cognitive representation. Keeping these cognitive representations transparent is going to be an important issue, but basically you’d have to trust that the tool (and possibly AI-assisted skill) which somebody told you translates the cognitive content really does so, and that the AI is answering questions honestly.
The main reason this isn’t completely hopeless for external assurance (by a trusted party, i.e., one trusted not to destroy the world or start a competing project using gleaned insights) is that the FAI team can be expected to spend effort on maintaining their own assurance of Friendliness, and their own ability to confirm that the goal-system content is transparent. Still, we’re not talking about anything nearly as easy as checking the code to see if the EVIL variable is set to 1.
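A minimal sketch, purely illustrative and not drawn from any actual FAI design, of the asymmetry this points at: an explicit flag in source code can be audited by reading a single line, whereas learned goal content can only be audited through an interpretation tool whose honesty is itself a trust assumption. Every name here (EVIL, audit_explicit_flag, interpretation_tool, learned_goal_parameters) is hypothetical.

```python
import random

# The "easy" case: the goal content is written down explicitly in the code,
# so an auditor can check it by reading one line.
EVIL = 0  # hypothetical explicit flag

def audit_explicit_flag() -> bool:
    """Trivial assurance: just read the source and check the value."""
    return EVIL == 0

# The hard case: the goal content lives in learned, opaque parameters.
# Any audit has to go through a translation tool, and the auditor must
# trust that the tool reports the content faithfully.
learned_goal_parameters = [random.random() for _ in range(10_000)]  # stand-in for learned content

def interpretation_tool(params: list[float]) -> str:
    """Hypothetical translator from cognitive representation to a human-readable claim.
    The auditor cannot check this mapping directly; its faithfulness is itself assumed."""
    return "goal content appears Friendly"  # all the auditor ever sees is this claim

if __name__ == "__main__":
    print("explicit-flag audit passes:", audit_explicit_flag())
    print("learned-content audit reports:", interpretation_tool(learned_goal_parameters))
```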