Yup, I’d lean towards this. If you have a powerful predictor of a bunch of rich, detailed sense data, then in order to “ask it questions,” you need to be able to forge what that sense data would look like if the thing you want to ask about were true. This is hard, it gets harder the more complete the AI’s view of the world is, and if you screw up you can get useless or malign answers without it being obvious.
It might still be easier than the ordinary alignment problem, but you also have to ask yourself about dual use. If this powerful AI makes solving alignment a little easier but makes destroying the world a lot easier, that’s bad.