My best guess is that we can’t prompt it to instantiate the right simulacra reliably. How hard this is depends on how the model is initialised: with pure text it’s far easier, but fabricating an entire consistent history is borderline impossible, especially when the predictor is a superintelligence. It would amount to tricking the model into predicting the universe conditional on an intelligent AI aligned with our values having come into existence, and it would probably realise that the history we fed it was far more consistent with the hypothesis that the whole thing was an elaborate trick.
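To make the inference concrete, here’s a rough Bayesian sketch (the notation is mine; the predictor obviously doesn’t compute anything this explicit). Weighing a supplied history $h$ against the two hypotheses, the posterior odds are

$$
\frac{P(\text{trick} \mid h)}{P(\text{real} \mid h)} = \frac{P(h \mid \text{trick})}{P(h \mid \text{real})} \cdot \frac{P(\text{trick})}{P(\text{real})}.
$$

Any inconsistency in $h$ drives $P(h \mid \text{real})$ toward zero while barely denting $P(h \mid \text{trick})$, so even a small prior on “someone is forging my inputs” ends up dominating the posterior.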
Yup, I’d lean towards this. If you have a powerful predictor of a bunch of rich, detailed sense data, then in order to “ask it questions” you need to be able to forge what that sense data would look like if the thing you want to ask about were true. This is hard; it gets harder the more complete the AI’s view of the world is, and if you screw up, you can get useless or malign answers without it being obvious.
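A toy numerical sketch of that failure mode (the probabilities, the two-hypothesis framing, and all the names here are invented for illustration; a real predictor does nothing this explicit):

```python
import math

def log_likelihood(observations, consistency):
    """Log-probability of a stream of yes/no consistency checks under a
    hypothesis that predicts each detail coheres with probability `consistency`."""
    return sum(math.log(consistency if ok else 1.0 - consistency)
               for ok in observations)

def posterior_trick(observations, prior_trick=0.01):
    # Under "real world", incoherent details are near-impossible; under
    # "someone is forging my inputs", a forger papers over most details but
    # glitches are far less surprising.
    ll_real = log_likelihood(observations, consistency=0.999)
    ll_trick = log_likelihood(observations, consistency=0.94)
    log_odds = (ll_trick - ll_real) + math.log(prior_trick / (1 - prior_trick))
    return 1.0 / (1.0 + math.exp(-log_odds))

# A forged history where 3 of 50 details fail to cohere (True = consistent):
forged = [True] * 47 + [False] * 3
print(f"P(trick | forged history)   ~= {posterior_trick(forged):.3f}")    # ~0.99

flawless = [True] * 50
print(f"P(trick | flawless history) ~= {posterior_trick(flawless):.3f}")  # ~0.00
```

The point of the toy: a handful of small inconsistencies, each individually innocuous, is enough to flip the posterior almost entirely onto “this is fake,” and nothing in the model’s output has to flag that this happened.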
It might still be easier than the ordinary alignment problem, but you also have to weigh the dual-use risk: if this powerful AI makes solving alignment a little easier while making destroying the world a lot easier, that’s a bad trade.