Suppose we train a model on the sum of all human data, using every sensory modality ordered by timestamp, like a vastly more competent GPT (for the sake of argument, assume that a competent actor with the right incentives is training such a model). Such a predictive model would build an abstract world model of human concepts, values, ethics, etc., and could predict how various entities would act based on that generalised world model. It would also “understand” almost all human-level abstractions about how fictional characters might act, just as GPT does. My question is: if we used such a model to predict how an AGI aligned with our CEV would act, in what way could it be misaligned? What failure modes exist for purely predictive systems without a reward function that can be exploited or misgeneralised? This seems like the most plausible mental model I have for aligning intelligent systems without them pursuing radically alien objectives.
What about simulating smaller aspects of cognition that can be chained, like CoT with GPT? You could use self-criticism to align and assess its actions relative to a bunch of messy human abstractions (a rough sketch of such a loop is below). How does that scenario lead to doom? Even if it were misaligned, I think a well-instantiated predictive model could update its understanding of our values from feedback, predicting how a corrigible AI would act.
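For concreteness, here is a minimal sketch of the kind of chained generate/critique/revise loop I have in mind, assuming some text-completion callable `llm`; the prompts, the `critique_and_revise` helper, and the crude stopping rule are all illustrative assumptions, not a proposal for how a real system would have to work:

```python
# Minimal sketch of chaining CoT steps with a self-criticism pass.
# `llm` stands in for any text-completion callable (hypothetical here).
from typing import Callable

def critique_and_revise(llm: Callable[[str], str], task: str, rounds: int = 3) -> str:
    """Draft an answer, then repeatedly critique it against messy human-value
    considerations and revise, chaining the steps like chain-of-thought."""
    draft = llm(f"Think step by step and answer:\n{task}")
    for _ in range(rounds):
        critique = llm(
            "Critique the following answer for factual errors and for conflicts "
            f"with widely shared human values:\n\nTask: {task}\n\nAnswer: {draft}"
        )
        if "no issues" in critique.lower():  # crude convergence check (assumption)
            break
        draft = llm(
            f"Revise the answer to address this critique.\n\nTask: {task}\n\n"
            f"Answer: {draft}\n\nCritique: {critique}"
        )
    return draft

if __name__ == "__main__":
    def stub_llm(prompt: str) -> str:  # trivial stand-in so the sketch runs
        return "no issues"
    print(critique_and_revise(stub_llm, "Summarise the argument above."))
```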
My best guess is that we can’t prompt it to instantiate the right simulacra. How hard this is depends on how the model is initialised. It’s far easier with text, but fabricating an entire consistent history is borderline impossible, especially when the predictor is a superintelligence. It would involve tricking the model into predicting the universe as it would be if, all else being equal, an intelligent AI aligned with our values had come into existence. The model would probably realise that its history was far more consistent with the hypothesis that it was just an elaborate trick.
Yup, I’d lean towards this. If you have a powerful predictor of a bunch of rich, detailed sense data, then in order to “ask it questions,” you need to be able to forge what that sense data would look like if the thing you want to ask about were true. This is hard, it gets harder the more complete the AI’s view of the world is, and if you screw up you can get useless or malign answers without it being obvious.
It might still be easier than the ordinary alignment problem, but you also have to ask yourself about dual use. If this powerful AI makes solving alignment a little easier but makes destroying the world a lot easier, that’s bad.