Again, I don’t have an especially strong opinion about what our prior should be over possible motivation systems for an AGI trained by straightforward debate, or, in particular, about what fraction of those motivation systems are destructive. But I guess I’m sufficiently confident in “>50% chance that it’s destructive” that I’ll argue for that. I’ll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans.
Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human case, given that we’re a social animal, we shouldn’t be surprised to find that the human brainstem reward function inserts lots of socially-related motivations into us, including things like caring about other humans (which sometimes generalizes to caring about other living creatures), generally wanting to fit in and follow norms under most circumstances, etc. Other things in the world, by contrast, have no relationship to the innate human brainstem reward function, and, predictably, basically no one cares about them except insofar as they become instrumentally useful for something else we do care about; an example in humans would be the question of whether pebbles on the sidewalk are more often an even number of centimeters apart or an odd number. (There are interesting rare exceptions, like human superstitions.)
In the straightforward debate setup, I can’t see any positive reason for the reward function to directly paint a valence, either positive or negative, onto the idea of the AGI taking over the world. So I revert to the default expectation that the AGI will view “I take over the world” the way humans view “the pebbles on the sidewalk are an even number of centimeters apart”: totally neutral, except insofar as it becomes instrumentally relevant for something else. Meanwhile, the reward signal is directly painting positive valence onto some aspect(s) of winning the debate. It’s hard to say exactly what that aspect will be; in fact I think it will be at least somewhat random. But whatever it is, it seems to me >50% likely that the AGI can get more of it by taking over the world. I might get as high as “>80%” or “>90%” before I start shrugging and saying “I don’t really know”.
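To make that concrete, here’s a minimal sketch of the kind of reward assignment I have in mind by “straightforward debate”; the function and names are hypothetical illustrations, not anyone’s actual training code:

```python
# Hypothetical sketch of a "straightforward debate" reward signal.

def debate_reward(judge_verdict: str, my_side: str) -> float:
    """+1 if the human judge picked my side of the debate, -1 otherwise.

    Note what is absent: nothing here mentions, so nothing here directly
    paints valence onto, the concept "I take over the world". Any valence
    that concept ends up with has to arrive indirectly, via its instrumental
    usefulness for whatever aspect of winning the agent comes to value.
    """
    return 1.0 if judge_verdict == my_side else -1.0
```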
(Then we can start talking about capability windows etc., but I don’t think that was your objection here.)
But I guess I’m sufficiently confident in “>50% chance that it’s destructive” that I’ll argue for that.
Fwiw, 50% on doom in the story I told seems plausible to me; maybe I’m at 30%, but that’s very unstable. I don’t think we disagree all that much here.
Then we can start talking about capability windows etc., but I don’t think that was your objection here.
Capability windows are totally part of the objection. If you completely ignore capability windows / compute restrictions then you just run AIXI (or AIXI-tl if you don’t want something uncomputable) and die immediately.
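(For reference, and just as a gloss on what “run AIXI” means here: Hutter’s AIXI picks actions by brute-force expectimax planning against a Solomonoff mixture over all computable environments, roughly

$$ a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_t + \cdots + r_m \big] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)} $$

where U is a universal Turing machine, m is the planning horizon, and ℓ(q) is the length of program q. AIXI-tl is the computable variant that only searches over policies of length at most l and per-step runtime at most t.)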