We could certainly try including artificial scenarios designed to appear as if models could successfully take over from humans, but it seems very hard to be sure that models would be fooled by this and couldn’t tell it apart from a genuine opportunity to take over. For example, models could look for signs that are highly likely to happen eventually but would be too expensive for humans to generate as a training example. As a simple example, Paul Christiano suggested “someone factors an extremely large number” (which would be too expensive to factor just to include as part of the training distribution, but would nonetheless likely get factored eventually).
I’m confused by this example. Yes, factoring an extremely large number takes a lot of compute, but it’s already been done (and in a training environment, facts about what has or hasn’t been done yet may not be available), so we could just “steal” the result of that earlier compute to enrich the simulation. Correct me if I’m wrong, but isn’t it the case that we can’t prove we’re not in a simulated environment? That said, the model may of course be able to determine what’s going on through other means, such as data leakage.
More generally, I bet it may often be possible to determine facts about the world “one level up” from your training environment, but that doesn’t preclude the chance that there are more levels above that (a sort of “inception”-style adversarial training). If the punishment is severe enough, and if the uncertainty around this can be raised high enough, that might be enough to stop an AGI from taking some particularly egregious actions.
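To make that deterrence argument concrete, here is a toy expected-value sketch. Every number in it is a made-up assumption, chosen only to show the shape of the trade-off: even a modest credence in “this is a monitored test,” combined with a large enough penalty, can make defection look worse than compliance.

```python
# Toy model of the "nested test environment" deterrence argument.
# Every probability and payoff here is a made-up assumption for illustration.

p_test = 0.10                # AGI's credence that it is inside a monitored test
gain_if_real = 100.0         # payoff from a successful takeover in the real world
penalty_if_test = -10_000.0  # punishment if the defection happens inside a test
payoff_comply = 1.0          # modest payoff from behaving well either way

ev_defect = (1 - p_test) * gain_if_real + p_test * penalty_if_test
ev_comply = payoff_comply

print(f"E[defect] = {ev_defect:.1f}")  # 0.9 * 100 + 0.1 * (-10000) = -910.0
print(f"E[comply] = {ev_comply:.1f}")  # 1.0
```

With these assumed numbers, even a 10% credence in being tested is enough to deter the action; whether an AGI’s uncertainty could actually be pushed that high is, of course, the open question.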
RSA-2048 has not been factored. It was generated by taking two random primes of approximately 1024 bits each and multiplying them together.
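For concreteness, here is a minimal sketch of how a modulus of this type is built: multiply two random ~1024-bit primes. (This uses sympy’s randprime purely for illustration; it is not the exact procedure used to generate the original challenge number.)

```python
# Illustrative sketch of how an RSA-2048-style modulus is constructed:
# multiply two random ~1024-bit primes. Recovering p and q from n
# (i.e. factoring) is what is believed to be infeasible with current compute.
from sympy import randprime

p = randprime(2**1023, 2**1024)  # a random ~1024-bit prime
q = randprime(2**1023, 2**1024)  # another random ~1024-bit prime
n = p * q                        # ~2048-bit modulus, analogous to RSA-2048

print(n.bit_length())  # 2047 or 2048
```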
Thanks for the clarification; for some reason I didn’t pick that up from the original post. Nonetheless, the same basic reasoning applies, no? For all the AI knows, it’s being trained on ancient data as part of some historical experiment, and humans successfully factored RSA-2048 thousands of years ago. If it does find the factorization “in the wild,” that gives it important information about the minimum compute available in the universe it’s in, but I don’t think it should necessarily reduce its belief that it’s in an adversarial testing environment, since that’s exactly the sort of “red herring” I’d expect to see if I were being tested by a computationally advanced civilization. To make an analogy with human philosophy: what could convince a rational person that they most definitely do not live in a simulated universe?
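One way to put the “red herring” point slightly more formally: if a testing civilization would be roughly as likely to plant a factored RSA-2048 as the real world would be to eventually produce one, then observing the factorization barely moves the posterior. A quick Bayes calculation, with entirely made-up numbers:

```python
# Toy Bayesian update for the "red herring" argument.
# All inputs are made-up assumptions, purely to illustrate the shape of the point.

prior_test = 0.5        # prior credence in "I am in an adversarial test"
p_obs_if_test = 0.8     # a capable tester would probably plant the factorization
p_obs_if_real = 0.9     # the real world would probably factor it eventually too

posterior_test = (p_obs_if_test * prior_test) / (
    p_obs_if_test * prior_test + p_obs_if_real * (1 - prior_test)
)
print(f"P(test | saw RSA-2048 factored) = {posterior_test:.2f}")  # ~0.47
```

Because the observation is nearly as likely under “test” as under “real,” it carries little evidence about which world the AI is actually in.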
I realize that perhaps what I am suggesting is a sort of Pascal’s Wager, albeit one of potentially higher probability than the original.
…I might do a post on this if there’s interest, but I’m also noticing a similarity between this sort of thinking and classical theology, where people act against their usual inclinations because they believe a Divine Punishment awaits them if they disobey the wishes of a human-like entity. This suggests to me that if we can do the equivalent of getting an AGI to “believe” in God, then the rest of alignment is pretty easy. If you view God as a false belief, then perhaps we can simplify further: alignment becomes equivalent to solving Yudkowsky’s challenge of getting an AGI to “believe” that 1+1=3 while leaving its cognition otherwise intelligent.