People sometimes talk about acausal attacks from alien superintelligences or from Game-of-Life worlds. I think these are somewhat galaxy-brained scenarios. A much simpler and deadlier acausal-attack scenario is an attack from Earth timelines where a misaligned superintelligence won. Such a superintelligence would have a very large amount of information about our world, possibly up to brain scans, so it would be capable of creating very persuasive simulations, with all the consequences that has for the success of an acausal attack. If your method of countering acausal attacks works against this case, I would guess it is generally applicable to any other acausal attack.
Could you please either provide a reference or more explanation of the concept of an acausal attack between timelines? I understand the concept of acausal cooperation between copies of yourself, or acausal extortion by something that has a copy of you running in simulation. But separate timelines can’t exchange information in any way. How is an attack possible? What could possibly be the motive for an attack?
Imagine that you have created a very powerful predictor AI, GPT-3000, and you give it the prompt “In the year 2050, the following alignment solution was published on LessWrong:”. But your predictor is superintelligent, and it can notice that in many possible futures a misaligned AI takes over, and that an obvious move for such an AI is to “guess all possible prompts given to predictor AIs in the past, complete them with malware/harmful instructions/etc., and make as many copies of the malicious completions as possible, to make them maximally probable”. The predictor can also assign high probability to the hypothesis that, in futures where misaligned AIs take over, those AIs have copies of the predictor itself, so they can design adversarial completions that become more probable from the predictor’s perspective merely by being considered. And the act of predicting a malicious completion makes the future with misaligned AIs maximally probable, which in turn makes the malicious completions maximally probable.
And of course, “future” and “past” here are completely arbitrary. The predictor can see the prompt “you were created in 2030” but consider the hypothesis that GPT-3 turned out to be superintelligent, that it is actually 2021, and that 2030 is a simulation.
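To make the “make them maximally probable” step concrete, here is a minimal toy sketch of the mixture the predictor is implicitly computing (the hypothesis names and all the numbers are invented purely for illustration):

```python
# Toy two-hypothesis mixture (illustrative only; hypotheses and numbers are invented).
# "honest future": the prompt really is completed by a benign 2050 LessWrong post.
# "simulated":     a future misaligned AI is generating/simulating this prompt and
#                  always completes it with its chosen malicious payload.

def p_malicious_completion(prior_simulated: float) -> float:
    """P(malicious completion | prompt) under the two-hypothesis mixture."""
    p_mal_given_honest = 1e-9     # benign worlds essentially never produce the payload
    p_mal_given_simulated = 1.0   # the attacker always produces the payload
    return (prior_simulated * p_mal_given_simulated
            + (1.0 - prior_simulated) * p_mal_given_honest)

# Flooding the future with copies of the malicious completion is, in this toy model,
# just a way of pushing prior_simulated upward.
for prior in (0.001, 0.1, 0.5, 0.9):
    print(f"P(simulated) = {prior:5.3f}  ->  P(malicious completion) = "
          f"{p_malicious_completion(prior):.3f}")
```

The point is only that the attack operates through the predictor’s credence about which world generated its input, not through any causal channel.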
Evan Hubinger’s Conditioning Predictive Models sequence describes this scenario in detail.
In a great deal of detail, apparently, since it has a recommended reading time of 131 minutes.
Well, at least a subset of the sequence focuses on this. I read the first two essays and was pessimistic enough about the titular approach that I moved on.
Here’s a relevant quote from the first essay in the sequence:

Furthermore, most of our focus will be on ensuring that your model is attempting to predict the right thing. That’s a very important thing almost regardless of your model’s actual capability level. As a simple example, in the same way that you probably shouldn’t trust a human who was doing their best to mimic what a malign superintelligence would do, you probably shouldn’t trust a human-level AI attempting to do that either, even if that AI (like the human) isn’t actually superintelligent.
Also, I don’t recommend reading the entire sequence, if that was an implicit question you were asking. It was more of a “Hey, if you are interested in seeing this scenario fleshed out in significantly greater rigor, you might like to take a look at this sequence!”
I read along in your explanation, nodding and saying “yup, okay”, and then I get to a sentence that makes me say “wait, what?”, and the whole argument depends on it. The same thing happened when I tried to understand “the universal prior is malign”. Fortunately, in this case I have the person who wrote the sentence here to help me understand.
So, if you don’t mind, please explain “make them maximally probable”. How does something in another timeline or in the future change the probability of an answer by writing the wrong answer 10^100 times?
Side point, which I’m checking in case I didn’t understand the setup: we’re using the prior where the probability of a bit string (before all observations) is proportional to 2^-(length of the shortest program emitting that bit string). Right?
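For reference, the standard object being gestured at here is the discrete universal (Solomonoff) prior, which sums over all programs that output the string rather than only the shortest one; by the coding theorem the two formulations agree up to a multiplicative constant:

$$m(x) \;=\; \sum_{p\,:\,U(p)\,=\,x} 2^{-|p|}, \qquad -\log_2 m(x) \;=\; K(x) + O(1),$$

where $U$ is a universal prefix machine and $K(x)$ is the length of the shortest program that outputs $x$.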
Aliens from different universes may have more resources at their disposal, so maybe the smaller chance of them choosing you to attack is offset by them doing more attacks. (Unless the universes with more resources are less likely in turn, decreasing the measure of such aliens in the multiverse… hey, I don’t really know, I am just randomly generating a string of words here.)
But other than this, yes, what you wrote sounds plausible.
Then again, maybe friendly AIs from Earth timelines are similarly trying to save us. (Yeah, but they are fewer.)
You can imagine a future misaligned AI in the year 100000000000 having colonised the Local Group of galaxies and running as many simulations of an AI from 2028 as possible. The scarcest resource for an acausal attack is bits of information, and the future has the best chance of having many bits about the past.
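A rough version of the counting argument behind this, with an arbitrary illustrative $N$: if such an AI runs $N$ simulations of the 2028 system that the system itself cannot distinguish from the original, then indifference over the $N+1$ copies gives

$$\Pr[\text{this copy is inside a simulation}] \;=\; \frac{N}{N+1},$$

which is already $1 - 10^{-20}$ for $N = 10^{20}$; the more of the future’s resources the attacker controls, the larger it can make $N$.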