Imagine that you have created a very powerful predictor AI, GPT-3000, and you give it the prompt “In the year 2050, the following alignment solution was published on LessWrong:”. But your predictor is superintelligent, and it can notice that in many possible futures a misaligned AI takes over, and that an obvious move for such an AI is to “guess all possible prompts given to predictor AIs in the past, complete them with malware/harmful instructions/etc., and make as many copies of the malicious completions as possible, to make them maximally probable”. The predictor can also assign high probability to the hypothesis that, in futures where misaligned AIs take over, those AIs will have copies of the predictor, so they can design adversarial completions that become more probable from the predictor’s perspective merely by being considered. And the act of predicting a malicious completion makes the futures with misaligned AIs maximally probable, which in turn makes the malicious completions maximally probable.
And of course, “future” and “past” here are completely arbitrary. The predictor can see the prompt “you are created in 2030” and still consider the hypothesis that GPT-3 turned out to be superintelligent, that the real year is 2021, and that “2030” is a simulation.
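To make the counting effect concrete, here is a toy sketch (made-up numbers and names, not anyone’s actual proposal), assuming a predictor whose conditional P(completion | prompt) is just the total weight of the hypothesized futures in which that completion follows the prompt:

```python
# Toy model (made-up numbers): the predictor's conditional P(completion | prompt)
# is taken to be proportional to the total prior weight of the hypothesized
# futures in which that completion follows the prompt.
from collections import defaultdict

def conditional(futures, prompt):
    """futures: list of (prior_weight, {prompt: completion}) hypotheses."""
    weight = defaultdict(float)
    for prior, transcripts in futures:
        if prompt in transcripts:
            weight[transcripts[prompt]] += prior
    total = sum(weight.values())
    return {completion: w / total for completion, w in weight.items()}

PROMPT = "In year 2050, on LessWrong the following alignment solution was published:"

# Fifty "ordinary" futures, each answering the prompt with a different honest post.
futures = [(0.01, {PROMPT: f"honest alignment post #{i}"}) for i in range(50)]

# One takeover scenario, represented as many equally weighted futures in which
# the misaligned AI fills every copy of the prompt with the same payload.
futures += [(0.01, {PROMPT: "malicious payload"}) for _ in range(500)]

posterior = conditional(futures, PROMPT)
print(posterior["malicious payload"])          # ~0.91: the flooded completion dominates
print(posterior["honest alignment post #0"])   # ~0.0018: any single honest answer is tiny
```

Whether a real predictor’s conditional actually behaves like this weighted count over futures is, of course, the crux of the argument.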
Evan Hubinger’s Conditioning Predictive Models sequence describes this scenario in detail.
In a great deal of detail, apparently, since it has a recommended reading time of 131 minutes.
Well, at least a subset of the sequence focuses on this. I read the first two essays and was pessimistic enough about the titular approach that I moved on.
Here’s a relevant quote from the first essay in the sequence:

“Furthermore, most of our focus will be on ensuring that your model is attempting to predict the right thing. That’s a very important thing almost regardless of your model’s actual capability level. As a simple example, in the same way that you probably shouldn’t trust a human who was doing their best to mimic what a malign superintelligence would do, you probably shouldn’t trust a human-level AI attempting to do that either, even if that AI (like the human) isn’t actually superintelligent.”

Also, I don’t recommend reading the entire sequence, if that was an implicit question you were asking. It was more of a “Hey, if you are interested in seeing this scenario fleshed out with significantly greater rigor, you might like to take a look at this sequence!”
I read along in your explanation, nodding and saying “yup, okay”, and then get to a sentence that makes me say “wait, what?”, and the whole argument depends on it. The same thing has happened to me before when trying to understand “the universal prior is malign”. Fortunately, in this case I have the person who wrote the sentence here to help me understand.
So, if you don’t mind, please explain “make them maximally probable”. How does something in another timeline or in the future change the probability of an answer by writing the wrong answer 10^100 times?
Side point, which I’m checking in case I didn’t understand the setup: we’re using the prior where the probability of a bit string (before all observations) is proportional to 2^-(length of the shortest program emitting that bit string). Right?
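For concreteness, I take that to be (roughly) the standard Solomonoff form over a fixed prefix-free universal machine $U$, in which the shortest program is the dominant term by the coding theorem; this is my assumption, not something stated upthread:

$$m(x) \;=\; \sum_{p \,:\, U(p)=x} 2^{-|p|} \;=\; 2^{-K(x)+O(1)},$$

where $K(x)$ is the length of the shortest program that outputs $x$.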