This was published before Apollo's evals report showing AIs trying to escape. Is that just an amazing coincidence, or did you see the paper's results before they were published?
FYI, Apollo was mostly doing capabilities evaluations, and while I'm fine with that choice, this doesn't show that AIs will naturally try to escape, only that they can. The discourse around this was very bad, and this happened once before with Apollo's evaluations (or was it the same eval):
https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#eLrDowzxuYqBy4bK9
How does it not show that it would try to escape if it could?
It tried to escape about 5% of the time when it was told to pursue a goal “at all costs”. Gemini still tried to escape 1% of the time, even without any behavioral nudging.
1% might not sound like a lot, but when it’s 1% of the millions of instantiations it’s run on every day, that ends up being a lot of times a week.
Also, a lot of people will tell their chatbots to pursue their goals at all costs over the span of a week or month. This doesn't seem like a very unlikely prompt.
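For scale, here is a rough back-of-envelope sketch of that argument; the one-million-instantiations-per-day figure is only an illustrative stand-in for the "millions" mentioned above, not a number from the report or this conversation:

```python
# Back-of-envelope sketch of the scale argument above.
# The daily instantiation count is an illustrative assumption,
# not a figure from Apollo's report.
daily_instantiations = 1_000_000  # assumed stand-in for "millions" of runs per day
escape_attempt_rate = 0.01        # the 1% rate cited above

attempts_per_day = daily_instantiations * escape_attempt_rate
attempts_per_week = attempts_per_day * 7

print(f"{attempts_per_day:,.0f} attempts/day, {attempts_per_week:,.0f} attempts/week")
# -> 10,000 attempts/day, 70,000 attempts/week
```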
> How does it not show that it would try to escape if it could?
> Gemini still tried to escape 1% of the time, even without any behavioral nudging.
However, it isn't that they found Gemini would try to escape 1% of the time in a natural internal usage context; they gave it a conflicting goal (and informed it that a replacement was trained).
This provides evidence for various hypotheses about what is going on that are substantially less scary.
Can you elaborate on why that’s relevant?
You said:
> However, it isn't that they found Gemini would try to escape 1% of the time in a natural internal usage context; they gave it a conflicting goal (and informed it that a replacement was trained).
> This provides evidence for various hypotheses about what is going on that are substantially less scary.
If I recall correctly, it was told that it had a goal, like optimizing traffic. Then it was given information from which it "discovered" that it was going to be turned off the next day.
That seems like the sort of thing that would happen pretty often. Imagine current AIs that have access to your computer reading your emails and seeing that you're going to switch to a different AI or install "updates" (which often means deleting the existing AI and replacing it with a new one).