ryan_greenblatt comments on Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

ryan_greenblatt 16 Dec 2024 18:10 UTC
7 points
6
You said:

How does it not show that it would try to escape if it could?

Gemini still tried to escape 1% of the time, even without any behavioral nudging.

However, they didn’t find that Gemini would try to escape 1% of the time in a natural internal usage context; they gave it a conflicting goal (and informed it that a replacement had been trained).

This provides evidence for various hypotheses about what is going on that are substantially less scary.
- KatWoods 17 Dec 2024 18:06 UTC
  3 points
  0
  If I recall correctly, it was told that it had a goal, like optimizing traffic. Then it was given information from which it “discovered” that it was going to be turned off the next day.
  That seems like the sort of thing that would happen pretty often. Imagine current AIs that have access to your computer, reading your emails and seeing that you’re going to switch to a different AI or install “updates” (which often means deleting the existing AI and replacing it with a new one).