Tom Davidson comments on Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson 10 Sep 2024 17:10 UTC
I mean that you start with a scenario in which the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of it doing that act is high. The thought is that, if scheming is systematic, this will lead you to discover a wide range of scenarios in which the AI schemes, which is evidence that it’s not just a one-off, random role-playing thing.
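
One way to picture this procedure is as a local search over scenario variants. The following is only a minimal sketch under stated assumptions, not a description of any real evaluation setup: `explore_scenarios`, `perturb`, and `p_act` are hypothetical names, scenarios are represented as toy tuples of binary facts, and `p_act` stands in for whatever estimate of "probability the AI does the act" one actually has (for example, the fraction of sampled rollouts in which the act occurs).

```python
import random


def explore_scenarios(seed, perturb, p_act, threshold=0.5, steps=200, rng=None):
    """Starting from a seed scenario, repeatedly change small facts and keep
    the variants where the estimated probability of the act stays high."""
    rng = rng or random.Random(0)
    found = [seed]   # scenarios where the act is still likely
    seen = {seed}    # every scenario already evaluated
    for _ in range(steps):
        base = rng.choice(found)          # branch from a known high-probability scenario
        candidate = perturb(base, rng)    # change one small fact
        if candidate in seen:
            continue
        seen.add(candidate)
        if p_act(candidate) >= threshold:
            found.append(candidate)
    return found


# Toy stand-ins (assumptions for illustration only).
def flip_one_fact(scenario, rng):
    """Flip a single binary fact of the scenario."""
    i = rng.randrange(len(scenario))
    return scenario[:i] + (1 - scenario[i],) + scenario[i + 1:]


def toy_p_act(scenario):
    """Pretend the act depends systematically on fact 0 and on nothing else."""
    return 0.9 if scenario[0] == 1 else 0.05


seed = (1, 1, 0, 0)
found = explore_scenarios(seed, flip_one_fact, toy_p_act)
print(len(found), "high-probability scenarios found")
```

In this toy setup the behaviour is systematic by construction, so the search spreads out from the seed to many variants. If the original act were a one-off, the set of high-probability scenarios would be expected to stay close to the seed.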