Lukas Finnveden comments on Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Lukas Finnveden 7 Sep 2024 15:47 UTC
LW: 2 AF: 1

Yeah I was imagining we can proliferate by ‘gradient descenting’ on similar cases.

What is this referring to? Are you thinking of something like: varying small facts about the scenario to get a function from “details of the scenario” -> p(escape attempt), then switching to a scenario with a higher p, and repeating?
- Tom Davidson 10 Sep 2024 17:10 UTC
  LW: 1 AF: 1
  I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of the AI doing that act is high. The thought is that, if scheming is systematic, this will lead you to discover a wide range of scenarios in which the AI schemes, which is evidence that it’s not just a one-off random role-playing thing.
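
To make the procedure described above concrete, here is a minimal toy sketch, not anything proposed in the thread. It assumes a scenario can be written as a tuple of small binary "facts", and that some estimate of p(egregious act | scenario) is available. Both `toy_p_act` and the bit-flip `neighbors` function are hypothetical stand-ins; in practice the estimate would come from running the AI many times on each scenario. The search is a simple flood fill over one-fact changes rather than literal gradient descent.

```python
from collections import deque


def toy_p_act(scenario):
    """Hypothetical stand-in for an empirical estimate of
    p(egregious act | scenario). Here: the fraction of facts set to 1."""
    return sum(scenario) / len(scenario)


def neighbors(scenario):
    """Yield every scenario that differs from `scenario` in exactly one fact."""
    for i in range(len(scenario)):
        changed = list(scenario)
        changed[i] = 1 - changed[i]
        yield tuple(changed)


def proliferate(seed, p_act, threshold, max_scenarios=1000):
    """Starting from one caught scenario, explore nearby scenarios and keep
    those where the estimated probability of the act stays >= threshold."""
    if p_act(seed) < threshold:
        raise ValueError("seed scenario must itself meet the threshold")
    found = {seed}
    frontier = deque([seed])
    while frontier and len(found) < max_scenarios:
        current = frontier.popleft()
        for candidate in neighbors(current):
            if candidate not in found and p_act(candidate) >= threshold:
                found.add(candidate)
                frontier.append(candidate)
    return found


if __name__ == "__main__":
    seed = (1, 1, 1, 1)  # the original scenario where the act was observed
    high_p = proliferate(seed, toy_p_act, threshold=0.75)
    print(len(high_p), "scenarios with estimated p >= 0.75")
```

If the behavior were a one-off, the returned set would stay close to the seed alone; a large, varied set is the kind of outcome the comment treats as evidence of systematic scheming.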