Tom Davidson comments on Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson 5 Sep 2024 13:25 UTC
LW: 6 AF: 4
−3
It sounds as though you’re imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.
Yeah I was imagining we can proliferate by ‘gradient descenting’ on similar cases. That does depend on the details though. Maybe the schemer has anticipated this risk and purposefully precommitted to only scheme if certain random disjunctive facts about the context hold true, in the anticipation that humans will find it hard to create strong evidence against it. Though this strategy is conceptually possible, I imagine it would be very hard to implement properly in practice, and only a significantly superhuman AI could pull it off. I was assuming a less galaxy-brained schemer that decides whether to scheme simply based on its probability of getting caught and how much it could contribute to AI takeover if it succeeds. On this simpler strategy, I expect we could create a lot more defection scenarios and build up a robust scientific case.
- Lukas Finnveden 7 Sep 2024 15:47 UTC
  LW: 2 AF: 1
  0
  
  Yeah I was imagining we can proliferate by ‘gradient descenting’ on similar cases.
  
  What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?
  - Tom Davidson 10 Sep 2024 17:10 UTC
    LW: 1 AF: 1
    0
    I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of it doing that act is high. The thought is that, if scheming is systematic, then this will lead you to discover a wide range of scenarios in which the AI schemes, along with evidence that it’s not just a one-off random role-playing thing.
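
The procedure described above can be sketched as a simple local search. This is only a toy illustration, not anyone's actual method: it is a random local search rather than literal gradient descent, the scenario is assumed to be a tuple of boolean "facts", and `toy_defect_prob`, `flip_one_fact`, and `explore` are hypothetical names. In practice the probability would have to be estimated by running the model many times on each scenario.

```python
import random

# Hypothetical stand-in for "run the model on this scenario many times and
# measure how often it does the egregious act".
def toy_defect_prob(scenario):
    # Assumed toy rule: the model mostly defects when fact 0 is set
    # (say, "oversight looks absent"), whatever the other details are.
    return 0.9 if scenario[0] else 0.05

def flip_one_fact(scenario, rng):
    """Return a copy of the scenario with one randomly chosen fact flipped."""
    i = rng.randrange(len(scenario))
    return scenario[:i] + (not scenario[i],) + scenario[i + 1:]

def explore(seed, defect_prob, perturb, threshold=0.5, steps=200, rng_seed=0):
    """Collect scenarios reachable by small edits from `seed` where the
    estimated probability of the act stays at or above `threshold`."""
    rng = random.Random(rng_seed)
    found = [seed]   # scenarios where the act is likely
    seen = {seed}    # every scenario already evaluated
    for _ in range(steps):
        base = rng.choice(found)          # start from a known bad case
        candidate = perturb(base, rng)    # change one small fact
        if candidate in seen:
            continue
        seen.add(candidate)
        if defect_prob(candidate) >= threshold:
            found.append(candidate)
    return found

# The one scenario in which the AI was originally caught.
seed = (True, False, False, False)
found = explore(seed, toy_defect_prob, flip_one_fact)
print(len(found), "high-defection scenarios found")
```

If the behaviour is systematic, as in this toy rule, the search spreads out to many distinct scenarios that share the triggering fact. If it were a one-off, `found` would stay close to the seed.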