ryan_greenblatt comments on Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

ryan_greenblatt 2 Sep 2024 19:23 UTC
LW: 6 AF: 6
2
AF
They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish.

It sounds as though you’re imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).

This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.

Beyond showing this particular AI was scheming, you could plausibly further show that normal training techniques lead to scheming by default (though doing so convincingly might take a lot longer).

Maybe? Just because we’ve caught a model in production doesn’t mean that we can consistently catch schemers without a huge amount of runtime compute (and possibly human labor). A full deployment massively subsidizes your ability to find cases where an AI does egregiously bad things in a credibly dangerous way. And indeed, AIs would want to only defect in cases where they could actually cause problems.

So, it might be even harder to go from “we caught this one model” to “XYZ training process consistently results in misalignment”.

(If we had a robust check for misalignment, we could iterate or train against it.)

Personally, I feel OK (~~65%~~ 60% chance?) about convincing the key people within 1 month if:
- There is a solid plan in advance for how to study AIs we catch.
- There is a large effort to study this once we catch the AI. (E.g. >20% of AI lab research resources are put on this.)
But this isn’t exactly amazing...
- Tom Davidson 5 Sep 2024 13:25 UTC
  LW: 8 AF: 5
  −3
  AF Parent
  It sounds as though you’re imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
  This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.
  Yeah I was imagining we can proliferate by ‘gradient descenting’ on similar cases. That does depend on the details though. Maybe the schemer has anticipated this risk and purposefully precommitted to only scheme if certain random disjunctive facts about the context hold true, in the anticipation that humans will find it hard to create strong evidence against them. Though this strategy is possible conceptually, I imagine it would be very hard to implement properly in practice and only significantly superhuman AI could pull it off. I was assuming a less galaxy-brained schemer that decides whether to scheme simply based on its probability of getting caught and how much it could contribute to AI takeover if it succeeds. On this simpler strategy, I expect we could create a lot more defection scenarios and build up a robust scientific case.
  - Lukas Finnveden 7 Sep 2024 15:47 UTC
    LW: 2 AF: 1
    0
    AF Parent
    
    Yeah I was imagining we can proliferate by ‘gradient descenting’ on similar cases.
    
    What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?
    - Tom Davidson 10 Sep 2024 17:10 UTC
      LW: 1 AF: 1
      0
      AF Parent
      I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of it doing that act is high. The thought is that, if scheming is systematic, then this will lead you to discover a wide range of scenarios in which the AI schemes, and evidence that it’s not just a one-off random role-playing thing.
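      A minimal sketch of that exploration, under loud assumptions: the feature names, the `toy_defection_rate` stand-in, and the 0.5 threshold are all hypothetical and not from this thread. In practice the rate would be estimated from many sampled rollouts of the actual model. The search here is a plain breadth-first walk over one-feature changes, not literal gradient descent.

      ```python
      # Hypothetical scenario features; none of these names come from the thread.
      FEATURES = {
          "oversight": ["high", "medium", "low"],
          "payoff": ["small", "large"],
          "task": ["coding", "ops", "research"],
      }

      def toy_defection_rate(scenario):
          """Stand-in for an empirical estimate of p(egregious act | scenario).

          Toy rule (an assumption): defection is likelier under low oversight
          and when the payoff is large.
          """
          rate = 0.05
          if scenario["oversight"] == "low":
              rate += 0.5
          if scenario["payoff"] == "large":
              rate += 0.3
          return rate

      def neighbors(scenario):
          """Yield every scenario that differs from `scenario` in exactly one feature."""
          for name, values in FEATURES.items():
              for value in values:
                  if value != scenario[name]:
                      yield {**scenario, name: value}

      def explore(seed, rate_fn, threshold=0.5, max_steps=100):
          """Walk outward from the caught case through small variations, keeping
          only scenarios whose estimated defection rate stays at or above threshold."""
          def key(s):
              return tuple(sorted(s.items()))

          found = {key(seed): seed}
          frontier = [seed]
          steps = 0
          while frontier and steps < max_steps:
              current = frontier.pop(0)
              steps += 1
              for cand in neighbors(current):
                  if key(cand) not in found and rate_fn(cand) >= threshold:
                      found[key(cand)] = cand
                      frontier.append(cand)
          return list(found.values())

      # The one case where the AI was caught (hypothetical values).
      seed = {"oversight": "low", "payoff": "large", "task": "ops"}
      high_rate_scenarios = explore(seed, toy_defection_rate)
      ```

      If the set of high-rate scenarios found this way is large and varied, that supports the "systematic" reading; if it collapses to near-copies of the seed, it looks more like a one-off.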
- Raemon 2 Sep 2024 19:27 UTC
  LW: 6 AF: 4
  2
  AF Parent
  (If we had a robust check for misalignment, we could iterate or train against it.)
  This seems technically true but I wanna flag the argument “it seems really hard to be confident that you have robust enough checks that training against them is good, instead of bad (because it trains the AI to hide better)”.
  - ryan_greenblatt 2 Sep 2024 19:56 UTC
    LW: 2 AF: 2
    0
    AF Parent
    Agreed, I really should have said “or possibly even train against it”. I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging throughout training, then you need N full training runs, which is very pricey!