kave comments on Why imperfect adversarial robustness doesn’t doom AI control

kave 18 Nov 2024 16:46 UTC
LW: 4 AF: 2
1
The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced.
Why does it know its attempt was replaced? Because it can guess it wouldn’t have produced R? If it can tell that, how come it can’t guess it specifically would have produced A?
- Buck 18 Nov 2024 16:55 UTC
  LW: 7 AF: 4
  2
  Why does it know its attempt was replaced? Because it can guess it wouldn’t have produced R?
  Yep.
  If it can tell that, how come it can’t guess it specifically would have produced A?
  I’m just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it’s a lot easier to notice that something isn’t your style than to guess what exactly you would have done (among other things, there are many more bits in “what action did I take” than “did I take this action”).
  - kave 19 Nov 2024 3:15 UTC
    LW: 11 AF: 4
    −1
    I don’t think this distinction is robust enough to rely on much as a defensive property. I think it’s probably not that hard to think “I probably would have tried something in direction X, or direction Y”, and then gather lots of bits about how well the clusters X and Y work.
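
To make the "bits" framing in this exchange concrete, here is a rough sketch, not anything from the post itself. It assumes, purely for illustration, that the model's possible actions are equally likely, and the numbers `total_actions` and `cluster_size` are made up. Under that assumption, "did I take this action" is a yes/no question worth at most 1 bit, "what action did I take" is worth log2(N) bits, and narrowing the guess to "something in direction X or Y" recovers part of that gap.

```python
import math


def bits_to_identify(num_candidates: int) -> float:
    """Bits needed to single out one of `num_candidates` equally likely options."""
    if num_candidates < 1:
        raise ValueError("need at least one candidate")
    return math.log2(num_candidates)


# Hypothetical numbers, chosen only for illustration.
total_actions = 1_000_000  # assumed size of the space of attempts the model might make
cluster_size = 1_000       # assumed number of attempts in "direction X or direction Y"

# "Did I take this action?" is a yes/no question: at most 1 bit.
yes_no_bits = bits_to_identify(2)

# "What action did I take?" with no other information.
full_bits = bits_to_identify(total_actions)

# Bits still missing after guessing "it was somewhere in cluster X or Y".
remaining_bits = bits_to_identify(cluster_size)

# Bits gained by narrowing down to the clusters.
gained_bits = full_bits - remaining_bits

print(f"yes/no question:         {yes_no_bits:.2f} bits")
print(f"exact action:            {full_bits:.2f} bits")
print(f"remaining within X or Y: {remaining_bits:.2f} bits")
print(f"gained by clustering:    {gained_bits:.2f} bits")
```

Both positions show up in the same numbers: pinning down the exact action takes far more bits than the yes/no check, but a coarse guess about the cluster already closes a large part of that gap. How large that part is in practice depends entirely on the real distribution over actions, which this sketch does not model.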