eggsyntax comments on Why imperfect adversarial robustness doesn’t doom AI control

eggsyntax 19 Nov 2024 19:51 UTC
2 points
0
One story here could be that the model is being used only in an API context where it’s being asked to take actions on something well-modeled as a Markov process, where the past doesn’t matter (and we assume that the current state doesn’t incorporate relevant information about the past). There are certainly use cases that fit that (‘trigger invisible fence iff this image contains a dog’; ‘search for and delete this file’). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.
Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn’t apply to current LLMs but seems like a very plausible path for economically useful models like agents.