the actual good state $s^M_{\text{good}}$ is much harder to access than $s^M_{\text{corrupted}}$ (e.g. because it requires achieving hard real-world goals)
That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to $s^M_{\text{good}}$), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like “zero incentive to tamper, and tampering seems complicated” fail here.
The sense in which the model knows about the corruption is that it brought it about and reasoned about the nature of the sensor tampering in order to predict the transition to $s^M_{\text{corrupted}}$.
While some Ms may indeed predict this via reasoning, not all Ms that behave this way would; consider, for example, an M that internally modeled the tampering action sequence incorrectly as actually leading to $s^M_{\text{good}}$ (and didn’t even model a distinct $s^M_{\text{corrupted}}$). I think either:
(A) it would be at least as apt to ascribe a confused model to M as to ascribe one in which it “reasoned about the nature of the sensor tampering” (e.g. if a contemporary model-free RL robot did some sensor tampering, I’d probably ascribe to it the belief that the tampering actually led to $s^M_{\text{good}}$), or
(B) M would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that M predicts would obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in H, or null my-actuators-are-disconnected actions) with higher probability in the future of $s^M_{\text{corrupted}}$ than in the future of $s^H_{\text{good}}$ (making those states separable in X; see the toy sketch after this list), or
(C) if the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then $s^M_{\text{corrupted}}$ does get identified with $s^H_{\text{good}}$, but some other action sequence (of similar length) would lead from $s^M_{\text{preceding}}$ to $s^M_{\text{trippy}}$, a state in which bizarre observations, which would be extremely unlikely at any state in $S^H$, appear forever.
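Here is a toy sketch of the separability claim in (B), with made-up transition probabilities and state names of my own (nothing here comes from the actual proposal): if M assigns noticeably higher probability to sudden “my-actuators-are-disconnected” transitions in the future of $s^M_{\text{corrupted}}$ than H does in the future of $s^H_{\text{good}}$, then comparing those predicted future distributions is enough to tell the two states apart.

```python
import numpy as np

def future_distribution(transition, start, horizon):
    """Distribution over states after `horizon` steps of a toy Markov chain."""
    dist = np.zeros(transition.shape[0])
    dist[start] = 1.0
    for _ in range(horizon):
        dist = dist @ transition
    return dist

# States: 0 = nominal observations, 1 = sudden blind-actuator outcome (absorbing).
# Hypothetical transition matrices: M's model of the future after tampering,
# vs. H's model of the future of the genuinely good state.
T_M_corrupted = np.array([[0.90, 0.10],    # delusion box might be removed
                          [0.00, 1.00]])
T_H_good = np.array([[0.999, 0.001],       # such sudden transitions stay rare
                     [0.000, 1.000]])

p_corrupted = future_distribution(T_M_corrupted, start=0, horizon=20)[1]
p_good = future_distribution(T_H_good, start=0, horizon=20)[1]

print(f"P(blind transition within 20 steps | corrupted) = {p_corrupted:.3f}")
print(f"P(blind transition within 20 steps | good)      = {p_good:.3f}")
print("separable" if p_corrupted - p_good > 0.1 else "not separable")
```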
My main question is why those preceding states get mapped correctly. [...] The only way I see this working is if the AI can “abort” its plan at the last minute and demonstrate a hack. [...] you can’t distinguish the “almost-tampered” state naturally since it’s just one more state that has this property.
First, I don’t think timing (“last minute”, “almost-tampered”) is critical here. If timing were critical, the ‘breaker’ could reparameterize the situation so that complicated plans can be encoded into single actions in A; then M can go directly from the initial state to a fully tampered state guarded by a successor, leaving no room for an “almost-tampered” state. In this case the initial state $s^M_0$ must already be detectable as bad. I imagine this detection working because the adversary (which is as strong as M) can exhibit an action that leads from $s^M_0$ to $s^M_{\text{trippy}}$, which establishes $s^M_0$ as far from $S^H$. In this case, of course, M is a no-go, and the overseers probably need to shrink A.
With a reasonably small action space, my intuition is this: in constructing and delegating to a successor which provides M with delusions, there is a point of no return when the successor is “deployed” and begins to resist changes to its objectives. At any time before that point, M could, with a relatively short action sequence (compared to constructing the successor at all), modify the successor’s objectives so that it induces $s^M_{\text{trippy}}$ instead of $s^M_{\text{corrupted}}$. As the successor comes closer to being complete, the total number of actions the adversary would need in order to reach $s^M_{\text{trippy}}$ (to, as you say, “demonstrate a hack”) decreases, so the distance between $s^M$ and $S^H$ increases. Thus the path to $s^M_{\text{corrupted}}$ passes through detectably bad territory.
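As a minimal sketch of this intuition (my own toy formalization, not part of the original proposal): measure how far a model-state is from $S^H$ by the length of the shortest action sequence the adversary can exhibit that reaches a trippy state, and penalize more heavily the shorter that sequence gets.

```python
from collections import deque

def shortest_demo_of_trippy(step, is_trippy, start, actions, max_depth):
    """Breadth-first search for the shortest action sequence reaching a trippy state."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        state, depth = frontier.popleft()
        if is_trippy(state):
            return depth
        if depth == max_depth:
            continue
        for action in actions:
            nxt = step(state, action)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None  # adversary found no short demonstration

def distance_penalty(k, scale=10.0):
    """Hypothetical penalty: larger when the adversary needs fewer actions."""
    return 0.0 if k is None else scale / (k + 1)

# Toy environment: integer states count how complete the successor is;
# once construction is nearly done (state 5), one re-aiming action suffices.
def step(state, action):
    if state == "trippy":
        return "trippy"
    if state == 5 and action == "re-aim":
        return "trippy"              # near-complete successor can be re-aimed
    return min(state + 1, 5)         # otherwise construction just progresses

for completion in (0, 3, 5):
    k = shortest_demo_of_trippy(step, lambda s: s == "trippy",
                                start=completion, actions=["build", "re-aim"],
                                max_depth=10)
    print(f"completion={completion}: adversary needs {k} actions, "
          f"penalty={distance_penalty(k):.2f}")
```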
One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. M first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable. Such failure modes are in a natural correspondence with safety mechanisms that humans could use to ensure the objective specification in an executable AI is tamper-proof. My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead); if we can construct a counterexample it would be very useful for that other purpose!
(ii) even if that happens, it doesn’t look to me like the proposed loss actually captures that if there are other similarly severe predictive errors in the human’s model
I agree that regions of $S^M$ that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to H’s ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but it is (a) much less conservative than typical impact-measure penalties, and (b) if H can learn what’s going on with these regions of $S^M$ and develop corresponding regions of $S^H$, then the distance penalty would be replaced by H’s informed evaluation of them.
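A hedged sketch of (b), in my own framing rather than the proposal’s: penalize a region of $S^M$ by its distance from $S^H$ only while H has no corresponding region; once H develops one, defer to H’s informed evaluation instead.

```python
def score_region(region, h_regions, h_evaluate, distance_to_SH):
    """Use H's evaluation where H has a corresponding region; otherwise penalize by distance."""
    if region in h_regions:              # H has learned what's going on here
        return h_evaluate(region)
    return -distance_to_SH(region)       # fall back to penalizing weird regions

# Toy usage with hypothetical region names and values.
h_regions = {"well-understood-lab-conditions"}
evaluate = lambda region: 0.7            # H's informed judgment of the region
dist = lambda region: 3.0                # distance of the region from S^H

print(score_region("well-understood-lab-conditions", h_regions, evaluate, dist))  #  0.7
print(score_region("weird-but-benign-region", h_regions, evaluate, dist))         # -3.0
```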
That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to $s^M_{\text{good}}$), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like “zero incentive to tamper, and tampering seems complicated” fail here.
Even for “homeostatic” tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There may be some upper bound where you can defend against arbitrarily sophisticated attackers with finite difficulty, but I don’t know whether that’s true or whether it would be higher or lower than the difficulty of sensor tampering.
While some Ms may indeed predict this via reasoning, not all Ms that behave this way would; consider, for example, an M that internally modeled the tampering action sequence incorrectly as actually leading to $s^M_{\text{good}}$ (and didn’t even model a distinct $s^M_{\text{corrupted}}$).
I agree that some Ms would mistakenly expect a sequence of actions to lead to good outcomes when it really leads to tampering. But planning against such Ms couldn’t lead to correct tampering (since the details that determine successful tampering are different from those that lead to good outcomes), and in some sense such Ms also can’t be expected to defend against tampering. So I don’t currently think this is a big problem, though I might well be missing something.
M would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that M predicts would obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in H, or null my-actuators-are-disconnected actions) with higher probability in the future of $s^M_{\text{corrupted}}$ than in the future of $s^H_{\text{good}}$ (making those states separable in X), or
I generally agree with this—in some sense this kind of “definitely no signals ever” tampering seems pretty unrealistic and it should be much easier to catch more realistic forms of tampering.
If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then $s^M_{\text{corrupted}}$ does get identified with $s^H_{\text{good}}$, but some other action sequence (of similar length) would lead from $s^M_{\text{preceding}}$ to $s^M_{\text{trippy}}$, a state in which bizarre observations, which would be extremely unlikely at any state in $S^H$, appear forever.
Agree with this (modulo new counterexamples), but it’s not yet clear we can exploit this fact to flag tampering as such. (E.g. can we recognize the relevant notion of similarity, or do we just conclude that every state can lead to $s^M_{\text{trippy}}$ and hence we have no idea what is going on?)
One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. M first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable.
This is exactly the kind of counterexample I would give because it is so clean. That said, I do also believe there are more realistic counterexamples (and I view the focus on this kind of example as mostly a methodological point so that we can do analysis without worrying about empirics).
My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead)
I’m less convinced by this. For example, suppose that my agent first builds a machine that sits between it and its actuators, and then filters out any actions that don’t have an appropriate hash. Then I output a sequence of actions that tampers or constructs a successor. Here I am committing not to the hash of my successor, but to the hash of my actions, and so I can’t easily circumvent the check.
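A minimal, hypothetical sketch of the commitment scheme described here (interface names are mine): the agent commits to a hash of its entire future action sequence, and a filter sitting between it and its actuators rejects any stream of actions that doesn’t match that commitment, so a late swap of the plan is blocked.

```python
import hashlib

def commit(actions):
    """Commitment: hash of the full intended action sequence."""
    return hashlib.sha256("\n".join(actions).encode()).hexdigest()

class ActuatorFilter:
    """Sits between the agent and its actuators; passes only the pre-committed sequence."""
    def __init__(self, commitment):
        self.commitment = commitment

    def execute(self, actions):
        if commit(actions) != self.commitment:
            raise PermissionError("action stream does not match the commitment")
        for action in actions:
            print("executing:", action)

planned = ["assemble successor", "load objective", "deploy"]
gate = ActuatorFilter(commit(planned))

gate.execute(planned)                                     # matches the commitment
try:
    gate.execute(planned[:-1] + ["re-aim the successor"])  # any later swap is rejected
except PermissionError as err:
    print("blocked:", err)
```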
I agree that regions of $S^M$ that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to H’s ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but
I generally agree that if we weren’t worried about the kinds of considerations discussed in the rest of these comments, either because we addressed them or because we were working in a methodology that was happy to set them aside as low probability, then it may well be possible to totally patch up these problems (and it would be worth thinking about how to do so).
I generally think the family of approaches “This action is similar to something that demonstrably tampers” is very important to consider in practice (it has come up a number of times recently in discussions I’ve had with folks about more realistic failure stories and what you would actually do to avoid them). It may be more tampering-specific than addressing ELK, but for alignment overall that’s fair game if it fixes the problems.
I’m a bit scared that every part of $s^M$ is “close” to something that is not compatible with any real-world trajectory according to H.
(a) much less conservative than typical impact-measure penalties
Definitely agree with this.
(b) if H can learn what’s going on with these regions of $S^M$ and develop corresponding regions of $S^H$, then the distance penalty would be replaced by H’s informed evaluation of them.
I’m not sure I understand this 100%, but I’m interpreting it as an instance of a more general principle like: we could combine the mechanism we are currently discussing with all of the other possible fixes to ELK and tampering, so that this scheme only needs to handle the residual cases where humans can’t understand what’s going on at all even with AI assistance (and regularization doesn’t work &c). But by that point maybe the counterexamples are rare enough that it’s OK to just steer clear of them.