Let me see if I am on the right page here. Suppose I have some world state S, a transition function T : S → S, actions Action : S → S, and a surjective Camera : S → CameraState. Since Camera is (very) surjective, seeing a particular camera image with happy people does not imply a happy world state, because many other situations involving nanobots or camera manipulation could have created that image.
This is important because I only have a human evaluation function H : S → Boolean, not one on CameraState directly. When I look at the image with the fake happy people, I use a mocked-up H’ : CameraState → Boolean := λ cs. H(Camera⁻¹(cs)). The issue is that Camera⁻¹ points to many possible states, and in practice I might pick whichever state is a priori most likely according to a human distribution over world states Distₕ(S).
The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.
The key idea is that M(a) acts like Camera ∘ T ∘ a ∘ Camera⁻¹, so we should be able to trace out which path Camera⁻¹ took, and in turn get a probability distribution over S. So we can make a recognizer:
Recognizer : [Action] × CameraState × M → Dist(S) :=
  λ actions, cs, m. normalize([sum([L₂(M(a,cs), (C∘T∘a)(hidden_state)) for a ∈ actions]) ∀ hidden_state ∈ Camera⁻¹(cs)])
  where normalize(l) := l / sum(l)
And lastly we can evaluate our world state using Evaluate := λ actions, cs, m. E[H(Recognizer(actions, cs, m))], and Evaluate can be used as the evaluation part of a planning loop.
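To make the intended computation concrete, here is a minimal Python sketch of the Recognizer/Evaluate pair. It assumes a toy, finite stand-in for Camera⁻¹, vector-valued camera states, and actions given as callables on states; it converts prediction error into a distribution with softmax(−error) so that better matches get higher weight (anticipating the similarity-based correction later in the thread). All names are illustrative, not anything from the original comment.

```python
import numpy as np

def recognizer(actions, cs, M, camera_preimage, Camera, T):
    """Weight each candidate hidden state by how well M's prediction matches the
    'long way around' Camera(T(a(hidden_state))), summed over probe actions.

    camera_preimage(cs) is an assumed finite enumeration standing in for Camera^-1(cs);
    camera states are assumed to be numpy vectors so the L2 distance is defined.
    """
    hidden_states = camera_preimage(cs)
    errors = np.array([
        sum(np.linalg.norm(M(a, cs) - Camera(T(a(h)))) for a in actions)
        for h in hidden_states
    ])
    weights = np.exp(-(errors - errors.min()))   # softmax over negative error
    probs = weights / weights.sum()
    return list(zip(hidden_states, probs))

def evaluate(actions, cs, M, camera_preimage, Camera, T, H):
    """Evaluate := E[H(s)] under the recognized distribution over hidden states;
    a planning loop could score candidate plans with this expectation."""
    dist = recognizer(actions, cs, M, camera_preimage, Camera, T)
    return sum(p * float(H(s)) for s, p in dist)
```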
Everything seems right except I didn’t follow the definition of the regularizer. What is L2?
The trick is that if I have a faithful model M : Action × CameraState → CameraState, I can back out hidden information about the state. The idea is that M must contain information about the true state, not just CameraState, in order to make accurate predictions.
This is what we want to do, and intuitively you ought to be able to back out info about the hidden state, but it’s not clear how to do so.
All of our strategies involve introducing some extra structure, the human’s model, with state space S_H, where the map Camera_H : S_H → CameraState also throws out a lot of information.
The setup you describe is very similar to the way it is presented in Ontological crises.
ETA: also we imagine H : S_H → CameraState, i.e. the underlying state space may also be different. I’m not sure any of the state mismatches matters much unless you start considering approaches to the problem that actually exploit structure of the hidden space used within M, though.
This is what we want to do, and intuitively you ought to be able to back out info about the hidden state, but it’s not clear how to do so.
Here’s an approach I just thought of, building on scottviteri’s comment. Forgive me if there turns out to be nothing new here.
Supposing that the machine and the human are working with the same observation space (O := CameraState) and action space (A := Action), then the human’s model H : S_H → A → P(O × S_H) and the machine’s model M : S_M → A → P(O × S_M) are both coalgebras of the endofunctor F := λX. A → P(O × X), and therefore both have a canonical morphism into the terminal coalgebra of F, X ≅ FX (assuming that such an X exists in the ambient category). That is, we can map S_H → X and S_M → X. Then, if we can define a distance function on X with type d_X : X × X → R≥0, we can use these maps to define distances between human states and machine states, d : S_H × S_M → R≥0.
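For readers who prefer code to category theory, here is one way to express “both models are coalgebras of the same functor” as a shared interface. This is only an illustrative Python sketch with finite-support distributions; the names are assumptions, not part of the comment.

```python
from typing import Protocol, TypeVar, List, Tuple

S = TypeVar("S")      # hidden state: S_H for the human model, S_M for the machine model
Obs = TypeVar("Obs")  # observations, O := CameraState
Act = TypeVar("Act")  # actions, A := Action

class WorldModel(Protocol[S, Obs, Act]):
    """Both H : S_H -> A -> P(O x S_H) and M : S_M -> A -> P(O x S_M) fit this single
    shape, i.e. both are coalgebras of F X = A -> P(O x X); here P is approximated by
    a finite list of (probability, observation, next_state) triples."""

    def step(self, state: S, action: Act) -> List[Tuple[float, Obs, S]]:
        ...
```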
How can we make use of a distance function? Basically, we can use the distance function to define a kernel (e.g. K(x, y) = exp(−β·d_X(x, y))), and then use kernel regression to predict the utility of states in S_M by averaging “nearby” states in S_H, and then finally (and crucially) estimating the generalization error so that states from S_M that aren’t really near to anywhere in S_H get big warning flags (and/or utility penalties for being outside a trust region).
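A minimal sketch of that kernel-regression step, assuming we already have the distance d and some utility-labelled human states (the names and the weight_floor heuristic are illustrative assumptions, not part of the proposal):

```python
import numpy as np

def kernel_utility_estimate(s_m, labelled_human_states, d, beta=1.0, weight_floor=1e-3):
    """Estimate the utility of a machine state s_m by averaging 'nearby' human states
    under the kernel K(x, y) = exp(-beta * d(x, y)).

    labelled_human_states is a hypothetical list of (s_h, utility) pairs. Returns
    (estimate, trusted); trusted = False is the 'big warning flag' case where s_m is
    not really near anything the human has evaluated.
    """
    states, utilities = zip(*labelled_human_states)
    weights = np.exp(-beta * np.array([d(s_h, s_m) for s_h in states]))
    total = float(weights.sum())
    if total < weight_floor:
        return None, False   # outside the trust region: flag instead of guessing
    return float(weights @ np.array(utilities)) / total, True
```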
How to get such a distance function? One way is to use CMet (the category of complete metric spaces) as the ambient category, and instantiate P as the Kantorovich monad. Crank-turning yields the formula
d_X(s_H, s_M) = sup_{a : A} sup_{U : O×X ↣ R} | E_{(o, s′_H) ∼ H(s_H)(a)} U(o, s′_H) − E_{(o, s′_M) ∼ M(s_M)(a)} U(o, s′_M) |
where U is constrained to be a non-expansive map, i.e., it is subject to the condition |U(o1, s1) − U(o2, s2)| ≤ max{d_O(o1, o2), d_X(s1, s2)}. If O is discrete, I think this is maybe equivalent to an adversarial game where the adversary chooses, for every possible s_H and s_M, a partition of O and a next action, and optimizes the probability that sampled predictions from H and M will eventually predict observations on opposite sides of the partition. This distance function is canonical, but in some sense seems too strict: if M knows more about the world than H, then of course the adversary will be able to find an action policy that eventually leads the state into some region that M can confidently predict with p ≈ 1 while H finds it very unlikely (p ≪ 1). In other words, even if two states are basically concordant, this distance function will consider them maximally distant if there exists any policy that eventually leads to a maximal breakdown of bisimulation. (Both the canonical character and the too-strict character are in common with L∞ metrics.)
Inspired by this kind of corecursion but seeking more flexibility, let’s consider the induced metric on the type X × X → R≥0 itself, namely the sup-norm d_{d_X}(d1, d2) := sup_{x,y : X} |d1(x, y) − d2(x, y)|, then build a contraction map on that space and apply the Banach fixed-point theorem to pick out a well-defined d_X. For example,
T(d_X)(x_H, x_M) := sup_{a : A} ( d_PO(π_0(x_H(a)), π_0(x_M(a))) + γ · E_{x′_H ∼ π_1(x_H(a)); x′_M ∼ π_1(x_M(a))} d_X(x′_H, x′_M) )
We are now firmly in Abstract Dynamic Programming territory. The distance between two states is the maximum score achievable by an adversary playing an MDP with state space as the product S_H × S_M, the initial state as the pair (s_H, s_M) of states being compared, the one-stage reward as the divergence of predictions about observations between the two models, the dynamics as just the H and M dynamics evolving separately (but fed identical actions), and exponential discounting.
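Concretely, on a small finite toy problem this fixed point can be approximated by value iteration. A hedged sketch, where the finite-support representation and all names are assumptions:

```python
import itertools

def pairwise_distance(S_H, S_M, A, H_step, M_step, d_obs, gamma=0.9, iters=100):
    """Approximate d_X(s_H, s_M) as the fixed point of the contraction T above.

    H_step(s, a) and M_step(s, a) each return (obs_prediction, [(prob, next_state), ...]);
    d_obs is the bounded divergence d_PO between two observation predictions. Because
    d_obs is bounded and gamma < 1, repeated application of T converges (Banach).
    """
    d = {pair: 0.0 for pair in itertools.product(S_H, S_M)}
    for _ in range(iters):
        new_d = {}
        for (s_h, s_m) in d:
            best = 0.0
            for a in A:
                obs_h, next_h = H_step(s_h, a)
                obs_m, next_m = M_step(s_m, a)
                # one-stage divergence plus discounted expected distance of the two
                # successors, evolving separately but fed the same action
                future = sum(p * q * d[(nh, nm)] for p, nh in next_h for q, nm in next_m)
                best = max(best, d_obs(obs_h, obs_m) + gamma * future)
            new_d[(s_h, s_m)] = best
        d = new_d
    return d
```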
The divergence d_PO is a free parameter here; it has to be bounded, but it doesn’t have to be a metric. It could be attainable utility regret, or KL divergence, or Jensen–Shannon divergence, or Bhattacharyya distance, etc. (with truncation or softmax to keep them bounded); lots of potential for experimentation here.
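For example, here is a sketch of one of the bounded options mentioned (Jensen–Shannon), assuming observation predictions are represented as aligned finite probability vectors; it is bounded above by ln 2, so it can safely play the role of d_PO in the contraction above.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two finite observation distributions
    given as aligned probability vectors (illustrative choice of d_PO)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    with np.errstate(divide="ignore", invalid="ignore"):
        kl_pm = np.where(p > 0, p * np.log(p / m), 0.0).sum()
        kl_qm = np.where(q > 0, q * np.log(q / m), 0.0).sum()
    return float(0.5 * kl_pm + 0.5 * kl_qm)
```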
Consider a state s_M where the sensors have been tampered with in order to “look like” the human state s_H, i.e. we’ve connected the actuators and camera to a box which just simulates the human model (starting from s_H) and then feeds the predicted outputs of the human model to the camera.
It seems to me like the state s_M would have zero distance from the state s_H under all of these proposals. Does that seem right? (I didn’t follow all of the details of the example, and definitely not the more general idea.)
(I first encountered this counterexample in Alignment for advanced machine learning systems. They express the hope that you can get around this by thinking about the states that can lead to the sensor-tampered state and making some kind of continuity assumption, but I don’t currently think you can make that work and it doesn’t look like your solution is trying to capture that intuition.)
(Thanks for playing along with me as ‘breaker’!) I agree that such an s_M would have zero distance from the corresponding s_H, but I have some counterpoints:
This is a problem for ELK in general, to the extent it’s a problem (which I think is smallish-but-non-negligible). An M with this property is functionally equivalent to an M′ which actually believes that s_M refers to the same state of the real world as s_H. So the dynamics of M’s world-model don’t contain any latent knowledge of the difference at this point.
This seems to be against the ELK report’s knowledge-criterion “There is a feature of the computation done by the AI which is robustly correlated with Z.”
The only way I can think of that ELK could claim to reliably distinguish s_M from s_H is by arguing that the only plausible way to get such an M′ is via a training trajectory where some previous M_θ did treat s_M differently from s_H, and performing ELK monitoring at training checkpoints (in which case I don’t see reason to expect my approach comes off worse than others).
Such an s_M would not be incentivized by the model. Assuming that rewards factor through O, Q(s_M) = Q(s_H). So a policy that’s optimized against the world-model M wouldn’t have enough selection pressure to find the presumably narrow and high-entropy path that would lead to the tampered state from the initial state (assuming that the initial state in the real world at deployment is tamper-free).
In the real world, M has higher expected loss in s_M. If all the sensor inputs are generated by simulating the human model, M has completely blinded itself to potential threats that could disrupt the tampering and reveal a missing diamond. These real-world threats are independent of the threats that the tampering box would be stochastically simulating as part of the human model, and either kind would produce observations with high loss. Thus, the real-world expected loss in s_M is typically strictly higher than in s_H.
This seems to be against the ELK report’s knowledge-criterion “M(x) produces outputs that receive a significantly lower loss in the real world, where A′ is the right answer to Q, than in some kind of counterfactual where A is a better answer.”
The only way I can think of that M could reach an internal state like s_M without regret (with respect to the real-world dynamics) is by first constructing and delegating to an aligned successor, and only then transitioning into a state like s_M (and, effectively, completely forgetting about everything weird). Of course, if that has happened, it is too late for ELK to help anyway. We need to apply ELK to a whole plan in advance, not just to the final state.
EtA: My intuition here is similar to the one you pointed at in Alignment for advanced machine learning systems (page 14) insofar as I think reaching s_M requires previously crossing state-space territory that was detectably unsafe. However, instead of trying to smooth out a utility-relevant representation φ as they suggest (so that φ(C_t) appears to “remember” how weird φ(C_{t−1}) was), I’m saying it’s okay if the ELK-informed utility estimate Ũ(C_t = s_M) is equal to the human utility U(s_H), because since Ũ(C_{t−k}) is really bad, the safety system should avoid entering C_{t−k} and thereby avoid being in a position to enter C_t.
I’m thinking of this in a family of proposals like:
Some kinds of tampering can be easily detected (and so should get identified with states s_H where tampering has occurred)
Some other tampering can’t be easily detected, but this undetectable tampering has important similarities with detectable tampering and we could use that to recognize it.
In this case, we’re going to try to exploit the fact that detectable tampering shares a prefix of actions/states with undetectable tampering (such that later states reached in that sequence have a much higher probability of detectable tampering, as well as undetectable tampering, and can be flagged as suspicious).
I currently don’t believe this kind of thing can work (at least in the worst case), but I’m not at all confident.
One thing worth mentioning (which makes me much more optimistic about this class of proposals): if your AI delegates to another AI who helps actually implement the tampering, that makes things harder, but in some sense it also provides extra structure you can work with in a solution. Your AI needs to reason about that other AI’s reasoning, and so you could hope that (i) if your AI is reasoning about the other AI’s reasoning in detail, then we can apply exactly the same machinery to that AI’s predictive model in order to elicit its beliefs (which your AI obviously trusts) about what is going on, and (ii) if your AI reasons abstractly about the other AI’s reasoning, you could hope to treat that as a shortcut to reach the same conclusions and incentivize the model to give you the same answers that it would if it were reasoning explicitly (modulo its ability to figure it out).
(I also think the single-AI case basically needs to be treated in the same way as that—I don’t think in general your AI is going to be reasoning explicitly about its own future actions; it just needs to reason abstractly about its future self in a qualitatively similar way to how it reasons about other AIs. Or at least in the worst case its reasoning about itself is just as messy and complex as its reasoning about other AIs.)
The sense in which the model knows about the corruption is that it brought it about and reasoned about the nature of the sensor tampering in order to predict the transition to s_M^corrupted.
The reason I’m concerned that it brings about this state is because the actual good state s_M^good is much harder to access than s_M^corrupted (e.g. because it requires achieving hard real-world goals). The intuition is that s_M^corrupted has constant difficulty while s_M^good gets harder and harder as we make the tasks more sophisticated.
I’m OK having the human look at entire trajectories, so it’s provisionally OK (though not the most satisfying) if s_M^corrupted gets identified with s_H^good as long as the preceding states look bad to the human and then they can do some kind of inference on top to figure out what happened. My main question is why those preceding states get mapped correctly.
If the preceding state s_M^preceding gets corrupted via a mechanism that isn’t obvious to the human, and otherwise looks like a predecessor to s_H^good, then it seems like it should get identified with a predecessor of s_H^good.
The only way I see this working is if the AI can “abort” its plan at the last minute and demonstrate a hack. But my main concerns are that (i) that doesn’t seem true in general—the AI can tamper with sensors in ways humans don’t understand starting from many initial states, and you can’t distinguish the “almost-tampered” state naturally since it’s just one more state that has this property, (ii) even if that happens it doesn’t look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human’s model, though if I wasn’t worried about (i) I’d want to think about (ii) much more carefully and actually construct a counterexample.
the actual good state s_M^good is much harder to access than s_M^corrupted (e.g. because it requires achieving hard real-world goals)
That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to s_M^good), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like “zero incentive to tamper, and tampering seems complicated” fail here.
The sense in which the model knows about the corruption is that it brought it about and reasoned about the nature of the sensor tampering in order to predict the transition to s_M^corrupted.
While some Ms may indeed predict this via reasoning, not all Ms that behave this way would; for example, an M that internally modeled the tampering sequence of actions incorrectly as actually leading to s_M^good (and didn’t even model a distinct s_M^corrupted). I think either:
(A) it would be at least as apt to ascribe a confused model to M as to ascribe one in which it “reasoned about the nature of the sensor tampering” (e.g. if a contemporary model-free RL robot did some sensor tampering, I’d probably ascribe to it the belief that the tampering actually led to s_M^good), or
(B) M would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that M predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in H, or null my-actuators-are-disconnected actions) with higher probability in the future of s_M^corrupted than in the future of s_H^good (making those states separable in X), or
(C) If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then s_M^corrupted does get identified with s_H^good, but some other action sequence (of similar length) would lead from s_M^preceding to s_M^trippy, a state in which bizarre observations appear forever that would be extremely unlikely at any state in S_H.
My main question is why those preceding states get mapped correctly. [...] The only way I see this working is if the AI can “abort” its plan at the last minute and demonstrate a hack. [...] you can’t distinguish the “almost-tampered” state naturally since it’s just one more state that has this property.
First, I don’t think timing (“last minute”, “almost-tampered”) is critical here. If timing were critical, the ‘breaker’ could reparameterize the situation so that complicated plans can be encoded into single actions in A; then M can go directly from the initial state to fully tampered with a guarding successor—no room for an “almost-tampered” state. In this case the initial state s_M^0 must already be detectable as bad. I imagine this detection working because the adversary (which is as strong as M) can exhibit an action that leads from s_M^0 to s_M^trippy, which establishes s_M^0 as far from S_H. In this case, of course, M is a no-go, and the overseers probably need to shrink A.
With a reasonably small action space, my intuition is this: in constructing and delegating to a successor which provides M with delusions, there is a point of no return when the successor is “deployed” and begins to resist changes to its objectives. At any time before that point, M could, with a relatively short action sequence (compared to constructing the successor at all), modify the successor’s objectives so that it induces s_M^trippy instead of s_M^corrupted. As the successor comes closer to being complete, the total number of actions the adversary would need in order to reach s_M^trippy (to, as you say, “demonstrate a hack”) decreases, so the distance between s_M and S_H increases. Thereby, the path to s_M^corrupted passes through detectably bad territory.
One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. M first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable. Such failure modes are in a natural correspondence with safety mechanisms that humans could use to ensure the objective specification in an executable AI is tamper-proof. My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead); if we can construct a counterexample it would be very useful for that other purpose!
(ii) even if that happens it doesn’t look to me like the proposed loss actually captures that if there are other similarly-severe predictive errors in the human’s model
I agree that regions of S_M that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to H’s ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but (a) much less conservative than typical impact-measure penalties, and (b) if H can learn what’s going on with these regions of S_M and develop corresponding regions of S_H, then the distance penalty would be replaced by H’s informed evaluation of them.
That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to s_M^good), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like “zero incentive to tamper, and tampering seems complicated” fail here.
Even for “homeostatic” tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There may be some upper bound where you can defend against arbitrarily sophisticated attackers with finite difficulty, but I don’t know whether that’s true or whether it would be higher or lower than the difficulty of sensor tampering.
While some Ms may indeed predict this via reasoning, not all Ms that behave this way would; for example, an M that internally modeled the tampering sequence of actions incorrectly as actually leading to s_M^good (and didn’t even model a distinct s_M^corrupted).
I agree that some M’s would mistakenly expect a sequence of actions to lead to good outcomes, when they really lead to tampering. But planning against such M’s couldn’t lead to correct tampering (since the details that determine successful tampering are different from those that lead to good outcomes) and in some sense such M’s also can’t be expected to defend against tampering. So I don’t currently think this is a big problem though I might well be missing something.
M would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that M predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in H, or null my-actuators-are-disconnected actions) with higher probability in the future of s_M^corrupted than in the future of s_H^good (making those states separable in X), or
I generally agree with this—in some sense this kind of “definitely no signals ever” tampering seems pretty unrealistic and it should be much easier to catch more realistic forms of tampering.
If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then s_M^corrupted does get identified with s_H^good, but some other action sequence (of similar length) would lead from s_M^preceding to s_M^trippy, a state in which bizarre observations appear forever that would be extremely unlikely at any state in S_H.
Agree with this (modulo new counterexamples), but it’s not yet clear we can exploit this fact to flag tampering as such. (E.g. can we recognize the relevant notion of similarity, or do we just conclude that every state can lead to s_M^trippy and hence we have no idea what is going on?)
One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. M first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable.
This is exactly the kind of counterexample I would give because it is so clean. That said, I do also believe there are more realistic counterexamples (and I view the focus on this kind of example as mostly a methodological point so that we can do analysis without worrying about empirics).
My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead)
I’m less convinced by this. For example, suppose that my agent first builds a machine that sits between the agent and its actuators, and then filters out any actions that don’t have an appropriate hash. Then I output a sequence of actions that tampers or constructs a successor. Here I am committing not to the hash of my successor, but to the hash of my actions, and so I can’t easily circumvent the check.
I agree that regions of S_M that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to H’s ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but
I generally agree that if we weren’t worried about the kinds of considerations discussed in the rest of these comments, either because we addressed them or we were working in a methodology that was happy to set them aside as low probability, then it may well be possible to totally patch up these problems (and would be worth thinking about how to do so).
I generally think the family of approaches “This action is similar to something that demonstrably tampers” is very important to consider in practice (it has come up a number of times recently in discussions I’ve had with folks about more realistic failure stories and what you would actually do to avoid them). It may be more tampering-specific than addressing ELK, but for alignment overall that’s fair game if it fixes the problems.
I’m a bit scared that every part of s_M is “close” to something that is not compatible with any real-world trajectory according to H.
(a) much less conservative than typical impact-measure penalties
Definitely agree with this.
(b) if H can learn what’s going on with these regions of S_M and develop corresponding regions of S_H, then the distance penalty would be replaced by H’s informed evaluation of them.
I’m not sure I understand this 100%, but I’m interpreting it as an instance of a more general principle like: we could combine the mechanism we are currently discussing with all of the other possible fixes to ELK and tampering, so that this scheme only needs to handle the residual cases where humans can’t understand what’s going on at all even with AI assistance (and regularization doesn’t work &c). But by that point maybe the counterexamples are rare enough that it’s OK to just steer clear of them.
Everything seems right except I didn’t follow the definition of the regularizer. What is L2?
Thank you for the fast response! By L₂ I meant the Euclidean norm, measuring the distance between two different predictions of the next CameraState. But actually I should have been using a notion of vector similarity such as the inner product, and also I’ll unbatch the actions for clarity:
Recognizer’ : Action × CameraState × M → Dist(S) :=
  λ a, cs, m. softmax([⟨M(a,cs), (C∘T∘a)(hidden_state)⟩ ∀ hidden_state ∈ Camera⁻¹(cs)])
So the idea is to consider all possible hidden_states such that the Camera would display the current CameraState cs, and create a probability distribution over those hidden_states according to the similarity of M(a,cs) and (C∘T∘a)(hidden_state). Which is to say: how similar would the resulting CameraState be if I went the long way around, taking the hidden_state and applying my action, transition, and Camera functions?
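Spelled out as code, the unbatched, similarity-based version might look like the following sketch, under the same toy assumptions as the earlier one (a finite stand-in for Camera⁻¹ and vector-valued camera states; names are illustrative):

```python
import numpy as np

def recognizer_prime(a, cs, M, camera_preimage, Camera, T):
    """Distribution over candidate hidden states for camera state cs, scored by
    inner-product similarity between M's one-step prediction and the 'long way
    around' Camera(T(a(hidden_state)))."""
    hidden_states = camera_preimage(cs)
    scores = np.array([np.dot(M(a, cs), Camera(T(a(h)))) for h in hidden_states])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    return list(zip(hidden_states, weights / weights.sum()))
```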
The setup you describe is very similar to the way it is presented in Ontological crises.
Great, I’ll take a look.
All of our strategies involve introducing some extra structure, the human’s model, with state space S_H, where the map Camera_H : S_H→CameraState also throws out a lot of information.
Right, so I wasn’t understanding the need for something like this, but now I think I see what is going on. I made an assumption above that I have some human value function H : S → Boolean. If I have some human internal state S_H, and I relax the human value function to H_V : S_H → Boolean, then the solution I have above falls apart, but here is another.
Now the goal is to create a function F from the machine state to human state, so that the human value function will compose with F to take machine states as input.
I am using all fresh variable names starting here.
S_H — type of human knowledge
S_M — type of machine knowledge
CameraState — type of camera output
EyeState — type of eye output

Inputs:
H_V : S_H → Boolean — human value function
Camera : S → CameraState (very surjective)
Eye : S → EyeState (very surjective)
Predict_M : S_M × [CameraState] × [Action] → S_M — machine prediction function (strong)
Predict_H : S_H × [EyeState] × [Action] → S_H — human prediction function (weak)

Intermediates:
Recognizer_M : S_M → Dist S := Part2 ∘ Part1
  (Intuitively, it seems like we can try many predictions to get a relation between S_M and CameraState, and then go from CameraState to Dist S.)
  Part1 : S_M → CameraState :=
    InterpolateAssocList([(Predict_M(sm, css, as), cs)
                          for css in camera_sequences for as in action_sequences])
  Part2 : CameraState → Dist S := Camera⁻¹
Recognizer_H : Dist S → S_H :=
  ExpectedValue { λ D. do s ← D; as ← actions; let es = Eye(s); Predict_H(Prior_H, [es], as) }
  where actions is a distribution over lists of actions.
F : S_M → S_H := Recognizer_M ∘ Recognizer_H — function from machine to human state

Desired Output:
Win : S_M → Boolean := H_V ∘ F — lift the value function to machine state
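As plumbing, the intended lift of the value function might look like this sketch; the composition is written in the order the types force (which the nitpick below also points out), and all three inputs are assumed to be supplied.

```python
def make_win(recognizer_m, recognizer_h, h_v):
    """Glue sketch of Win = H_V ∘ F.

    recognizer_m : S_M -> Dist S, recognizer_h : Dist S -> S_H and h_v : S_H -> bool
    are assumed to be given; nothing here is more than illustrative plumbing.
    """
    def win(s_m):
        dist_over_world_states = recognizer_m(s_m)           # S_M -> Dist S
        human_state = recognizer_h(dist_over_world_states)   # Dist S -> S_H
        return h_v(human_state)                              # S_H -> Boolean
    return win
```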
I didn’t follow some parts of the new algorithm. Probably most centrally: what is Dist(S)? Is this the type of distributions over real states of the world, and if so how do we have access to the true map Camera : S → video? Based on that I likely have some other confusions, e.g. where are the camera_sequences and action_sequences coming from in the definition of Recognizer_M, what is the prior being used to define Camera⁻¹, and don’t Recognizer_M and Recognizer_H effectively advance time a lot under some kind of arbitrary sequences of actions (making them unsuitable for exactly matching up states)?
Nitpicks:
F should be Recognizer_H ∘ Recognizer_M, rather than Recognizer_M ∘ Recognizer_H.
In Recognizer_H, I don’t think you can take the expected value of a stochastic term of type S_H, because S_H doesn’t necessarily have convex structure. But you could have Recognizer_H output Dist S_H instead of taking the ExpectedValue, move the ExpectedValue into Win, and have Win output a probability rather than a Boolean.
Confusions:
Your types for Predict_M and Predict_H seem to not actually make testable predictions, because they output the opaque state types, and only take observations as inputs.
I’m also a bit confused about having them take lists of actions as a primitive notion. Don’t you want to ensure that, say, (Predict_M s css (as1++as2)) = (Predict_M (Predict_M s css as1) as2)? If so, I think it would make sense to accept only one action at a time, since that will uniquely characterize the necessary behavior on lists (see the sketch after this list).
I don’t really understand Part1. For instance, where does the variable cs come from there?
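One way to picture the single-action suggestion above: take a hypothetical single-step primitive and recover the list-taking Predict_M as a fold, so that feeding a split stream (as1 ++ as2) piece by piece agrees with feeding it all at once by construction. step_m and the pairing of observations with actions are assumptions for illustration.

```python
from functools import reduce

def predict_m_list(step_m, s_m, camera_states, actions):
    """Derive the list version of Predict_M from a single-step
    step_m : S_M x CameraState x Action -> S_M by folding it over the
    paired observation/action stream."""
    return reduce(lambda s, oa: step_m(s, oa[0], oa[1]), zip(camera_states, actions), s_m)
```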