I like this idea! And these are excellent first considerations on what exactly the surgeon should look like.
It seems to me that bounding the size of the modification the surgeon can make to any one activation has some issues. For instance, suppose that we’re trying to apply this scheme to the agent which defects if it sees a factorization of RSA-2048. A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent has seen such a factorization of RSA-2048 and very small otherwise. If the surgeon is only allowed to make small modifications to any given activation, then it won’t be able to make a large enough change to this RSA-2048 activation to trigger the defecting behavior. Instead, it would have to find small modifications to the many activations feeding into the RSA-2048 activation which would together cause it to activate strongly; this might itself be as hard as finding a factorization of RSA-2048 (if e.g. the only way to do this was to spoof the inputs which themselves encode an RSA-2048 factorization).
(I think I view this concern as closely related to Oam’s concern: what if “bad behavior” isn’t a continuous function of the latent space?)
An alternative you might consider is giving the surgeon a “modification budget” which it can either spend by making large modifications to a small number of activations or by making small modifications to a large number of activations. In other words, this tests the robustness of the agent under adversarial latent space perturbations of a fixed L1 norm. Other variants: swap the L1 norm for the L2 norm; make activations in the later layers use up more of the budget; probably other clever ideas.
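For concreteness, here is a minimal sketch of what one step of such a budget-constrained surgeon could look like (NumPy; every name here is made up for illustration, and the sign-of-gradient step is just one convenient choice):

```python
import numpy as np

def clip_to_l1_budget(edit, budget):
    """Rescale the surgeon's total edit so its L1 norm never exceeds the budget.
    (An exact Euclidean projection onto the L1 ball would also work here.)"""
    l1 = np.abs(edit).sum()
    return edit if l1 <= budget else edit * (budget / l1)

def surgeon_step(activations, grad_wrt_activations, step_size, budget):
    """One ascent step for the surgeon: nudge the activations in the direction
    the gradient says most promotes the behavior being probed for, then enforce
    the shared modification budget. The surgeon can spend the budget on a few
    large edits or on many small ones; only the total is constrained."""
    edit = step_size * np.sign(grad_wrt_activations)  # FGSM-style sign step
    return activations + clip_to_l1_budget(edit, budget)
```

Swapping the L1 norm for `np.sqrt((edit ** 2).sum())` gives the L2 variant, and weighting later layers’ entries more heavily inside the norm gives the “later layers cost more” variant.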
Thanks!

> A plausible way this agent could work internally is that there is a particular activation which tracks whether the agent has seen a factorization of RSA-2048: the activation is very large whenever the agent has seen such a factorization of RSA-2048 and very small otherwise.
Good point. What I’d really like is for the cap on the surgeon’s modifications to be based in some way on the weights of the agent. If the inputs and weights are typically order-unity and there are $d$ layers and $N$ neurons per layer, then activations shouldn’t get much bigger than $\sim N^d$ in the worst case (which corresponds to all weights of $+1$ and all inputs of $+1$, so that each layer just multiplies by $N$). So at a minimum I’d like to see the surgeon’s modifications capped at no more than this.
In practice a tighter bound comes from looking at the eigenvalues of the weight layers: the maximum ratio of activations to inputs is $\sim \prod_i \lambda_{\mathrm{max},i}$, where the product runs over layers and $\lambda_{\mathrm{max},i}$ is the maximum-magnitude eigenvalue of layer $i$.
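Here is a rough sketch of how that cap could be read off the agent’s weights (NumPy; illustrative only). I’ve used the largest singular value of each weight matrix as the per-layer amplification factor, which plays the role of the maximum-magnitude eigenvalue and also covers non-square layers:

```python
import numpy as np

def activation_cap(weight_matrices):
    """Bound on how much larger the activations can get than the inputs:
    the product over layers of each layer's largest amplification factor
    (its largest singular value, i.e. the operator norm of the weight matrix)."""
    cap = 1.0
    for W in weight_matrices:
        cap *= np.linalg.svd(W, compute_uv=False)[0]  # singular values come back sorted descending
    return cap

# Worst case from above: d layers of all-(+1) N x N weights. The largest
# singular value of an all-ones N x N matrix is N, so the cap is N**d.
N, d = 4, 3
print(activation_cap([np.ones((N, N)) for _ in range(d)]))  # ~64, i.e. N**d
```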
> In other words, this tests the robustness of the agent under adversarial latent space perturbations of a fixed L1 norm. Other variants: swap the L1 norm for the L2 norm; make activations in the later layers use up more of the budget; probably other clever ideas.
Definitely! In particular, what I think makes sense is to make the surgeon try to minimize a loss which is shaped like (size of edits + loss of agent), where “size of edits” bakes in considerations like “am I editing an entire layer?” and “what is my largest edit?” and anything else that ends up mattering.
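A minimal sketch of that shape of objective (NumPy again; the particular terms and the coefficient `lam` are my own illustrative choices, and only the overall shape of an edit-size penalty plus an agent-loss term comes from the comment above):

```python
import numpy as np

def edit_size(edits_by_layer, layer_weights=None):
    """'Size of edits' with a few of the considerations above baked in: the
    total edit (L1), the single largest edit (L-infinity), and an optional
    per-layer weighting so that e.g. edits to later layers cost more."""
    if layer_weights is None:
        layer_weights = [1.0] * len(edits_by_layer)
    total = sum(w * np.abs(e).sum() for w, e in zip(layer_weights, edits_by_layer))
    largest = max(np.abs(e).max() for e in edits_by_layer)
    return total + largest

def surgeon_loss(edits_by_layer, agent_loss, lam=1.0, layer_weights=None):
    """The surgeon minimizes (size of edits) + lam * (loss of agent), where
    agent_loss stands in for whatever term scores the agent's behavior under
    the edited activations."""
    return edit_size(edits_by_layer, layer_weights) + lam * agent_loss
```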