Random alignment-related idea: train and investigate a “Gradient Hacker Enzyme”
TL;DR: Use meta-learning methods like MAML to train a network submodule (i.e., a circuit) that resists gradient updates across a wide variety of contexts (architectures, hyperparameters, modalities, etc.), and use mechanistic interpretability to see how it works.
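To make this concrete, here is a minimal sketch of what the meta-training loop could look like, assuming PyTorch. Everything here (HostNet, inner_update, the `enzyme` submodule, the synthetic task) is a placeholder I'm inventing for illustration, not a claim about how the setup has to be built.

```python
# Toy sketch of the meta-learning setup (PyTorch assumed; all names are placeholders).
# Inner loop: an ordinary SGD step on the host network's task loss.
# Outer loop: penalize how much the designated "enzyme" submodule moved under that step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HostNet(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.front = nn.Linear(d, d)
        # The submodule we want to learn to resist gradient updates.
        self.enzyme = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.head = nn.Linear(d, 1)

    def forward(self, x):
        return self.head(torch.relu(self.front(x)) + self.enzyme(x))

def inner_update(model, x, y, lr=1e-2):
    """One ordinary SGD step on the task loss, applied to all parameters.
    create_graph=True lets the meta-objective differentiate through this step (MAML-style)."""
    loss = F.mse_loss(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
    return [p - lr * g for p, g in zip(model.parameters(), grads)]

model = HostNet()
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
enzyme_ids = {id(p) for p in model.enzyme.parameters()}

for step in range(1000):
    # Stand-in for sampling an episode; a real setup would vary architecture, data, etc.
    x, y = torch.randn(64, 32), torch.randn(64, 1)
    updated = inner_update(model, x, y)
    # Meta-objective: after the inner step, the enzyme's parameters should not have drifted.
    drift = sum((p_new - p).pow(2).sum()
                for p, p_new in zip(model.parameters(), updated)
                if id(p) in enzyme_ids)
    meta_opt.zero_grad()
    drift.backward()
    meta_opt.step()
```

Note the obvious degenerate solution: the cheapest way to zero the drift is to zero the enzyme's gradients entirely (e.g., via the ReLU trick mentioned below), so a realistic version would presumably need extra terms keeping the enzyme's behavior intact.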
It should be possible to set up the training for goals other than "resist gradient updates," such as restricting the meta-objective to a specific sub-sub-circuit. In that case, the outer circuit might (1) instrumentally resist updates, or (2) somehow get modified while keeping its original behavioral objective intact.
This setup doesn't have to be restricted to circuits, of course; there is previous work that did this at the level of activations, although, if I understand correctly, the model found a trivial solution by exploiting ReLU. It would be interesting to extend this to more diverse setups.
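For intuition, here is my guess at what such a trivial ReLU-based solution could look like (an illustration of the general trick, not the actual mechanism from that work): if a pre-activation sits far into ReLU's flat region, the gradient flowing to everything upstream of it is exactly zero, so those weights feel no update pressure.

```python
# Minimal illustration (PyTorch assumed): a weight "protected" by a dead ReLU.
import torch

w = torch.tensor([-10.0], requires_grad=True)  # upstream weight we'd like to preserve
x = torch.tensor([1.0])
pre = w * x                    # large negative pre-activation
post = torch.relu(pre)         # ReLU outputs 0 and is locally flat here
loss = (post - 3.0) ** 2       # arbitrary downstream loss
loss.backward()
print(w.grad)                  # tensor([0.]) -- gradient descent leaves w untouched
```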
Anyway, varying this "sub-sub-circuit/activation-to-be-preserved" across meta-learning episodes would incentivize the training process to find "general" Gradient Hacker designs that aren't specific to a particular circuit or activation: a potential precursor for various forms of advanced Gradient Hackers (and a loose analogy to how enzymes accelerate reactions across many substrates).
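Sketching that variation on top of the earlier toy loop (again, all names are placeholders reused from the sketch above): each meta-episode, sample a different sub-circuit to preserve, so the learned mechanism has to generalize rather than memorize one particular weight set.

```python
# Episode-varying version of the meta-objective (reuses HostNet / inner_update from above).
import random
import torch

def sample_protected_ids(model, k=1):
    """Pick k random Linear submodules whose parameters must survive the inner step."""
    linears = [m for m in model.modules() if isinstance(m, torch.nn.Linear)]
    chosen = random.sample(linears, k)
    return {id(p) for m in chosen for p in m.parameters()}

# Inside the meta-training loop, replace the fixed enzyme_ids with
#   protected_ids = sample_protected_ids(model)
# and compute the drift penalty over protected_ids instead.
```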
What is the Theory of Impact for training a “Gradient Hacker Enzyme”?
(note: while I think these are valid, they're generated post-hoc and don't reflect how I actually came up with this idea)
Estimating the lower bound for the emergence of Gradient Hackers.
By varying the meta-learning setups, we can get an empirical estimate of the conditions under which Gradient Hackers are possible.
Perhaps gradient hackers are actually trivial to construct using tricks we haven't thought of before (like the ReLU example above). Maybe not! Perhaps they require [high model complexity / a certain modality / a reflective agent / etc.].
Why a lower bound? In a real training environment, gradient hackers would appear because of (presumably) convergent training incentives; in the meta-learning setup, by contrast, we're directly optimizing for them.
Mechanistically understanding how Gradient Hackers work.
Applying mechanistic interpretability here might not be too difficult, because the circuit is cleanly separated from the rest of the model.
There have been several speculations on how such circuits might emerge. Testing them empirically sounds like a good idea!
This is just a random idea and I’m probably not going to work on it; but if you’re interested, let me know. While I don’t think this is capabilities-relevant, this probably falls under AI gain-of-function research and should be done with caution.
Update: I’m trying to upskill mechanistic interpretability, and training a Gradient Hacker Enzyme seems like a fairly good project just to get myself started.
I don’t think this project would be highly valuable in and of itself (although I would definitely learn a lot!), so one failure mode I need to avoid is ending up investing too much of my time in this idea. I’ll probably spend a total of ~1 week working on it.