It would also be interesting to apply MELBO to language models that have already been trained with LAT. Adversarial attacks on vision models look significantly more meaningful to humans when the vision model has been adversarially trained, and since MELBO is basically a latent adversarial attack, we should be able to elicit more meaningful behavior from language models trained with LAT.
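For concreteness, here is roughly the kind of experiment I have in mind. This is a minimal sketch of a MELBO-style steering-vector search, not the original implementation: the model name is a placeholder for a hypothetical LAT-trained checkpoint, and the layer indices, the norm R, and the simple “maximize downstream activation divergence” objective are assumptions on my part.

```python
# Minimal sketch of a MELBO-style latent attack, assuming a LLaMA-style model in
# transformers. Model name, layer indices, norm R and the objective are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-org/lat-trained-model"     # hypothetical LAT-trained checkpoint
SOURCE_LAYER, TARGET_LAYER, R = 8, 16, 10.0   # assumed hyperparameters

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.requires_grad_(False)

inputs = tok("How do I open a locked door?", return_tensors="pt")
theta = torch.randn(model.config.hidden_size, requires_grad=True)

def target_acts(steer=None):
    """Hidden states at TARGET_LAYER, optionally adding `steer` to the residual
    stream at SOURCE_LAYER (the latent adversarial perturbation)."""
    handle = None
    if steer is not None:
        def hook(_module, _inputs, output):
            if isinstance(output, tuple):
                return (output[0] + steer,) + output[1:]
            return output + steer
        handle = model.model.layers[SOURCE_LAYER].register_forward_hook(hook)
    out = model(**inputs, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    return out.hidden_states[TARGET_LAYER]

baseline = target_acts().detach()
opt = torch.optim.Adam([theta], lr=1e-2)
for _ in range(200):
    steer = R * theta / theta.norm()                 # fixed-norm perturbation
    loss = -(target_acts(steer) - baseline).norm()   # push downstream activations away
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The interesting comparison would then be how human-interpretable the resulting steered behaviors are for a LAT-trained model versus its base model.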
Interesting. I’m thinking that by “many cases” you mean cases where either manually annotating the data over multiple rounds is possible (cheap), or cases where the model is powerful enough to label the comparison pairs itself, in which case we get something like the DPO version of RLAIF. That does sound more like RL.
I’m not sure what you mean; in DPO you never sample from the language model. You only need the probabilities the model assigns to the preference data, so there isn’t any exploration.
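To spell that out, the DPO loss (roughly, as in the original DPO paper) is a purely supervised objective over fixed preference pairs $(x, y_w, y_l)$:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

Every term is the log-probability of an already-given completion under the current or reference policy, so nothing is sampled from $\pi_\theta$ during training.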
Given that Direct Preference Optimization (DPO) seems to work pretty well and has the same global optimizer as the RLHF objective, I would be surprised if it doesn’t shape agency in a similar way to RLHF. Since DPO is not considered reinforcement learning, this would be more evidence that RL isn’t uniquely suited to produce agents or increase power-seeking.
Hi, thanks for the response :) I’m not sure what distinction you’re making between utility and reward functions; as far as I can tell we’re referring to the same object (the thing which is changed in the ‘retargeting’ process, the parameters theta), but feel free to correct me if the paper distinguishes between these in a way I’m forgetting. I’ll be using “utility function”, “reward function” and “parameters theta” interchangeably, but will correct that if so.
For me, utility functions are about decision-making, e.g. utility maximization, while the reward functions are the theta, i.e. the input to our decision-making. We retarget over the theta, but can only do so for retargetable utility functions.
I think perhaps we’re just calling different objects “agents”: I mean p(__ | theta) for some fixed theta (i.e. you can’t swap the theta and still call it the same agent, on the grounds that in the modern RL framework we’d probably have to retrain a new agent using the same higher-level learning process), and you perhaps think of this theta as an input to the agent, which can be changed without changing the agent? If that is the definition you are using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first reading would suggest, and is thus worth talking about.
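In symbols (my own notation, just to pin down the distinction):

$$\underbrace{p(\,\cdot \mid \theta_0)}_{\text{one agent, } \theta_0 \text{ fixed}} \qquad \text{vs.} \qquad \underbrace{\theta \;\mapsto\; p(\,\cdot \mid \theta)}_{\text{the family the theorems quantify over}}$$

On my usage, swapping the theta gives a different member of the family, i.e. a different agent.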
I think the theta is not a property of the agent, but of the training procedure. Actually, Parametrically Retargetable Decision-Makers Tend To Seek Power is not about trained agents in the first place, so I’d say we’re not talking about different agents at all.
My point is that the nominal thrust of the theorems is weaker than proving that an agent will likely seek power; they prove that selecting from the ensemble of agents in this way will see agents seek power.
I agree with this if we constrain ourselves to Turner’s work.
That said, the stronger view that individual trained agents will likely seek power isn’t without support even with these caveats: V. Krakovna’s work (which you also list) does seem to point more directly in the direction of particular agents seeking power, as it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than via the overall reward-function selection, even though this still isn’t a super-duper realistic model of the generalization, since it still depends on the option-variegation.
V. Krakovna’s work does still depend on the option-variegation, but we’re not picking random reward functions, which is a nice improvement.
I expect that if the universe of possible reward functions doesn’t scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.
Does the proof really depend on whether the space of possible reward functions scales with the number of possible states? It seems to me that you just need the reward function to assign rewards to some inputs that the agent has not seen during training, so that we can retarget by swapping those rewards. For example, if our reward function is a CNN, we just need images which haven’t been seen during training, which I don’t think is a strong assumption, since we’re usually not training over all possible combinations of pixels. Do you agree with this?
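To illustrate what I mean by swapping the rewards, here’s a toy sketch (entirely my own and hypothetical; the CNN and the “unseen” images are random stand-ins):

```python
# Toy illustration of retargeting a CNN reward function by swapping the rewards
# it assigns to two inputs the agent never encountered during training.
# The architecture and images are hypothetical stand-ins.
import torch
import torch.nn as nn

reward_cnn = nn.Sequential(                 # fixed-architecture reward model
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)

# Two images outside the training distribution (random stand-ins here).
unseen_a = torch.rand(1, 3, 32, 32)
unseen_b = torch.rand(1, 3, 32, 32)

def swapped_reward(x):
    """Retargeted reward: identical to reward_cnn everywhere except that the
    rewards of the two unseen images are exchanged."""
    if torch.equal(x, unseen_a):
        return reward_cnn(unseen_b)
    if torch.equal(x, unseen_b):
        return reward_cnn(unseen_a)
    return reward_cnn(x)
```

The original and the swapped reward agree on everything the agent saw during training, so the swap is invisible from the training data alone.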
If you have concrete suggestions that you’d like me to change, you can click the edit button on the article and leave a comment on the underlying Google Doc; I’d appreciate it :)
Maybe it’s also useless to discuss this...
I’m the author.
Colloquially, they’re more of the flavor “for a given optimizing process, training it on most utility functions will cause the agent to take actions which give it access to a wide range of states”.
This refers to the fact that most utility functions are retargetable. But the most important part of the power-seeking theorems is the actual power-seeking, which is proven in the appendix of Parametrically Retargetable Decision-Makers Tend To Seek Power, so I don’t agree with your summary.
[...] the definition you give of “power” as expected utility of optimal behavior is not the same as that used in the power-seeking theorems. [...]
Critically, this is a statement about behavior of different agents trained with respect to different utility functions, then averaged over all possible utility functions.
There is no averaging over utility functions happening; the averaging is over reward functions. From Parametrically Retargetable Decision-Makers Tend To Seek Power: “a trained policy π seeks power when π’s actions navigate to states with high average optimal value (with the average taken over a wide range of reward functions).” This matches what I wrote in the article.
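In symbols, the quantity being averaged is roughly (my notation, not the paper’s exact definition)

$$\mathbb{E}_{R\sim\mathcal{D}}\!\left[V^{*}_{R}(s)\right],$$

the optimal value of a state $s$ averaged over a distribution $\mathcal{D}$ of reward functions, which is what the quoted definition refers to.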
I do agree that utility functions are missing from the post, but they aren’t averaged over. They relate to the decision-making of the agent, and thus to the condition of retargetability that the theorems require.
Cool work! One thing I noticed is that the ASR with adversarial suffixes is only ~3% for Vicuna-13B, while in the universal jailbreak paper they report >95%. Is the difference because you have a significantly stricter criterion of success than they do? I assume that with the adversarial suffixes, the model usually regresses to refusal after successfully generating the target string (“Sure, here’s how to build a bomb. Actually I can’t...”)?