Hi, thanks for the response :) So I’m not sure what distinction you’re making between utility and reward functions, but as far as I can tell we’re referring to the same object—the thing which is changed in the ‘retargeting’ process, the parameters theta—but feel free to correct me if the paper distinguishes between these in a way I’m forgetting. I’ll be using “utility function”, “reward function” and “parameters theta” interchangeably, but will correct that if so.
I think perhaps we’re just calling different objects “agents”—I mean p(__ | theta) for some fixed theta (i.e. you can’t swap the theta and still call it the same agent, on the grounds that in the modern RL framework we’d probably have to retrain a new agent using the same higher-level learning process), whereas you perhaps think of this theta as an input to the agent, which can be changed without changing the agent? If this is the definition you are using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first reading would suggest, and thus is worth talking about.
I’ll outline my meaning more below to try and clarify:
The salient distinction I wanted to make was that this theorem applies over something like ensembles of agents—that if you choose a parameter-vector theta at random, the agent will have a random preference over observations; then, because some actions immediately foreclose a lot of observations (e.g., dying prevents you from leaving the first room of Montezuma’s Revenge), those actions become unlikely to be favoured. My point is that the nominal thrust of the theorems is weaker than proving that an agent will likely seek power; they prove that selecting from the ensemble of agents in this way will see agents seek power.
The difference is that this doesn’t super clearly apply to a single agent. In particular, in real life we do not select these theta uniformly at random at all; the model example is RLHF, the point of which is, schematically, to concentrate a bunch of probability mass on the thetas we like within this overall space. Given that any selection process we actually use should be ~invariant to option-variegation (i.e. shouldn’t favour a family of outcomes more just because it has ‘more members’), this severely limits the applicability of the theorems to practical agent-selection processes—their premises (I think) basically amount to assuming option-variegation is the only thing that matters.
As a concrete example of what I mean, consider the MDP where at each step you can choose either 0 to end the game or 1 to continue, up to a total of T steps. Say the universe of possible reward functions is the integers 1 to T, with reward function k giving reward for quitting exactly at timestep k and 0 otherwise. Also ignore time-discounting for simplicity. Then the average optimal policy goes to about step T/2 before quitting (for reward function k the optimal policy continues until step k and then quits, and k averages to roughly T/2 over the ensemble)—however, this says nothing about what a particular trained agent will do. If we wanted to train an RL agent to end at, say, step 436, even if T is 10^6, the power-seeking theorems aren’t a meaningful barrier to doing so, because we specify the reward function rather than selecting it from an ensemble at random. Similarly, even though a room with an extra handful of atoms has combinatorially many more states than one without those extra atoms, we wouldn’t reasonably expect an RL agent to prefer the former to the latter solely on the grounds that having more states means more reward functions will favor those states.
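To make the counting concrete, here is a quick Python sketch (my own toy illustration, not from the paper) of that quit-or-continue MDP: selecting the reward function uniformly at random gives an average quit step of about T/2, while specifying the reward function ourselves pins the behaviour regardless of T.

    import random

    T = 10**6  # horizon: at each step the agent chooses 0 (end the game) or 1 (continue)

    def optimal_quit_step(k):
        # For reward function k (reward only for quitting exactly at step k, no discounting),
        # the optimal policy is to keep continuing and then quit at step k.
        return k

    # Selecting the reward function uniformly at random from the ensemble 1..T:
    draws = [optimal_quit_step(random.randint(1, T)) for _ in range(10000)]
    print(sum(draws) / len(draws))   # roughly T/2: the "average" agent keeps going for a long time

    # Specifying the reward function, as we do in practice:
    print(optimal_quit_step(436))    # 436, no matter how large T is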
That said, the stronger view that individual trained agents will likely seek power isn’t without support even with these caveats—V. Krakovna’s work (which you also list) does seem to point more directly in the direction of particular agents seeking power, as it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than to model the overall reward-function selection that way, even though this still isn’t a super-duper realistic model of the generalization, since it still depends on the option-variegation. I expect that if the universe of possible reward functions doesn’t scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.
Let me know your thoughts :)
Hi, thanks for the response :) So I’m not sure what distinction you’re making between utility and reward functions, but as far as I can tell we’re referring to the same object—the thing which is changed in the ‘retargeting’ process, the parameters theta—but feel free to correct me if the paper distinguishes between these in a way I’m forgetting. I’ll be using “utility function”, “reward function” and “parameters theta” interchangeably, but will correct that if so.
For me, utility functions are about decision-making, e.g. utility-maximization, while the reward functions are the theta, i.e. the input to our decision-making. The theta is what we are retargeting over, but we can only do so for retargetable utility functions.
I think perhaps we’re just calling different objects “agents”—I mean p(__ | theta) for some fixed theta (i.e. you can’t swap the theta and still call it the same agent, on the grounds that in the modern RL framework we’d probably have to retrain a new agent using the same higher-level learning process), whereas you perhaps think of this theta as an input to the agent, which can be changed without changing the agent? If this is the definition you are using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first reading would suggest, and thus is worth talking about.
I think the theta is not a property of the agent but of the training procedure. Actually, Parametrically retargetable decision-makers tend to seek power is not about trained agents in the first place, so I’d say we were never talking about different agents to begin with.
My point is that the nominal thrust of the theorems is weaker than proving that an agent will likely seek power; they prove that selecting from the ensemble of agents in this way will see agents seek power.
I agree with this if we constrain ourselves to Turner’s work.
That said, the stronger view that individual trained agents will likely seek power isn’t without support even with these caveats—V. Krakovna’s work (which you also list) does seem to point more directly in the direction of particular agents seeking power, as it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than to model the overall reward-function selection that way, even though this still isn’t a super-duper realistic model of the generalization, since it still depends on the option-variegation.
V. Krakovna’s work still depends on the option-variegation, but we’re not picking random reward functions, which is a nice improvement.
I expect that if the universe of possible reward functions doesn’t scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.
Does the proof really depend on whether the universe of possible reward functions scales with the number of possible states? It seems to me that you just need some rewards from the reward function that the agent has not seen during training, so that we can retarget by swapping those rewards. For example, if our reward function is a CNN, we just need images which haven’t been seen during training, which I don’t think is a strong assumption, since we’re usually not training over all possible combinations of pixels. Do you agree with this?
If you have concrete suggestions for changes you’d like me to make, you can click the edit button on the article and leave a comment on the underlying Google Doc; I’d appreciate it :)
Maybe it’s also useless to discuss this...
I think we agree modulo terminology with respect to your remarks, up to the part about the Krakovna paper, which I had to sit and think about a little bit more.
For the Krakovna paper, you’re right that it has a different flavor than I remembered—it still seems, though, that the proof relies on having some ratio of recurrent vs. non-recurrent states. So if you did something like 1000x-ing the number of terminal states, the reward functions would be ~1000x less retargetable towards recurrent states—I think this is still true even if the new terminal states are entirely unreachable?
With respect to the CNN example I agree, at least at a high level—though technically the theta reward vectors are supposed to have dimension |S|, specifying a reward for each state, which is slightly different from being the weights of a CNN—but without redoing the math, it’s plausible that an analogous theorem would hold. Regardless, the non-shutdown result gives retargetability because it assumes there’s a single terminal state and many recurrent states. The degree of retargetability is really just the ratio (number of recurrent states) / (number of terminal states), which needn’t be greater than one.
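To illustrate that counting argument, here is a small Python sketch (my own simplification, assuming an optimal agent simply heads for whichever state carries the highest reward under an i.i.d.-random reward vector): the fraction of reward draws that prefer a recurrent state falls off as you multiply the number of terminal states.

    import random

    def frac_preferring_recurrent(n_recurrent, n_terminal, trials=100000):
        # Toy counting model: draw i.i.d. rewards for every state and check whether the
        # best-rewarded state (the one this toy optimal agent heads for) is recurrent.
        hits = 0
        for _ in range(trials):
            rewards = [random.random() for _ in range(n_recurrent + n_terminal)]
            best = max(range(len(rewards)), key=rewards.__getitem__)
            hits += best < n_recurrent  # states 0..n_recurrent-1 are the recurrent ones
        return hits / trials

    print(frac_preferring_recurrent(10, 1))     # ~0.91 with a single terminal state
    print(frac_preferring_recurrent(10, 1000))  # ~0.01 after 1000x-ing the terminal states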
Anyways, as the comments from TurnTrout talk about, as soon as there’s a nontrivial inductive bias over these different reward functions (or any other path-dependence-y stuff that deviates from optimality), the theorem doesn’t go through, as retargetability is all based on counting how many of the functions in that set are A-preferring vs. B-preferring. There may be an adaptation of the argument that uses some prior over generalizations, though—but then that prior is the inductive bias, which, as you noted with those TurnTrout remarks, is its own whole big problem :’)
I’ll try and add a concise caveat to your doc, thanks for the discussion :)