Hi, thanks for the response :) I'm not sure what distinction you're making between utility and reward functions; as far as I can tell we're referring to the same object, namely the thing that is changed in the 'retargeting' process, the parameters theta. Feel free to correct me if the paper distinguishes between these in a way I'm forgetting; in the meantime I'll be using "utility function", "reward function", and "parameters theta" interchangeably.
For me, utility functions are about decision-making (e.g. utility maximization), while reward functions are the theta: the input to our decision-making, the thing we retarget over, which we can only do for retargetable utility functions.
I think perhaps we're just calling different objects "agents". I mean p(__ | theta) for some fixed theta (i.e. you can't swap the theta and still call it the same agent, on the grounds that in the modern RL framework we'd probably have to retrain a new agent using the same higher-level learning process), whereas you perhaps think of theta as an input to the agent, which can be changed without changing the agent. If that's the definition you're using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first reading would suggest, and is thus worth discussing.
My point is that the nominal thrust of the theorems is weaker than proving that a given agent will likely seek power; they prove that selecting from the ensemble of agents in this way will yield agents that seek power.
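To make the ensemble-level reading concrete, here's a toy sketch of the counting argument (my own construction; the option sizes and uniform sampling are illustrative assumptions, not the paper's formal setup):

```python
import random

random.seed(0)

# Toy version of the counting argument behind the retargetability theorems:
# option A leads to many reachable outcome states ("power"), option B to few.
A_STATES = list(range(10))      # 10 outcomes reachable via option A
B_STATES = list(range(10, 12))  # 2 outcomes reachable via option B

def prefers_A(reward):
    """An optimal decision-maker picks the option containing its best outcome."""
    best = max(reward, key=reward.get)
    return best in A_STATES

# Sample reward functions ("theta") uniformly and count how many prefer A.
n_trials = 100_000
count_A = sum(
    prefers_A({s: random.random() for s in A_STATES + B_STATES})
    for _ in range(n_trials)
)

# Most sampled reward functions prefer A, roughly in proportion 10:2 -- a
# claim about the ensemble of reward functions, not about any fixed agent.
print(f"fraction preferring A: {count_A / n_trials:.3f}")  # ~0.833
```

The conclusion is about the measure over thetas; any single fixed theta either prefers A or it doesn't.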
I agree with this if we constrain ourselves to Turner’s work.
That said, the stronger view that individual trained agents will likely seek power isn't without support even given these caveats: V. Krakovna's work (which you also list) does seem to point more directly at particular agents seeking power, since it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than via the overall reward-function selection, even though this still isn't an especially realistic model of generalization, since it still depends on the option-variegation.
V. Krakovna's work still depends on the option-variegation, but we're no longer picking random reward functions, which is a nice improvement.
I expect that if the universe of possible reward functions doesn’t scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.
Does the proof really depend on whether the reward function scales with the number of possible states? It seems to me that you just need some input to the reward function that the agent hasn't seen during training, so that we can retarget by swapping the rewards. For example, if our reward function is a CNN, we just need images that weren't seen during training, which I don't think is a strong assumption, since we're usually not training over all possible combinations of pixels. Do you agree with this?
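Here's a minimal toy sketch of what I mean by swapping rewards on unseen inputs (the image names and reward values are made up for illustration):

```python
# Rewards on inputs fixed by the training data vs. rewards on novel inputs.
seen = {"img_a": 0.2, "img_b": 0.5}    # behavior here is pinned down by training
unseen = {"img_x": 0.9, "img_y": 0.1}  # novel pixel combinations

def retarget(unseen_rewards):
    # Swap the rewards assigned to two unseen inputs; the reward function is
    # untouched on everything the agent saw during training.
    return {"img_x": unseen_rewards["img_y"], "img_y": unseen_rewards["img_x"]}

swapped = retarget(unseen)

# The globally preferred outcome flips from img_x to img_y, even though
# nothing changed on the training set.
original_best = max({**seen, **unseen}, key={**seen, **unseen}.get)
swapped_best = max({**seen, **swapped}, key={**seen, **swapped}.get)
print(original_best, swapped_best)  # img_x img_y
```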
If you have concrete suggestions you'd like me to incorporate, you can click the edit button on the article and leave a comment on the underlying Google Doc; I'd appreciate it :)
I think we agree modulo terminology with respect to your remarks, up to the part about the Krakovna paper, which I had to sit and think about a bit more.
For the Krakovna paper, you're right that it has a different flavor than I remembered. It still seems, though, that the proof relies on the ratio of recurrent to non-recurrent states: if you did something like 1000x-ing the number of terminal states, the reward function would be 1000x less retargetable toward recurrent states. I think this remains true even if the new terminal states are entirely unreachable?
With respect to the CNN example I agree, at least at a high level. Technically the theta reward vectors are supposed to live in R^|S| and specify a reward for each state, which is slightly different from being the weights of a CNN, but without redoing the math it's plausible that an analogous theorem would hold. Regardless, the non-shutdown result gives retargetability because it assumes there's a single terminal state and many recurrent states: the retargetability is really just the ratio (number of recurrent states) / (number of terminal states), which needn't be greater than one.
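A quick numerical sketch of that ratio intuition (my own toy setup, sampling reward vectors uniformly over states):

```python
import random

random.seed(0)

def frac_preferring_recurrent(n_recurrent, n_terminal, trials=10_000):
    """Fraction of uniformly sampled reward vectors whose best state is recurrent."""
    hits = 0
    for _ in range(trials):
        recurrent = [random.random() for _ in range(n_recurrent)]
        terminal = [random.random() for _ in range(n_terminal)]
        hits += max(recurrent) > max(terminal)
    return hits / trials

# One terminal state and many recurrent states: almost all reward vectors
# prefer a recurrent (non-shutdown) state ...
f_small = frac_preferring_recurrent(100, 1)     # ~0.99
# ... but 1000x the terminal states and the effect reverses, even though the
# new terminal states could be entirely unreachable in practice.
f_large = frac_preferring_recurrent(100, 1000)  # ~0.09
print(f_small, f_large)
```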
Anyway, as TurnTrout's comments discuss, as soon as there's a nontrivial inductive bias over these different reward functions (or any other path-dependent deviation from optimality), the theorem doesn't go through, since retargetability is entirely based on counting how many of the functions in that set are A-preferring vs. B-preferring. There may be an adaptation of the argument that uses some prior over generalizations, but then that prior is the inductive bias, which, as you noted with those TurnTrout remarks, is its own whole big problem :')
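To illustrate how a nontrivial prior breaks the counting argument, here's a toy sketch (the bias factor and option sizes are made-up assumptions, not anything from the paper):

```python
import random

random.seed(0)

# By counting, option A "wins": 10 A-outcomes vs. 2 B-outcomes. But the
# counting argument weights reward functions uniformly; a nontrivial prior
# (inductive bias) over reward functions can flip the conclusion.
A_STATES = list(range(10))
B_STATES = list(range(10, 12))

def sample_reward(bias_toward_B):
    # bias_toward_B scales rewards on B-outcomes: a crude stand-in for an
    # inductive bias favoring B-preferring reward functions.
    r = {s: random.random() for s in A_STATES}
    r.update({s: random.random() * bias_toward_B for s in B_STATES})
    return r

def frac_A(bias, trials=50_000):
    hits = 0
    for _ in range(trials):
        r = sample_reward(bias)
        hits += max(r, key=r.get) in A_STATES
    return hits / trials

uniform_frac = frac_A(1.0)   # uniform prior: A wins by counting (~0.83)
biased_frac = frac_A(20.0)   # biased prior: A almost never wins
print(uniform_frac, biased_frac)
```

The counts of A-preferring vs. B-preferring functions are identical in both cases; only the measure over them changed.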
I’ll try and add a concise caveat to your doc, thanks for the discussion :)
I think the theta is not a property of the agent but of the training procedure. Actually, "Parametrically retargetable decision-makers tend to seek power" is not about trained agents at all, so I'd say we were never talking about different agents in the first place.
Maybe it's also useless to discuss this...