I think we agree, modulo terminology, on your remarks up to the part about the Krakovna paper, which I had to sit and think about a bit more.
For the Krakovna paper, you're right that it has a different flavor than I remembered. Still, it seems the proof relies on having some ratio of recurrent vs. non-recurrent states: if you did something like 1000x the number of terminal states, the reward function becomes roughly 1000x less retargetable toward recurrent states. I think this stays true even if the new terminal states are entirely unreachable?
With respect to the CNN example I agree, at least at a high level, though technically the θ reward vectors are supposed to live in ℝ^{|S|} and specify a reward for each state, which is slightly different from being the weights of a CNN; without redoing the math, it's plausible that an analogous theorem would hold. Regardless, the non-shutdown result gives retargetability because it assumes there's a single terminal state and many recurrent states. The retargetability is really just the ratio (number of recurrent states) / (number of terminal states), which needn't be greater than one.
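To make the counting I have in mind concrete (this is just my rough picture of how the bound scales, not the paper's actual theorem statement): if rewards are assigned symmetrically across states, the fraction of reward functions whose optimal policy ends up in a recurrent state looks something like

$$\Pr[\text{optimal policy reaches a recurrent state}] \approx \frac{n_{\mathrm{rec}}}{n_{\mathrm{rec}} + n_{\mathrm{term}}},$$

so with $n_{\mathrm{rec}} = 100$, going from $n_{\mathrm{term}} = 1$ to $n_{\mathrm{term}} = 1000$ drops that fraction from about 0.99 to about 0.09, and the odds of recurrent-preferring vs. terminal-preferring reward functions, $n_{\mathrm{rec}} / n_{\mathrm{term}}$, fall by exactly that factor of 1000.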
Anyways, as the comments from TurnTrout discuss, as soon as there's a nontrivial inductive bias over these different reward functions (or any other path-dependence-y stuff that deviates from optimality), the theorem doesn't go through, since retargetability is based entirely on counting how many of the functions in that set are A-preferring vs. B-preferring. There may be an adaptation of the argument that uses some prior over generalizations and such, but then that prior is the inductive bias, which, as you noted with those TurnTrout remarks, is its own whole big problem :')
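As a toy illustration of that counting-vs.-prior point (a completely made-up mini-setup, not the formalism from the paper): with one terminal state and three recurrent states, pure counting over permuted reward vectors favors recurrent states 3:1, but the same set of reward functions under a non-uniform prior can favor the terminal state instead.

```python
# Toy illustration only: "reward functions" are the permutations of one fixed
# reward vector over 1 terminal state and 3 recurrent states (made-up setup,
# not the paper's formalism).
from itertools import permutations

states = ["terminal", "rec_1", "rec_2", "rec_3"]
reward_vectors = list(permutations((1.0, 0.0, 0.0, 0.0)))

def prefers_recurrent(r):
    # Does the max-reward state happen to be a recurrent one?
    return states[max(range(4), key=lambda i: r[i])] != "terminal"

# Pure counting: 3/4 of the permutations put the max reward on a recurrent state.
uniform = sum(map(prefers_recurrent, reward_vectors)) / len(reward_vectors)

# Same reward functions, but with a (made-up) non-uniform prior that up-weights
# terminal-preferring ones: the conclusion is no longer a counting fact.
weights = [1.0 if prefers_recurrent(r) else 10.0 for r in reward_vectors]
biased = sum(w for w, r in zip(weights, reward_vectors) if prefers_recurrent(r)) / sum(weights)

print(uniform)          # 0.75
print(round(biased, 2)) # 0.23
```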
I’ll try and add a concise caveat to your doc, thanks for the discussion :)