I’ve definitely heard people talking about policy gradient as “upweighting trajectories with positive reward/downweighting trajectories with negative reward” ever since I entered the community in 2016, albeit in person. I remember being shown a picture sometime in 2016/17, when someone (maybe Paul?) was explaining REINFORCE to me, that looked something like this: (I couldn’t find it, so I’m reconstructing it from memory)
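To make that picture concrete, here is a minimal sketch of REINFORCE on a two-armed bandit, assuming a softmax policy and made-up rewards of +1 and −1 (the setup and numbers are illustrative, not from the original picture):

```python
# Minimal REINFORCE sketch: reward acts as a per-trajectory weight on the
# log-probability gradient. Two-armed bandit with a softmax policy
# (illustrative setup, not from the original picture).
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)              # policy parameters
rewards = np.array([+1.0, -1.0])  # assumed: action 0 rewarded, action 1 punished
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(500):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)
    # gradient of log pi(a) with respect to the logits: one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    # reward enters only as a multiplier: positive reward upweights the
    # sampled action, negative reward downweights it
    logits += lr * rewards[a] * grad_logp

print(softmax(logits))  # probability mass has shifted toward action 0
```

Reward never appears in the update except as a multiplier on the gradient, which is the sense in which trajectories get upweighted or downweighted.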
Knowing how to reason about “upweighting trajectories” when explicitly prompted or in narrow contexts of algorithmic implementation is not sufficient to conclude “people basically knew this perspective” (but it’s certainly evidence). See Outside the Laboratory:
Now suppose we discover that a Ph.D. economist buys a lottery ticket every week. We have to ask ourselves: Does this person really understand expected utility, on a gut level? Or have they just been trained to perform certain algebra tricks?
Knowing “vanilla PG upweights trajectories”, and being able to explain the math: this is not enough to save someone from the rampant reward confusions. Certainly Yoshua Bengio could explain vanilla PG, and yet (IIRC) he goes on about how RL almost certainly trains reward maximizers.
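For concreteness, the math in question is presumably the standard vanilla policy-gradient identity (a textbook statement, not a quote from Bengio):

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right],$$

so a single sampled update is $\Delta\theta \propto R(\tau)\,\nabla_\theta \log \pi_\theta(\tau)$: trajectories with positive return are made more probable, and trajectories with negative return less probable.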
I contend these confusions were not due to a lack of exposure to the “rewards as weighting trajectories” perspective.
I personally disagree—although I think your list of alternative explanations is reasonable. If alignment theorists had been using this (simple and obvious-in-retrospect) “reward chisels circuits into the network” perspective, if they had really been using it and felt it deep within their bones, I think they would not have been particularly tempted by this family of mistakes.
What’s the difference between “Alice is falling victim to confusions/reasoning mistakes about X” and “Alice disagrees with me about X”?
I feel like using the former puts undue social pressure on observers to conclude that you’re right, and makes it less likely they correctly adjudicate between the perspectives.
(Perhaps you can empathise with me here, since arguably certain people taking this sort of tone is one of the reasons AI x-risk arguments have not always been vetted as carefully as they should have been!)
What’s the difference between “Alice is falling victim to confusions/reasoning mistakes about X” and “Alice disagrees with me about X”?
I suspect that, for Alex Turner, writing the former instead of the latter is a signal that he thinks he has identified the specific confusion/reasoning mistake his interlocutor is engaged in, likely as a result of having seen closely analogous arguments in the past from other people who turned out (or even admitted) to be confused about these matters after conversations with him.
Do you have a reference to the problematic argument that Yoshua Bengio makes?