I do in fact think that few people have actually deeply internalized the points I’m making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.
“Risks from Learned Optimization in Advanced Machine Learning Systems,” which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function. The alignment research community has been heavily engaging with this idea since then. Though I agree that many alignment researchers used to be making this mistake, I think it’s extremely clear that by this point most serious alignment researchers understand the distinction.
I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent. Does an “amplified” reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?
which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function.
That isn’t the main point I had in mind. See my comment to Chris here.
Left a comment.
Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?
Nope, just wanted to draw your attention to another instance of alignment researchers already understanding this point.
Also, I want to be clear that I like this post a lot and I’m glad you wrote it—I think it’s good to explain this sort of thing more, especially in different ways that are likely to click for different people. I just think your specific claim that most alignment researchers don’t understand this already is false.
I just think your specific claim that most alignment researchers don’t understand this already is false.
I have privately corresponded with a senior researcher who, when asked what they thought would result from a specific training scenario, made an explicit (and acknowledged) mistake along the lines of this post. Another respected researcher seemingly slipped on the same point, some time after already discussing this post with them. I am still not sure whether I’m on the same page with Paul, as well (I have general trouble understanding what he believes, though). And Rohin also has this experience of explaining the points in OP on a regular basis. All this among many other private communication events I’ve experienced.
(Out of everyone I would expect to already have understood this post, I think you and Rohin would be at the top of the list.)
So basically, the above screens off “Who said what in past posts?”, because whoever said whatever, it’s still producing my weekly experiences of explaining the points in this post. I still haven’t seen the antecedent-computation-reinforcement (ACR) emphasis thoroughly explained elsewhere, although I agree that some important bits (like training stories) are not novel to this post. (The point isn’t so much “What do I get credit for?” as “I am concerned about this situation.”)
Here’s more speculation. I think alignment theorists mostly reason via selection-level arguments. While they might answer correctly on “Is reward the optimization target?” when pressed, and implicitly use ACR to reason about what’s going on in their ML training runs, I’d guess that they probably don’t engage in mechanistic ACR reasoning in their day-to-day theorizing. (Again, I can only speculate, because I am not a mind-reader, but I do still have beliefs on the matter.)
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)
This is precisely the point I make in “How do we become confident in the safety of a machine learning system?” btw.