At this point I think there are a number of potential replies from people who still insist that the LW models of AI alignment were never wrong, which I (depending on the speaker) think can often border on gaslighting:
This is one of the main reasons I’m not excited about engaging with LessWrong. Why bother? It feels like nothing I say will matter. Apparently, no pre-takeoff experiments matter to some folk.[1] And even if I successfully dismantle some philosophical argument, there’s a good chance they will use another argument to support their beliefs instead. Nothing changes.
So there we are. It doesn’t matter what my experiments say, because (it is claimed) there are no testable predictions before The End. But also, everyone important already knew in advance that it’d be easy to get GPT-4 to interpret and execute your value-laden requests in a human-reasonable fashion. Even though ~no one said so ahead of time.
When talking with pre-2020 alignment folks about these issues, I feel gaslit quite often. You have no idea how many times I’ve been told things like “most people already understood that reward is not the optimization target”[2] and “maybe you had a lesson you needed to learn, but I feel like I got this in 2018”, and so on. Almost always this comes from people who seem to still not understand what I’m talking about. I feel fine if[3] they disagree with me about specific ideas, but what really bothers me is the revisionism. It’s so annoying.
Like, just look at this quote from the post you mentioned:
Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out.
And you probably didn’t even select that post for this particular misunderstanding. (EDIT: Note that I am not accusing Rohin of gaslighting on this topic, and I also think he already understood the “reward is not the optimization target” point when he wrote the above sentence. My critique was that the statement is false and would probably lead readers to incorrect beliefs about the purpose of reward in RL.)
I feel a lot of disappointment and sadness. In 2018, I came to this website when I really needed a new way to understand the world. I’d made a lot of epistemic mistakes I wasn’t proud of, and I didn’t want to live that way anymore. I wanted to think more clearly. I wanted it so, so badly. I came to rely and depend on this place and the fellow users. I looked up to and admired a bunch of them (and I still do so for a few).
But the things you mention—the revisionism, the unfalsifiability, the apparent gaslighting? We set out to do better than science. I think we often do worse.
As a general principle, truths are entangled with each other in intricate and manifold ways. It’s OK if a theory’s most extreme prediction (e.g. extinction from AI) is not testable at the current moment, but it is a highly suspicious state of affairs if a theory yields no other testable predictions. There are generally many clever ways to test a theory, given the necessary will and curiosity.
I could give more concrete examples, but that feels indecorous.
Sometimes I instead get pushback like “it seems to me like I’ve grasped the insights you’re trying to communicate, but I totally acknowledge that I might just not be seeing what you’re saying yet.” I respect and appreciate that response. It communicates the other person’s true perception (that they already understand) while not invalidating or assuming away my perspective.
I get why you feel that way. I think there are a lot of us on LessWrong who are less vocal and more open-minded, and less aligned with either optimistic network thinkers or pessimistic agent foundations thinkers. People newer to the discussion and otherwise less polarized are listening and changing their minds in large or small ways.
I’m sorry you’re feeling so pessimistic about LessWrong. I think there is a breakdown in communication happening between the old guard and the new guard you exemplify. I don’t think that’s a product of venue, but of the sheer difficulty of the discussion, and of polarization between different viewpoints on alignment.
I think maintaining a good community falls on all of us. Formats and mods can help, but communities set their own standards.
I’m very, very interested to see a more thorough dialogue between you and similar thinkers, and MIRI-type thinkers. I think right now both sides feel frustrated that they’re not listened to and understood better.
Like, just look at this quote from the post you mentioned:
Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out.
And you probably didn’t even select that post for this particular misunderstanding.
(Presumably you are talking about how reward is not the optimization target.)
While I agree that the statement is not literally true, I am still basically on board with that sentence and think it’s a reasonable shorthand for the true thing.
I expect that I understood the “reward is not the optimization target” point at the time of writing that post (though of course predicting what your ~5-years-ago self knew is quite challenging without specific quotes to refer to).
I am confident I understood the point by the time I was working on the goal misgeneralization project (late 2021), since almost every example we created involved predicting ahead of time a specific way in which reward would fail to be the optimization target.
(I didn’t follow this argument at the time, so I might be missing key context.)
The blog post “Reward is not the optimization target” gives the following summary of its thesis,
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
I hope it doesn’t come across as revisionist to Alex, but I felt like both of these points were made by people at least as early as 2019, after the Mesa-Optimization sequence came out in mid-2019. As evidence, I’ll point to my post from December 2019 that was partially based on a conversation with Rohin, who seemed to agree with me,
consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since “go to the nearest key” is a good proxy for getting the reward, the neural network, given the board state, simply returns the action that results in the agent getting closer to the nearest key.
Is the feedforward neural network optimizing anything here? Hardly; it’s just applying a heuristic. Note that you don’t need to do anything like an internal A* search to find keys in a maze, because in many environments, following a wall until the key is within sight, and then performing a very shallow search (which doesn’t have to be explicit), could work fairly well.
I think in this passage I’m making the point that “reward is not the trained agent’s optimization target” quite explicitly, since I’m pointing out that a neural network trained by RL will not necessarily optimize anything at all. In a subsequent post from January 2020 I gave a more explicit example, said this fact doesn’t merely apply to simple neural networks, and then offered my opinion that “it’s inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training”.
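To make the quoted example concrete, here is a minimal sketch of the kind of trained policy being described (a toy Python illustration, not the actual Chests and Keys code): a pure state-to-action mapping that steps toward the nearest key, with no internal search and no reward term anywhere in it.

```python
import numpy as np

# Hypothetical stand-in for a trained "go to the nearest key" policy, for illustration only.
# Given the board state (agent position and key positions), it returns the single grid step
# that reduces the Manhattan distance to the nearest key. No planning, no reward computation.
def nearest_key_policy(agent_pos, key_positions):
    agent = np.asarray(agent_pos)
    keys = np.asarray(key_positions)
    distances = np.abs(keys - agent).sum(axis=1)   # Manhattan distance to each key
    nearest = keys[np.argmin(distances)]           # the closest key
    step = np.clip(nearest - agent, -1, 1)         # move one cell toward it
    return tuple(int(x) for x in step)

# Example: from (2, 2), the policy steps toward the key at (5, 2) rather than (0, 4).
print(nearest_key_policy(agent_pos=(2, 2), key_positions=[(5, 2), (0, 4)]))  # -> (1, 0)
```

How such a heuristic generalizes off-distribution depends on what the heuristic happens to be, not on any reward the network is “trying” to obtain.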
From the comments, and from my memory of conversations at the time, many people disagreed with my framing. They disagreed even when I pointed out that humans don’t seem to be “optimizers” that select for actions that maximize our “reward function”. (I believe the most common response was to deny the premise and say that humans actually are roughly optimizers; another common response was to say that AI is different for some reason.)
However, even though some people disagreed with this framing, not everyone did. As I pointed out, Rohin seemed to agree with me at the time, and so at the very least I think there is credible evidence that this insight was already known to a few people in the community by late 2019.
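For readers newer to this debate, the mechanistic half of the thesis quoted earlier can also be illustrated in a few lines. The following REINFORCE-style sketch is a toy tabular example (not code from any of the posts discussed): reward enters only as a scalar that scales the policy-gradient update during training, reinforcing whatever computation produced the rewarded action, while the resulting policy contains no reward term to optimize at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
logits = np.zeros((n_states, n_actions))       # tabular stand-in for a policy network

def policy(state):
    # The trained artifact: a state -> action distribution. Reward appears nowhere in it.
    z = np.exp(logits[state] - logits[state].max())
    return z / z.sum()

def reinforce_update(state, action, reward, lr=0.1):
    # Reward's only role is as a scalar scaling the gradient: it reinforces ("chisels in")
    # whatever computation produced the rewarded action; it is not a quantity the policy pursues.
    grad_logp = -policy(state)
    grad_logp[action] += 1.0                   # d log pi(action|state) / d logits
    logits[state] += lr * reward * grad_logp

# Toy training loop: action 0 happens to be rewarded in every state.
for _ in range(500):
    s = int(rng.integers(n_states))
    a = int(rng.choice(n_actions, p=policy(s)))
    reinforce_update(s, a, reward=1.0 if a == 0 else 0.0)

print(np.round(logits, 2))  # action 0 has been reinforced; `policy` itself still never references reward
```

This is only a toy; the point it illustrates is the quoted claim that reward acts on the network’s computation during training rather than serving as the trained network’s goal.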
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
I have no stake in this debate, but how is this particular point any different from what Eliezer says when he makes the point about humans not optimizing for IGF? I think the entire mesa-optimization concern is built around this premise, no?
I didn’t mean to imply that you in particular didn’t understand the reward point, and I apologize for not writing my original comment more clearly in that respect. Out of nearly everyone on the site, I am most persuaded that you understood this “back in the day.”
I meant to communicate something like “I think the quoted segment from Rohin and Dmitrii’s post is incorrect and will reliably lead people to false beliefs.”
Thanks for the edit :)
As I mentioned elsewhere (not this website) I don’t agree with “will reliably lead people to false beliefs”, if we’re talking about ML people rather than LW people (as was my audience for that blog post).
I do think that it’s a reasonable hypothesis to have, and I assign it more likelihood than I would have a year ago (in large part from you pushing some ML people on this point, and them not getting it as fast as I would have expected).