I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
That said, I feel confused by a number of your arguments, so I’m working on a reply. Before I post it, I’d be grateful if you could help me make sure I understand your objections, so as to avoid accidentally publishing a long post in response to a position nobody holds.
I currently understand you to be making four main claims:
The system is just doing the totally normal thing “conditioning on observations,” rather than something it makes sense to describe as “giving rise to a separate learning algorithm.”
It is probably not the case that in this system, “learning is implemented in neural activation changes rather than neural weight changes.”
The system does not encode a search algorithm, so it provides “~zero evidence” about e.g. the hypothesis that mesa-optimization is convergently useful, or likely to be a common feature of future systems.
The above facts should be obvious to people familiar with ML.
Does this summary feel like it reasonably characterizes your objections?
I appreciate you writing this, Rohin. I don’t work in ML, or do safety research, and it’s certainly possible I misunderstand how this meta-RL architecture works, or that I misunderstand what’s normal.
Thanks. I know I came off pretty confrontational, sorry about that. I didn’t mean to target you specifically; I really do see this as bad at the community level but fine at the individual level.
I don’t think you’ve exactly captured what I meant, some comments below.
The system is just doing the totally normal thing “conditioning on observations,” rather than something it makes sense to describe as “giving rise to a separate learning algorithm.”
I think it is reasonable to describe it both as “conditioning on observations” and as “giving rise to a separate learning algorithm”.
It is probably not the case that in this system, “learning is implemented in neural activation changes rather than neural weight changes.”
On my interpretation of “learning” in this context, I would agree with that claim (i.e. I agree that learning is implemented in activation changes rather than weight changes via gradient descent). Idk what other people mean by “learning” though.
The system does not encode a search algorithm, so it provides “~zero evidence” about e.g. the hypothesis that mesa-optimization is convergently useful, or likely to be a common feature of future systems.
This sounds roughly right if you use the words as I mean them, but I suspect you aren’t using the words as I mean them.
There’s this thing where the mesa-optimization paper talks about a neural net that performs “search” via activation changes. When I read the paper, I took this to be an illustrative example that was meant to stand in for “learning” more broadly, but that made it more concrete and easier to reason about. (I didn’t think this consciously.) However, whenever I talk to people about this paper, they have different understandings of what is meant by “search”, and varying opinions on how much mesa optimization should be tied to “search”. But I think the typical opinion is that whether or not mesa optimization is happening depends on what algorithm the neural net weights encode, and you can’t deduce whether mesa optimization is happening just by looking at the behavior in the training environment, as it may just have “memorized” what good behavior is rather than “performing search”.
If you use this meaning of “search algorithm”, then you can’t tell whether a good policy is a “search algorithm” or not just by looking at behavior. Since this paper only talked about behavior of a good policy, it can’t be evidence in favor of “mesa-optimization-via-search-algorithm”.
The above facts should be obvious to people familiar with ML.
Oh, definitely not those; most people in ML have never heard of “mesa optimization”.
----
I think my response to Vaniver better illustrates my concerns, but let me take a stab at making a simple list of claims.
1. The optimal policy in the bandit environment considered in the paper requires keeping track of the rewards you have gotten in the past, and basing your future decisions on this information (see the sketch after this list).
2. You shouldn’t be surprised when applying an RL algorithm to a problem leads to a near-optimal policy for that problem. (This has many caveats, but they aren’t particularly relevant.)
3. Therefore, you shouldn’t be surprised by the results in this paper.
4. Therefore, you shouldn’t be updating based on this paper.
5. Claims 1 and 2 require only basic knowledge about RL.
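(To make claim 1 concrete: here’s a toy hand-written bandit policy. This is my own sketch, not anything from the paper; the point is just that any near-optimal policy has to carry some statistic of past rewards, the way this one does, and condition its next choice on it.)

```python
import random

# Toy sketch (mine, not from the paper): a hand-coded Thompson-sampling
# policy for a Bernoulli bandit. The relevant feature is that it maintains
# statistics of past rewards and bases future decisions on them.

class BetaBernoulliBandit:
    def __init__(self, n_arms):
        self.successes = [0] * n_arms   # reward history, per arm
        self.failures = [0] * n_arms

    def act(self):
        # Sample a plausible success rate for each arm from its posterior,
        # then pull the arm with the highest sample.
        samples = [random.betavariate(s + 1, f + 1)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm, reward):
        # This is the "keeping track" in claim 1: what the policy does next
        # depends on the rewards it has observed so far.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```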
I feel confused about why, on this model, the researchers were surprised that this occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described. Above, you mentioned the hypothesis that maybe they just weren’t very familiar with AI. But looking at the author list, and their publications (e.g. 1, 2, 3, 4, 5, 6, 7, 8), this seems implausible to me. Most of the co-authors are neuroscientists by training, but a few have CS degrees, and all but one have co-authored previous ML papers. It’s hard for me to imagine their surprise was due to them lacking basic knowledge about RL?
Also, this OpenAI paper (whose authors seem quite familiar with ML)—which the summary of Wang et al. on DeepMind’s website describes as “closely related work,” and which appears to me to involve a very similar setup—describes their result similarly:
We structure the agent as a recurrent neural network, which receives past rewards, actions, and termination flags as inputs in addition to the normally received observations. Furthermore, its internal state is preserved across episodes, so that it has the capacity to perform learning in its own hidden activations. The learned agent thus also acts as the learning algorithm, and can adapt to the task at hand when deployed.
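(For concreteness, here is a rough sketch, entirely my own and with made-up names and shapes, of the input/recurrence structure that passage describes; it is not code from either paper.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of the setup the quote describes: the recurrent agent
# receives the usual observation plus its previous action, previous reward,
# and a termination flag, and its hidden state is carried across episode
# boundaries rather than reset.

class RecurrentAgent(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=48):
        super().__init__()
        self.n_actions = n_actions
        # input = observation + one-hot previous action + prev reward + done flag
        self.core = nn.LSTMCell(obs_dim + n_actions + 2, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, n_actions)

    def step(self, obs, prev_action, prev_reward, done, state):
        x = torch.cat([
            obs,
            F.one_hot(prev_action, self.n_actions).float(),
            prev_reward.unsqueeze(-1),
            done.unsqueeze(-1),
        ], dim=-1)
        h, c = self.core(x, state)      # weights are fixed at deployment;
        logits = self.policy_head(h)    # only (h, c) change as the agent
        return logits, (h, c)           # adapts within and across episodes
```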
As I understand it, the OpenAI authors also think they can gather evidence about the structure of the algorithm simply by looking at its behavior. Given a similar series of experiments (mostly bandit tasks, but also a maze solver), they conclude:
the dynamics of the recurrent network come to implement a learning algorithm entirely separate from the one used to train the network weights… the procedure the recurrent network implements is itself a full-fledged reinforcement learning algorithm, which negotiates the exploration-exploitation tradeoff and improves the agent’s policy based on reward outcomes… this learned RL procedure can differ starkly from the algorithm used to train the network’s weights.
They then run an experiment designed specifically to distinguish whether meta-RL was giving rise to a model-free system, or “a model-based system which learns an internal model of the environment and evaluates the value of actions at the time of decision-making through look-ahead planning,” and suggest the evidence implies the latter. This sounds like a description of search to me—do you think I’m confused?
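(To gesture at what I mean by “search” here: the decision-time procedure they seem to be describing looks, to me, something like the following toy look-ahead sketch. This is my own illustration under assumed model and reward functions, not anything from the paper.)

```python
# Toy sketch (mine): "evaluating the value of actions at the time of
# decision-making through look-ahead planning" as a depth-limited search
# over an assumed environment model. `model` and `reward_fn` are
# hypothetical callables standing in for whatever the agent has learned.

def plan(state, model, reward_fn, actions, depth=3, gamma=0.9):
    """Return the action whose simulated look-ahead value is highest."""
    def value(s, d):
        if d == 0:
            return 0.0
        # Recursively evaluate each action by rolling the model forward.
        return max(reward_fn(s, a) + gamma * value(model(s, a), d - 1)
                   for a in actions)
    return max(actions, key=lambda a: reward_fn(state, a)
               + gamma * value(model(state, a), depth - 1))
```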
I get the impression from your comments that you think it’s naive to describe this result as “learning algorithms spontaneously emerging.” You describe the lack of LW/AF pushback against that description as “a community-wide failure,” and mention updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”
But my impression is that OpenAI describes their similar result in a similar way. Do you think my impression is wrong? Or that e.g. their description is also misleading?
--
I’ve been feeling very confused lately about how people talk about “search,” and have started joking that I’m a search panpsychist. Lots of interesting phenomena look like piles of thermostats when viewed from the wrong angle, and I worry the conventional lens is deceptively narrow.
That said, when I condition on (what I understand to be) the conventional conception, it’s difficult for me to imagine how e.g. the maze-solver described in the OpenAI paper can quickly and reliably locate maze exits, without doing something reasonably describable as searching for them.
And it seems to me that Wang et al. should be taken as evidence that “learning algorithms producing other search-performing learning algorithms” is convergently useful/likely to be a common feature of future systems, even if you don’t think that’s what happened in their paper, as long as you assign decent credence to their underlying model that this is what’s going on in PFC, and that search occurs in PFC.
If the primary difference between the DeepMind and OpenAI meta-RL architectures and the PFC/DA architecture is scale, I think there’s good reason to suspect something much like mesa-optimization will emerge in future meta-RL systems, even if it hasn’t yet. That is, I interpret this result as evidence for the hypothesis that highly competent general-ish learners might tend to exhibit this feature, since (among other reasons) it increased my credence that it is already exhibited by the only existing member of that reference class.
Evan mentions agreeing that this result isn’t new evidence in favor of mesa-optimization. But he also mentions that Risks from Learned Optimization references these two papers, and describes them as “the closest to producing mesa-optimizers of any existing machine learning research.” I feel confused about how to reconcile these two claims. I didn’t realize these papers were mentioned in Risks from Learned Optimization, but if I had, I think I would have been even more inclined to post this/try to ensure people knew about the results, since my (perhaps naive, perhaps not understanding ways this is disanalogous) prior is that the closest existing example to this problem might provide evidence about its nature or likelihood.
I get the impression from your comments that you think it’s naive to describe this result as “learning algorithms spontaneously emerge.”
I think that’s a fine characterization (and I said so in the grandparent comment? Looking back, I said I agreed with the claim that learning is happening via neural net activations, which I guess doesn’t necessarily imply that I think it’s a fine characterization).
You describe the lack of LW/AF pushback against that description as “a community-wide failure,”
I think my original comment didn’t do a great job of phrasing my objection. My actual critique is that the community as a whole seems to be updating strongly on data-that-has-high-probability-if-you-know-basic-RL.
updating as a result toward thinking AF members “automatically believe anything written in a post without checking it.”
That was one of three possible explanations; I don’t have a strong view on which explanation is the primary cause (if any of them are). It’s more like “I observe clearly-to-me irrational behavior, this seems bad, even if I don’t know what’s causing it”. If I had to guess, I’d guess that the explanation is a combination of readers not bothering to check details and those who are checking details not knowing enough to point out that this is expected.
I feel confused about why, given your model of the situation, the researchers were surprised that this phenomenon occurred, and seem to think it was a novel finding that it will inevitably occur given the three conditions described.
Indeed, I am also confused by this, as I noted in the original comment:
I don’t understand why this was surprising to the original researchers
I have a couple of hypotheses, none of which seem particularly likely given that the authors are familiar with AI, so I just won’t speculate. I agree this is evidence against my claim that this would be obvious to RL researchers.
And this OpenAI paper [...] describes their result in similar terms:
Again, I don’t object to the description of this as learning a learning algorithm. I object to updating strongly on this. Note that the paper does not claim their results are surprising—it is written in a style of “we figured out how to make this approach work”. (The DeepMind paper does claim that the results are novel / surprising, but it is targeted at a neuroscience audience, to whom the results may indeed be surprising.)
I’ve been feeling very confused lately about how people talk about “search,” and have started joking that I’m a search panpsychist.
On the search panpsychist view, my position is that if you use deep RL to train an AGI policy, it is definitionally a mesa optimizer. (Like, anything that is “generally intelligent” has the ability to learn quickly, which on the search panpsychist view means that it is a mesa optimizer.) So in this world, “likelihood of mesa optimization via deep RL” is equivalent to “likelihood of AGI via deep RL”, and “likelihood that more general systems trained by deep RL will be mesa optimizers” is ~1 and you ~can’t update on it.