I think I can guess what your disagreements are regarding too narrow a conception of inner alignment/mesa-optimization (that the paper overly focuses on models mechanistically implementing optimization), though I’m not sure what model of AI development it relies on that you don’t think is accurate and would be curious for details there. I’d also be interested in what sorts of worse research topics you think it has tended to encourage (on my view, I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design). Also, for the paper giving people a “but what about mesa-optimization” response, I’m imagining you’re referring to things like this post, though I’d appreciate some clarification there as well.
As a preamble, I should note that I’m putting on my “critical reviewer” hat here. I’m not intentionally being negative—I am reporting my inside-view beliefs on each question—but as a general rule, I expect these to be biased negatively; someone looking at research from the outside doesn’t have the same intuitions for its utility and so will usually inside-view underestimate its value.
These are also all things I’m saying with the benefit of hindsight; I don’t know what I would have said at the time the sequence was published. I’m not trying to be “fair” to the sequence here; that is, I’m not considering what it would have been reasonable to believe at the time.
the paper overly focuses on models mechanistically implementing optimization
Yup, that’s right.
I’m not sure what model of AI development it relies on that you don’t think is accurate
There seems to be an implicit model that, when you do machine learning, you get out a complicated mess of a neural net that is hard to interpret, but that at its core it is still learning something akin to a program, and hence that concepts like “explicit (mechanistic) search algorithm” are reasonable to expect. (Or at least, that this will be true for sufficiently intelligent AI systems.)
I don’t think this model (implicit claim?) is correct. (For comparison, I also don’t think this model would be correct if applied to human cognition.)
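To make the concept concrete, here is a minimal sketch (my own illustration, not something from the sequence or paper) of what it would mean for a learned policy to “mechanistically implement search”: a forward pass that literally enumerates candidate actions, scores each one under an internal objective using an internal model of consequences, and returns the best. The names `world_model` and `mesa_objective` are hypothetical placeholders for whatever internal structures the network would supposedly have learned.

```python
# Hypothetical illustration only: what "explicit (mechanistic) search inside a
# learned policy" would look like if it existed. The implicit claim I'm
# skeptical of is that trained neural nets will contain structure like this,
# rather than a tangle of heuristics that merely behaves competently.

def mesa_optimizer_policy(observation, candidate_actions, world_model, mesa_objective):
    """Toy explicit-search policy: score each candidate action under an
    internal (learned) objective and return the argmax."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted_outcome = world_model(observation, action)  # internal model of consequences
        score = mesa_objective(predicted_outcome)             # internal learned objective
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```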
worse research topics you think it has tended to encourage
A couple of examples:
Attempting to create an example of a learned mechanistic search algorithm (I know of at least one proposal that was trying to do this)
Of your concrete experiments, I don’t expect to learn anything of interest from the first two (they aren’t the sort of thing that would generalize from small environments to large environments); I like the third; the fourth and fifth seem like interesting AI research but I don’t think they’d shed light on mesa-optimization / inner alignment or its solutions.
I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design
I agree with this. Maybe people have gotten more interested in transparency as a result of this paper? That seems plausible.
I’m imagining you’re referring to things like this post,
Actually, not that one. This is more like “why are you working on reward learning—even if you solved it we’d still be worried about mesa-optimization”. Possibly no one believes this, but I often feel like this implication is present. I don’t have any concrete examples at the moment; it’s possible that I’m imagining it where it doesn’t exist, or that this is only a fact about how I interpret other people rather than what they actually believe.