[NB: this is a review of the paper, which I have recently read, not of the post series, which I have not]
For a while before this paper was published, several people in AI alignment had discussed things like mesa-optimization as serious concerns. That said, these concerns had not been published in their most convincing form or in great detail. The two exceptions that I'm aware of are the posts What does the universal prior actually look like? by Paul Christiano, and Optimization daemons on Arbital. However, the first post only discusses the issue in the context of Solomonoff induction, where the dynamics are somewhat different, and the second is short and hard to discover.
I see the value of this paper as taking these concerns, laying out (a) a better (although still imperfectly precise) concretization of what the object of concern is and (b) how it could happen, and putting it all in a discoverable and citable format. By doing so, it moves the discussion forward by giving people something concrete to actually reason and argue about.
I am relatively convinced that mesa-optimization (somewhat more broadly construed than in the paper) is a problem for AI alignment, and I think the arguments in the paper are persuasive enough to be concerning. The weakest argument is in the deceptive alignment section: the paper never really makes clear why mesa-optimizers would have objectives that extend across parameter updates.
As I see it, the two biggest flaws with the paper are:
1. Its heuristic nature. The arguments given do not reach the certainty of proofs, and no experimental evidence is provided. This means that one can have at most provisional confidence that the arguments are correct and that the concerns are real (which is not to imply that certainty is required to warrant concern and further research).
2. Premature formalization. I do not believe that we have a great characterization of optimization, and as adamShimi points out, it's not at all clear that search is the right abstraction to use.
Overall, I see the paper as sketching out a research paradigm that I hope to see fleshed out.