This is unfortunate, no? The AI safety community had this whole thing going with mesa-optimization and whatnot… now you propose to abandon the terminology and shift to this new frame? But what about all the people using the old terminology? Is the old terminology unsalvageable?
To be clear, that’s definitely not what I’m arguing. I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it’s talking about. The problem is just that it’s not general enough to handle all possible ways of training a model using machine learning. Terms like base objective or inner/outer alignment are still great terms for talking about training stories that are trying to train a model to optimize for some specified objective. From “How do we become confident in the safety of a machine learning system?”:
The point of training stories is not to do away with concepts like mesa-optimization, inner alignment, or objective misgeneralization. Rather, the point of training stories is to provide a universal framework in which all of those sorts of concepts can live as discrete subproblems—specific ways in which a training story might go wrong.
I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it’s talking about. The problem is just that it’s not general enough to handle all possible ways of training a model using machine learning.
GPT-3 was trained using self-supervised learning, which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what’s the difference between those and the way GPT-3 was trained?
First, the problem is only with outer/inner alignment—the concept of unintended mesa-optimization is still quite relevant and works just fine.
Second, the problems with applying Risks from Learned Optimization terminology to GPT-3 have nothing to do with the training scenario, the fact that you’re doing unsupervised learning, etc.
Where I think you run into problems is in cases where mesa-optimization is intended in GPT-style training setups: there, inner alignment in the Risks from Learned Optimization sense is usually not the goal. Most of the optimism about large language models rests on the hope that they’ll learn to generalize in particular ways that are better than just learning to optimize for something like cross entropy/predictive accuracy. Thus, just saying “if the model is an optimizer, it won’t just learn to optimize for cross entropy/predictive accuracy/whatever else it was trained on,” while true, is unhelpful, since optimizing for the training objective was never what we wanted in the first place.
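To make that distinction concrete, here is a minimal sketch (a simplified, assumed setup, not GPT-3’s actual training code; the names train_step, model, optimizer, and tokens are my own) of what the base objective in a GPT-style training loop actually specifies: next-token cross entropy and nothing more. Whether the trained model ends up internally optimizing for that loss, or generalizing in some better way, is exactly the question the loss itself leaves open.

```python
import torch.nn.functional as F

def train_step(model, optimizer, tokens):
    # tokens: (batch, seq_len) integer token ids from the training corpus.
    # The base objective is just next-token cross entropy; nothing here
    # specifies what objective, if any, the trained model internally pursues.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                        # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),      # flatten batch and time
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```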
What I like about training stories is that they explicitly ask what sort of model you want to get—rather than assuming that you want something that is optimizing for your training objective—and then ask how likely we are to actually get it (as opposed to some sort of mesa-optimizer, a deceptive model, or anything else).
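As a purely illustrative sketch (the names TrainingStory and gpt3_story are hypothetical, not from the post), the two questions can be separated out explicitly: what model you are hoping to get, and why you expect this training setup to produce it, neither of which is simply “a model that optimizes the training objective.”

```python
from dataclasses import dataclass

@dataclass
class TrainingStory:
    training_objective: str  # what the loss actually rewards
    desired_model: str       # the kind of model we are hoping to get
    rationale: str           # why we expect training to produce that model

gpt3_story = TrainingStory(
    training_objective="minimize next-token cross entropy on web text",
    desired_model="a model that generalizes in some way better than "
                  "literally optimizing predictive accuracy",
    rationale="an argument that large-scale language modeling tends to "
              "induce that sort of generalization",
)
```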