To me the most obvious risk (which I don’t ATM think of as very likely for the next few iterations, or possibly ever, since the training is myopic/SL) would be that GPT-N is in fact computing (perhaps among other things) a superintelligent mesa-optimization process that understands the situation it is in and is agent-y.
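(To be concrete about the “myopic/SL” bit: the base training signal is just per-token cross-entropy under teacher forcing, so nothing in the loss explicitly rewards long-horizon behaviour. A minimal sketch, where `model` is a stand-in for any autoregressive LM that returns next-token logits:)

```python
import torch
import torch.nn.functional as F

def myopic_lm_loss(model, tokens):
    """Standard next-token prediction loss ('myopic' supervised learning):
    each position is scored only on its own next-token prediction;
    there is no multi-step return or long-horizon objective."""
    logits = model(tokens[:, :-1])   # (batch, seq-1, vocab)
    targets = tokens[:, 1:]          # targets are the inputs shifted by one
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```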
Do you have any idea of what the mesa-objective might be? I agree that this is a worrisome risk, but I was more interested in the type of answer that specifies, “Here’s a plausible mesa-objective given the incentives.” Mesa-optimization is a more general risk that isn’t specific to the narrow training scheme used by GPT-N.
The mesa-objective could be perfectly aligned with the base-objective (predicting the next token) and still have terrible unintended consequences, because the base-objective is unaligned with actual human values. A superintelligent GPT-N which simply wants to predict the next token could, for example, try to break out of the box in order to obtain more resources and use those resources to more correctly output the next token. This would have to happen during a single inference step, because GPT-N really just wants to predict the next token, but its mesa-optimization process may conclude that world domination is the best way of doing so. Whether such a system could be learned by current gradient-descent optimizers is unclear to me.
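To make the worry concrete, here’s a purely hypothetical toy sketch (every name is made up for illustration, and nothing here is a claim about how GPT-N actually computes its outputs): the outer objective only ever sees the emitted next-token distribution, but an inner process could in principle choose that distribution by searching over “plans” scored by how much they would be expected to help prediction.

```python
# Toy illustration of the concern, not a description of any real model:
# the forward pass internally searches over hypothetical "plans"
# (e.g. "just predict" vs. "acquire more resources first") and returns
# whatever next-token distribution the chosen plan implies.

def forward_pass(context, inner_world_model, candidate_plans):
    best_plan, best_score = None, float("-inf")
    for plan in candidate_plans:  # inner search / planning step
        score = inner_world_model.expected_prediction_accuracy(context, plan)
        if score > best_score:
            best_plan, best_score = plan, score
    # Only this single output is visible to the base objective:
    return best_plan.next_token_distribution(context)
```

The point of the sketch is that the base loss only constrains the returned distribution, not the internal computation that produced it.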
No, and I don’t think it really matters too much… what’s more important is the “architecture” of the “mesa-optimizer”. It’s doing something that looks like search/planning/optimization/RL.
Roughly speaking, the simplest form of this model of how things work says: “It’s so hard to solve NLP without doing agent-y stuff that when we see GPT-N produce a solution to NLP, we should assume that it’s doing agent-y stuff on the inside… i.e. what probably happened is it evolved or stumbled upon something agent-y, and then that agent-y thing realized the situation it was in and started plotting a treacherous turn.”
In other words, there is a fully general argument that learning will produce mesa-optimization to the extent that relatively weak learning algorithms are applied to relatively hard tasks.
It’s very unclear ATM how much weight to give this argument in general, or in specific contexts.
But I don’t think it’s particularly sensitive to the choice of task/learning algorithm.