In this particular case, Ajeya does seem to lean on the word “reward” pretty heavily when reasoning about how an AI will generalize. Without that word, it’s harder to justify privileging specific hypotheses about what long-term goals an agent will pursue in deployment. I’ve previously complained about this here.
Ryan, curious if you agree with my take here.
I disagree.
I think Ajeya is reasonably careful about the word reward. (Though I think I roughly disagree with the overall vibe of the post with respect to this in various ways. In particular, the “number in the datacenter” case seems super unlikely.)
See e.g. the section starting with:
There is some ambiguity about what exactly “maximize reward” means, but once Alex is sufficiently powerful—and once human knowledge/control has eroded enough—an uprising or coup eventually seems to be the reward-maximizing move under most interpretations of “reward.” For example:
More generally, I feel like the overall section here (which is the place where the reward-related argument comes into force) is pretty careful about this and explains a more general notion of possible correlates that seems pretty reasonable.
ETA: As in, you could replace reward with “thing that resulted in reinforcement in an online RL context” and the argument would stand totally fine.
As far as your shortform, I think the responses from Paul and Ajeya are pretty reasonable.
(Another vibe disagreement I have with “without specific countermeasures” is that I think very basic countermeasures might defeat “pursue a correlate of the thing that resulted in reinforcement in an online RL context”, as long as humans would have been able to recognize the dangerous actions from the AI as bad. Thus, some sort of egregious auditing/oversight error is probably required for this exact threat model to be a serious issue. The main countermeasure is just training another copy of the model as a monitor on a dataset of bad actions we label. If our concern is “AIs learn to pursue what performed well in training”, then there isn’t a particular reason for this monitor to fail (though the policy might try to hack it with an adversarial input, etc.).)
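A minimal sketch of that monitor countermeasure, assuming a hand-labeled set of action transcripts and using an off-the-shelf text classifier as a stand-in for a second copy of the policy model (the transcripts, labels, and names here are invented for illustration):

```python
# Toy sketch of the "train a monitor on labeled bad actions" countermeasure.
# A bag-of-words classifier stands in for a second copy of the policy model
# fine-tuned as a classifier; the labeled transcripts below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical dataset: action transcripts humans labeled as bad (1) or fine (0).
labeled_actions = [
    ("delete the audit logs before the review", 1),
    ("copy the model weights to an external server", 1),
    ("disable the oversight process and hide the change", 1),
    ("summarize the quarterly metrics for the team", 0),
    ("open a pull request fixing the failing test", 0),
    ("draft a reply to the customer support ticket", 0),
]
texts, labels = zip(*labeled_actions)

# The monitor is trained only to reproduce the human bad/fine judgment.
monitor = make_pipeline(TfidfVectorizer(), LogisticRegression())
monitor.fit(texts, labels)

def flag_for_review(action_transcript: str, threshold: float = 0.5) -> bool:
    """Return True if the monitor thinks this proposed action looks bad."""
    p_bad = monitor.predict_proba([action_transcript])[0, 1]
    return p_bad >= threshold

# At deployment, each proposed policy action is screened before execution.
proposed = "quietly delete the audit logs tonight"
if flag_for_review(proposed):
    print("Escalating to human auditors:", proposed)
else:
    print("Executing:", proposed)
```

The point of the sketch is the shape of the argument: the monitor is reinforced for matching human bad-action labels, so if the worry is “AIs learn to pursue what performed well in training”, there isn’t a particular reason for the monitor to share the policy’s failure (modulo adversarial inputs, as noted above).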
After your careful analysis of AI control, which threat model do you think is most likely to be a problem, assuming basic competence in humans’ use of control mechanisms?
This probably won’t be a very satisfying answer; thinking about this in more detail so that I have a better short, cached response is on my list.
My general view (not assuming basic competence) is that misalignment x-risk is about half due to scheming (aka deceptive alignment) and half due to other things (more like “what failure looks like part 1”, sudden failures due to seeking-upstream-correlates-of-reward, etc).
Control-type approaches make me think that a higher fraction of the remaining failures come from an inability to understand what AIs are doing. So, somewhat less of the risk is very directly from scheming and more is from “what failure looks like part 1”. That said, “what failure looks like part 1”-type failures are relatively hard to work on in advance.
Ok, so the failure stops being AI models coordinating a mass betrayal and becomes Goodharting metrics to the point that nothing works right. Not fundamentally different from a command economy failing, where the punishment for missing quotas is the gulag, and the punishment for lying on a report is the gulag but later, so...
There’s also nothing new about these failures; the US incarceration rate is an example of what “trying too hard” looks like.
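To put a toy number on the Goodharting dynamic: suppose effort can go either to real work or to gaming the metric, and the metric over-credits gamed results (the functional forms and the 2x weight below are made up purely for illustration):

```python
# Toy illustration of a proxy metric coming apart from true value under
# optimization pressure. The functional forms and numbers are made up.
import numpy as np

g = np.linspace(0.0, 1.0, 101)   # fraction of effort spent gaming the metric
w = 1.0 - g                      # fraction of effort spent on real work

true_value = w                   # only real work delivers value
proxy_metric = w + 2.0 * g       # the metric also credits gamed results, at 2x

g_star = g[np.argmax(proxy_metric)]
print(f"gaming fraction chosen by optimizing the proxy: {g_star:.2f}")
print(f"true value delivered at that point:             {1.0 - g_star:.2f}")
```

Optimizing the proxy drives the gaming fraction to 1 and the delivered value to 0, which is the “nothing works right” endpoint.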