The following is very general; my future views will likely fall inside the set of views it allows.
I know a lot about extant papers, and I notice some people in alignment throw them around as if they were sufficient evidence to tell you nontrivial things about the far future of ML systems.
To some extent this is true, but much of the time the practice seems badly abused. Papers tell you things about current and past systems, and the conclusions they license about future systems are often not well nailed down. Suppose we have evidence that deep learning image classifiers are very robust to label noise. Which of the following two hypotheses does this provide more evidence for?
1. Deep learning models are good at inference, so if, while performing RLHF on one, you accidentally reward some wrong answers instead of correct ones, you should be fine. This isn’t such a big deal. Therefore we shouldn’t be worried about deception.
2. Deep learning models mostly judge the best hypothesis according to data-independent inductive biases, and are less steerable by, and less sensitive to, subtle distribution shifts than you think. Since “deception” is a simpler hypothesis than “motivated to follow instructions”, they’re likely biased toward deception, or at minimum biased against capturing the entire subtle complexity of human values.
The answer is neither, and also both. Their relative weights, to me, seem to stay the same, but their absolute weights possibly go up. Admittedly by an insignificant amount, but there exist hypotheses that are inconsistent with the data, and these two are at minimum consistent with it.
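To make that weight-shuffling concrete, here is a toy Bayesian update with made-up numbers (the hypothesis names, priors, and likelihoods are my own illustration, not drawn from any paper): when two hypotheses assign roughly the same likelihood to an observation, the observation leaves their relative odds untouched while raising both of their absolute posteriors, because hypotheses that predicted the observation poorly get down-weighted.

```python
# Toy illustration only: every number here is invented for the sake of the example.
priors = {
    "good_at_inference": 0.25,
    "simplicity_biased": 0.25,
    "alternative_A": 0.25,  # stand-ins for hypotheses inconsistent with the data
    "alternative_B": 0.25,
}
# Likelihood each hypothesis assigns to "image classifiers are robust to label noise".
likelihoods = {
    "good_at_inference": 0.8,
    "simplicity_biased": 0.8,
    "alternative_A": 0.1,
    "alternative_B": 0.1,
}

evidence = sum(priors[h] * likelihoods[h] for h in priors)
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Relative weight of the two live hypotheses is unchanged...
print(posteriors["good_at_inference"] / posteriors["simplicity_biased"])  # 1.0, same as the prior ratio
# ...but each one's absolute weight goes up.
print(priors["good_at_inference"], "->", round(posteriors["good_at_inference"], 3))  # 0.25 -> 0.444
```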
In fact, neither of them hugs the actual claims in the paper. It seems pretty plausible that the use of label noise specifically is doing much of the work here. If I imagine a world where this result has no applicability to alignment, the problem I anticipate seeing is that they used label noise, not labels consistently biased in a particular direction. Those two phenomena intuitively and theoretically have very different effects on inverse reinforcement learning; why not on supervised learning too?
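To make that distinction concrete, here is a minimal numpy sketch (with class counts and corruption rates of my own choosing) contrasting the two corruption models: uniform label noise, which leaves the target distribution roughly intact, versus labels consistently biased toward one class, which systematically shifts it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10
true_labels = rng.integers(0, n_classes, size=10_000)
flip = rng.random(true_labels.shape) < 0.3  # corrupt ~30% of labels in either regime

# (a) Uniform label noise: corrupted labels are drawn uniformly at random.
# The true class remains the modal label for every example.
noisy_labels = np.where(flip, rng.integers(0, n_classes, size=true_labels.shape), true_labels)

# (b) Consistently biased labels: corrupted labels all point at one fixed class.
# The target distribution is systematically pulled toward that class.
biased_class = 3
biased_labels = np.where(flip, biased_class, true_labels)

print(np.bincount(noisy_labels, minlength=n_classes) / true_labels.size)   # roughly uniform
print(np.bincount(biased_labels, minlength=n_classes) / true_labels.size)  # large spike at class 3
```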
But the real point is that there are far more inferences and explanations of this kind of behavior than we have possibly enumerated; they are not limited to these two hypotheses. In order to judge empirics, you need a theory that can aggregate a wide range of evidence into justifying particular procedures for producing predictions. Without such a theory, I think it’s wrong to have any kind of specific confidence in anything straightforwardly working the way you expect.
Though of course, confidence is relative.
Personally, I do not see strong reasons to expect AIs will have human values, and I don’t trust even the most rigorous of theories that haven’t made contact with reality, nor the most impressive of experiments that lack a rigorous theory[1], to fix this issue, either directly or indirectly[2]. AIs also seem likely to be the primary and only decision makers in the future. It seems really bad for the primary decision makers of your society to have no conception of, or care for, your morals.
Yes, people have stories about how certain methodologies are empirically proven to make AIs care about your values even if they don’t know them. To those people I point to the first part of this shortform.
Note that this also applies to those who try to use a “complex systems approach” to understand these systems. This reads to me as a “theory-free” approach, just as good as blind empiricism. Complex systems theory is good because it correctly tells us that there are certain systems we don’t yet understand; to my reading, though, it is not an optimistic theory[3]. If I thought this were the only way left, I think I’d try to violate my second crux: that AIs will be the primary and only decision makers in the future. Or else give up on understanding models and start trying to accelerate brain emulation.
[1] Those who don’t bet on reality throwing them curve-balls are going to have a tough time.
[2] Say, via getting models that robustly follow your instructions, or proving theorems about corrigibility or quantilization.
[3] In the sense that it claims it is foolish to expect precise predictions about the systems it is used to study.