Here’s my worry.
If we adopt a little bit of deltonian pessimism (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime.
And then if you look at the extrapolation regime, it’s basically the entire alignment problem squeezed into a smaller space! So I worry that people are going to do the obvious things, get good answers on 90%+ of human questions, and then feel some kind of pressure to write off the remainder as not that important (“we’ve got honest answers 98% of the time, so the alignment problem is like 98% solved”). What I want instead is for people to use language models as a laboratory to keep being ambitious, and to do theory-informed experiments that try to push the envelope in extrapolating human preferences in a human-approved way.
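The interpolation-vs-extrapolation picture here can be made concrete with a toy sketch. This is purely illustrative (the 2-D "latent space", the distance threshold, and all the names are made up for the example, not anyone's actual model of GPT): a "model" that answers by nearest-neighbor lookup can look very reliable near its training data while having no support at all for far-away queries.

```python
# Toy sketch: a "language model" as nearest-neighbor lookup in a latent space.
# Queries far from all training points are in the extrapolation regime,
# where the lookup's answer has no support.
import math

def embed(x):
    # Hypothetical 2-D "latent space": just a hand-rolled feature map here.
    return (x, math.sin(x))

def nearest_neighbor_answer(query, train_inputs, train_answers, max_dist=1.0):
    qz = embed(query)
    dists = [math.dist(qz, embed(x)) for x in train_inputs]
    i = min(range(len(train_inputs)), key=lambda j: dists[j])
    if dists[i] > max_dist:
        return None  # extrapolation regime: no nearby training support
    return train_answers[i]  # interpolation regime: reuse the nearest answer

train_x = [0.0, 1.0, 2.0, 3.0]
train_y = ["a", "b", "c", "d"]
print(nearest_neighbor_answer(1.1, train_x, train_y))   # query near training data
print(nearest_neighbor_answer(50.0, train_x, train_y))  # query far from it
```

On this toy picture, "getting honest answers an impressive percentage of the time" corresponds to most queries landing within `max_dist` of training data, and the worry is precisely about everything the `None` branch stands in for.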
I can think of a few different interpretations of your concern (and am interested to hear if these don’t cover it):
1. There will be insufficient attention paid to robustness.
2. There will be insufficient attention paid to going beyond naive human supervision.
3. The results of the research will be misinterpreted as representing more progress than is warranted.
I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.
There’s certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I’m happy to dig into them further if you’re able to clarify which, if any, of them captures your main concern.
I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation.
Framing it this way suggests one concrete thing I might hope for you to do, which is to create artificial problems for the language model that you think will exercise kinds of robustness and generalization not represented by the problem of fine-tuning GPT (or a BERT-based classifier) to be robust to the teenager distribution.
I think this is included in what I intended by “adversarial training”: we’d try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.
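The loop described here (search for failure-inducing tasks, train on them, repeat until the search comes up empty) can be sketched in miniature. Everything below is a stand-in: a 1-D "task space", a known ground truth playing the role of human judgment about negligent falsehoods, random search playing the role of the adversary, and a single learned threshold playing the role of the model.

```python
# Toy sketch of the adversarial-training loop: find tasks where the model
# answers falsely, add them to the training set with correct labels, retrain,
# and repeat until the adversary stops finding failures.
import random

def ground_truth(x):
    return x > 0.5  # stand-in for "the honest answer" on task x

def train(examples):
    # "Model" = a single threshold, chosen to minimize training errors.
    candidates = [x for x, _ in examples] + [0.0, 1.0]
    def errors(t):
        return sum((x > t) != y for x, y in examples)
    return min(candidates, key=errors)

def model_answer(threshold, x):
    return x > threshold

def adversary(threshold, tries=200):
    # Random search over tasks for ones where the model answers falsely.
    rng = random.Random(0)
    return [x for x in (rng.uniform(0, 1) for _ in range(tries))
            if model_answer(threshold, x) != ground_truth(x)]

examples = [(0.1, False), (0.9, True)]  # sparse initial supervision
threshold = train(examples)
for _ in range(5):
    failures = adversary(threshold)
    if not failures:
        break  # adversary can no longer find failure-inducing tasks
    examples += [(x, ground_truth(x)) for x in failures]
    threshold = train(examples)

print(len(adversary(threshold)))  # failures the adversary still finds
```

Note that this sketch succeeds only because the adversary's search covers the distribution we care about, which is exactly the limitation raised in the reply below: extending the training distribution to cover the failures the search happens to find is different from being robust to a genuinely new distribution.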
Sure—another way of phrasing what I’m saying is that I’m not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.
It would be bad if we built an AI that wasn’t robust on the training distribution, of course, but I think of this as a problem the field of ML is already addressing without any need to look ahead to AGI.