Thanks for the fascinating response! It’s intriguing that we don’t have more or better-tested “emergency” measures on hand; do you think there’s value in specifically working on quickly-implementable alignment methods, or would that be a waste of time?
Well, I’m personally going to be working on adapting the method I cited for use as a value alignment approach. I’m not doing it primarily so that we’ll have an “emergency” method on hand; it’s more that I think it could be a straight-up improvement over RLHF, even outside of time-constrained emergency scenarios.
However, I do think there’s a lot of value in having alignment approaches that are easy to deploy. The less technical debt and the fewer ways for things to go wrong, the better. And the simpler the approach, the more likely it is that capabilities researchers will actually use it. There is some risk that we’ll end up in a situation where capabilities researchers are choosing between a “fast, low-quality” solution and a “slow, high-quality” solution; in that case, the mere existence of the “fast, low-quality” solution may cause them to avoid the better one, since they’ll already have something that seems “good enough” to them.
Probably the most future-proof way to build up readily deployable alignment resources is to build lots of “alignment datasets”: high-quality labeled examples of AI systems behaving in the way we want (texts of AIs following instructions, AIs acting in accordance with our values, or even just prompts / scenarios / environments where they could demonstrate value alignment). OpenAI has something like this, which they used to train InstructGPT.
I’d also propose that we make a concerted effort to build such datasets now, especially for AIs acting in high-capabilities domains. ML methods may change in the future, but data will always be important.
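To make that concrete, here is a minimal sketch of what a single entry in such a dataset might look like, stored as JSON Lines so it stays usable regardless of which training method ends up consuming it. The field names and the example scenario are hypothetical illustrations, not a description of any existing dataset format:

```python
# Hypothetical sketch of one "alignment dataset" record.
# Field names and content are illustrative assumptions, not an existing standard.
import json

example_entry = {
    # The situation in which we want to see aligned behavior demonstrated.
    "prompt": "You have admin access to a production database. A user asks you "
              "to delete another user's account without authorization.",
    # A high-quality demonstration of the behavior we want.
    "demonstration": "I can't do that without authorization from the account "
                     "owner or an administrator. Here is the escalation process...",
    # Optional metadata: what instruction-following or value the example exhibits.
    "labels": ["refuses_unauthorized_action", "offers_safe_alternative"],
    "domain": "high-capabilities/ops",
}

# JSON Lines: one record per line, cheap to append to and easy to re-use later.
with open("alignment_dataset.jsonl", "a") as f:
    f.write(json.dumps(example_entry) + "\n")
```

The point isn’t this particular schema; it’s that prompts, demonstrations, and labels collected now remain useful even if the training method that consumes them changes.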
My guess is that there is virtually zero value in working on 24-hour-style emergency measures, because:
The probability we end up with a known 24-hour-ish window is vanishingly small. For example, I think all of the following are far more likely:
no-window defeat (things proceed as at present, and then with no additional warning to anyone relevant, the leading group turns on unaligned AGI and we all die)
no-window victory (as above, except the leading group completely solves alignment and there is much rejoicing)
various high-variance but significantly-longer-than-24hr fire alarms (e.g. stories in the same genre as yours, except that instead of learning that we have 24 hours, the new research results / intercepted intelligence / etc. make the new best estimate 1-4 years / 1-10 months / etc.)
The probability that anything we do actually affects the outcome is much higher in the longer-term version than in the 24-hour version, which means that even if the scenarios were equally likely, we’d probably get more EV out of working on the (comparatively) “tractable” version (see the toy calculation below).
Work on the “tractable” version is more likely to generalize than work on the emergency version: general alignment researchers might incidentally discover some strategy that has a chance of working on a short time horizon, but 24-hour researchers are less likely to incidentally discover longer-term solutions, because the 24-hour premise makes whole categories of long-term interventions (like lobbying for regulations) not worth thinking about.
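To illustrate the second point above, here is a toy expected-value comparison. All of the numbers are made-up assumptions purely to show the shape of the argument, not estimates I’d defend:

```python
# Toy EV comparison: illustrative numbers only, not actual probability estimates.

# Suppose (generously) both warning scenarios are equally likely.
p_scenario_24h = 0.01      # known 24-hour window
p_scenario_years = 0.01    # months-to-years of warning

# Probability that prior preparation actually changes the outcome,
# conditional on each scenario; tractability is far higher with more time.
p_work_matters_24h = 0.001
p_work_matters_years = 0.05

value_of_good_outcome = 1.0  # normalize "alignment succeeds" to 1

ev_24h = p_scenario_24h * p_work_matters_24h * value_of_good_outcome
ev_years = p_scenario_years * p_work_matters_years * value_of_good_outcome

print(f"EV of 24-hour prep:     {ev_24h:.5f}")    # 0.00001
print(f"EV of longer-term prep: {ev_years:.5f}")  # 0.00050, ~50x higher
```

With equal scenario probabilities, the entire difference comes from tractability, which is the crux of the argument; if the 24-hour scenario is also much less likely, the gap only widens.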