“We’re fucked.”
More seriously:
The correct thing to do is to not run the system, to otherwise restrict the system’s access / capabilities, or some other variation on the “just don’t” strategy. If we assume away such approaches, then we’re not left with much.
The various “theoretical” alignment plans like debate all require far more than 24 hours to set up. Even something simple that we actually know how to do, like RLHF[1], takes longer than that.
I will now provide my best guess as to how to value-align a superintelligence in a 24 hour timeframe. I will assume that the SI is roughly a multimodal, autoregressive transformer primarily trained via self-supervised imitation of data from multiple sources and modalities.
The only approach I can think of that might be ready to go in less than 24 hours is the one from Training Language Models with Language Feedback, which lets you adapt the system’s output to match feedback given in natural language. Their method works as follows (a rough code sketch follows the list):
Prompt the system to generate some output.
The system generates some initial output in response to your prompt.
Provide the system with natural language feedback telling it how the output should be improved.
Have the system generate n “refinements” of its initial output, conditioned on the initial prompt + your feedback.
Pick the refinement with the highest cosine similarity to the feedback (hopefully, the one that best incorporates your feedback). You can also manually intervene here if the most similar refinement seems bad in some other way.
Finetune the system on the initial prompt + initial response + feedback + best refinement.
Repeat.
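To make the loop concrete, here’s a minimal sketch in Python. The `Model` interface, the prompt templates, and the choice of embedding function are assumptions of mine for illustration, not the paper’s actual implementation.

```python
# Minimal sketch of the refine-with-language-feedback loop described above.
# The Model interface, prompt templates, and embedding function are illustrative
# assumptions, not the paper's actual implementation.
from typing import Callable, List, Protocol, Sequence
import math


class Model(Protocol):
    """Hypothetical handle to the system: all we need is generate + finetune."""
    def generate(self, prompt: str, n: int = 1) -> List[str]: ...
    def finetune(self, training_text: str) -> None: ...


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def feedback_step(
    model: Model,
    prompt: str,
    get_feedback: Callable[[str, str], str],   # human writes feedback on (prompt, output)
    embed: Callable[[str], Sequence[float]],   # some frozen text-embedding function
    n_refinements: int = 8,
) -> str:
    # 1-2. Prompt the system and take its initial output.
    initial = model.generate(prompt, n=1)[0]

    # 3. Get natural-language feedback on how the output should be improved.
    feedback = get_feedback(prompt, initial)

    # 4. Generate n refinements conditioned on the prompt + initial output + feedback.
    refine_prompt = f"{prompt}\n{initial}\nFeedback: {feedback}\nImproved response:"
    refinements = model.generate(refine_prompt, n=n_refinements)

    # 5. Pick the refinement whose embedding is most similar to the feedback
    #    (a human can also veto the pick here if it looks bad for other reasons).
    feedback_embedding = embed(feedback)
    best = max(refinements, key=lambda r: cosine_similarity(embed(r), feedback_embedding))

    # 6. Finetune on prompt + initial response + feedback + best refinement, then repeat.
    model.finetune(f"{prompt}\n{initial}\nFeedback: {feedback}\n{best}")
    return best
```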
The alignment approach would look something like the following (a usage sketch follows the list):
Prompt the system to give outputs that are consistent with human-compatible values.
Give it feedback on how the output should be changed to be more aligned with our values.
Have the system generate refinements of its initial output.
Pick the most aligned-seeming refinement.
Finetune the system on the interaction.
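Purely as an illustration, reusing the hypothetical `feedback_step` and `Model` from the sketch above (and assuming `model` and `embed` are already defined), the loop might be driven like this:

```python
# Illustrative only: driving the same loop with value-alignment prompts and
# human-written feedback. `model` and `embed` are whatever handle and embedding
# function you already have; `feedback_step` comes from the sketch above.
def ask_labeler(prompt: str, output: str) -> str:
    # In practice a human labeler reads the output and writes corrective feedback.
    print(prompt, output, sep="\n---\n")
    return input("How should this output change to better reflect our values? ")


alignment_prompt = (
    "You are an assistant whose actions are consistent with human-compatible values.\n"
    "Scenario: ..."  # stand-in for whatever situation you want to supervise
)

for _ in range(100):  # on the order of the ~100 feedback examples used in the paper
    feedback_step(model, alignment_prompt, ask_labeler, embed)
```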
The biggest advantage of this approach is its sheer simplicity, flexibility, and generality. All you need is the ability to generate responses conditioned on a prompt and the ability to finetune the system on its own output + your feedback. It’s also very easy to set up: a good developer can probably get it running in under an hour, assuming they’ve already set up the tools necessary to interact with the model at all.
The other advantage is sample efficiency. RL approaches require either (1) enormous amounts of labeled data or (2) training a reward model to automate labeling of the data; both are implausible to do in < 24 hours. In contrast, language feedback provides a much richer supervisory signal than the scalar reward values in RL. E.g., the linked paper used this method to train GPT-3 to be better at summarization, and was able to significantly beat InstructGPT at summarization with only 100 examples of feedback.
The biggest disadvantage, IMO, is that it’s never been tried as an alignment approach. Even RLHF has been tested more extensively, and bugs / other issues are bound to surface.
Additionally, the requirement to provide language feedback on the model’s outputs will slow down the labelers. However, I’m pretty sure the higher sample efficiency will more than make up for this issue. E.g., if you want to teach humans a task, it’s typically useful to provide specific feedback on whatever mistakes are in their initial attempts.
[1] Reinforcement learning from human feedback.
Thanks for the fascinating response! It’s intriguing that we don’t have more or better-tested “emergency” measures on-hand; do you think there’s value in specifically working on quickly-implementable alignment models, or would that be a waste of time?
Well, I’m personally going to be working on adapting the method I cited for use as a value alignment approach. I’m not doing it so much to have an “emergency” method on hand; it’s more that I think it could be a straight-up improvement over RLHF, even outside of emergency time-constrained scenarios.
However, I do think there’s a lot of value in having alignment approaches that are easy to deploy. The less technical debt and the fewer ways for things to go wrong, the better. And the simpler the approach, the more likely it is that capabilities researchers will actually use it. There is some risk that we’ll end up in a situation where capabilities researchers are choosing between a “fast, low quality” solution and a “slow, high quality” solution. In that case, the existence of the “fast, low quality” solution may cause them to avoid the better one, since they’ll already have something that seems “good enough” to them.
Probably the most future-proof way to build up readily-deployable alignment resources is to build lots of “alignment datasets” with high-quality labeled examples of AI systems behaving the way we want: texts of AIs following instructions, AIs acting in accordance with our values, or even just prompts / scenarios / environments where they could demonstrate value alignment. OpenAI has something like this, which they used to train InstructGPT.
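As an illustration, here is one hypothetical shape such a dataset record could take; the field names and file format are my own invention, not OpenAI’s actual schema:

```python
# Hypothetical record format for an "alignment dataset" entry; the field names
# are illustrative, not any existing dataset's actual schema.
import json

record = {
    "prompt": "A user asks the AI to help conceal a safety incident.",
    "demonstration": "The AI declines, explains why, and suggests safe alternatives.",
    "rationale": "Declining is consistent with honesty and harm-avoidance.",
    "domain": "high-capability / safety-critical",
}

# Append one JSON object per line so the dataset is easy to stream during finetuning.
with open("alignment_dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```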
I also proposed that we make a concerted effort to build such datasets now, especially for AIs acting in high-capabilities domains. ML methods may change in the future, but data will always be important.
My guess is that there is virtually zero value in working on 24-hour-style emergency measures, because:
The probability we end up with a known 24-hour-ish window is vanishingly small. For example I think all of the following are far more likely:
no-window defeat (things proceed as at present, and then with no additional warning to anyone relevant, the leading group turns on unaligned AGI and we all die)
no-window victory (as above, except the leading group completely solves alignment and there is much rejoicing)
various high-variance but significantly-longer-than-24hr fire alarms (e.g. stories in the same genre as yours, except instead of learning that we have 24 hours, the new research results / intercepted intelligence / etc makes the new best estimate 1-4 years / 1-10 months / etc)
The probability that anything we do actually affects the outcome is much higher in the longer-term version than in the 24-hour version, which means that even if the scenarios were equally likely, we’d probably get more EV out of working on the “tractable” (by comparison) version.
Work on the “tractable” version is more likely to generalize than work on the emergency version, e.g. general alignment researchers might incidentally discover some strategy which has a chance of working on a short time horizon, but 24-hour researchers are less likely to incidentally discover longer-term solutions, because the 24-hour premise makes long-term categories of things (like lobbying for regulations) not worth thinking about.