I still don’t feel like I’ve read a convincing case for why GPT-6 would mean certain doom. I can see the danger in prompts like “this is the output of a superintelligence optimising for human happiness:”, but a prompt like “Advanced AI Alignment, by Eliezer Yudkowsky, release date: March 2067, Chapter 1: ” is liable to produce GPT-6’s estimate of a future AI safety textbook. That seems like a ridiculously valuable thing, and one unlikely to contain directly world-destroying knowledge. GPT-6 won’t be directly coding; it will only be outputting what it expects future Eliezer to write in such a textbook. This isn’t quite a pivotal-grade event, but it seems good enough to enable one.
I don’t think the issue is the existence of safe prompts, the issue is proving the non-existence of unsafe prompts. And it’s not at all clear that a GPT-6 that can produce chapters from 2067EliezerSafetyTextbook is not already past the danger threshold.
There would clearly be unsafe prompts for such a model, and it would be a complete disaster to release it publicly, but a small safety-oriented team carefully poking at it in secret, in a closed room without internet, is something different. Such a team can place very harsh safety restrictions on a model like this, especially one as non-agentic as GPT, and I think we have a decent shot at throwing enough of these heuristic restrictions at the model producing the safety textbook that it would not automatically destroy the earth if used carefully.
Sure, but you have essentially no guarantee that such a model would remain contained to that group, or that the insights gleaned from that group could be applied unilaterally across the world before a “bad”* actor reimplemented the model and started asking it unsafe prompts.
Much of the danger here is that once any single lab on earth can make such a model, state actors probably aren’t more than 5 years behind, and likely aren’t more than 1 year behind, given the economic value that an AGI represents.
*“bad” here doesn’t really mean evil in intent, just an actor that is unconcerned with the safety of their prompts, and thus likely to (in Eliezer’s words) end the world.
So first, it is really unclear what you would actually get from GPT-6 in this situation. (As an aside, I tried this with GPT-J and it output an index with some chapter names.) You might just get the rest of your own comment or something similar… Or maybe you get some article about Eliezer’s book, some joke book written now, the actual book but with subtle errors Eliezer might make, a fake article written by whatever AGI GPT-6 predicts would have taken over the world by then… etc.
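For reference, something along these lines is the obvious way to reproduce that aside with the public GPT-J checkpoint via Hugging Face transformers; a minimal sketch, with sampling settings that are illustrative rather than the exact ones used:

```python
# Minimal sketch: prompt the public GPT-J checkpoint with the textbook prompt.
# Sampling settings are illustrative, not the exact ones used for the aside above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = ("Advanced AI Alignment, by Eliezer Yudkowsky, "
          "release date: March 2067, Chapter 1: ")
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation; in practice GPT-J tends to return something like a
# table of contents or chapter headings rather than actual textbook prose.
output_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=200,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```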
Since, in general, GPT-6 would be optimized to predict (on the training distribution) what follows from that kind of text, which is not the same as helpfully responding to prompts (for a current example, Codex outputs bad code when prompted with bad code).
It seems to me like the result depends on unknown things about what really big transformer models do internally which seem really hard to predict.
But for you to get something like what you want from this, GPT-6 needs to be modeling future Eliezer in great detail, complete with lots of thought and interactions. And while GPT-6 could have been optimized into having a very specific human-modeling algorithm that happens to do that, it seems more likely that, before the optimization process finds the complicated algorithm necessary, it gets something simpler and more consequentialist: something that does a more general thinking process to achieve some goal which happens to output the right completions on the training distribution. Which is really dangerous.
And if you instead trained it with human feedback to ensure you get helpful responses (which sounds like exactly the kind of thing people would do if they wanted to actually use GPT-6 to do things like answer questions), it would be even worse, because you are directly optimizing it for human feedback, and it seems clearer there that you are running a search for strategies that make the human-feedback number go up.
I think the issues where GPT-6 avoids actually outputting a serious book are fairly easy to solve. For one, you can annotate every item in the training corpus with a tag containing its provenance (arXiv, the various scientific journals, publishing houses, Reddit, etc.) and its publication date (and maybe some other things, like the number of words), and make these tags available to the network during training. The prompt you give to GPT can then contain the tag for the origin of the text you want it to produce and the date it was produced. This avoids the easy failure mode of GPT-6 outputting my comment or some random blog post, because those will not have been annotated as “official published book” in the training set, nor will they have the tagged word count.
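To make that concrete, here is a rough sketch of how the annotation and the matching prompt could look; the tag format and field names are invented for illustration, not an actual pipeline:

```python
# Rough sketch of the provenance-tagging idea. The tag format and field names
# are invented for illustration; a real pipeline would pick its own scheme.

def tag_document(text: str, source: str, pub_date: str) -> str:
    """Prepend a metadata header (provenance, date, word count) to a training document."""
    header = (
        f"<|source={source}|>"
        f"<|date={pub_date}|>"
        f"<|words={len(text.split())}|>\n"
    )
    return header + text

# During training, every corpus item carries its true metadata:
train_example = tag_document(
    "Attention mechanisms allow a model to ...",
    source="arxiv",
    pub_date="2021-06-15",
)

# At sampling time, the prompt claims the provenance and date you want, so the
# completion is conditioned on "published book, far future" rather than
# "random blog comment":
prompt = (
    "<|source=published_book|><|date=2067-03-01|><|words=150000|>\n"
    "Advanced AI Alignment, by Eliezer Yudkowsky\n"
    "Chapter 1: "
)
```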
GPT-6 predicting an AI takeover of the publishing houses, and therefore producing a malicious AI safety book, is a possibility, but I think most future paths where the world is destroyed by AI don’t involve Elsevier still existing and publishing malicious safety books. And even if this is a possibility, we can just re-sample GPT-6 on this prompt to get a variety of books corresponding to the distribution of future outcomes expected by GPT-6, which are then checked by a team of safety researchers. As with most problems, generating interesting solutions is harder than verifying them; it doesn’t have to be perfect to be ridiculously useful.
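The re-sampling loop itself is trivial; a minimal sketch, where sample_book is a hypothetical stand-in for whatever sampling interface the model actually exposes:

```python
# Minimal sketch of the re-sampling idea: draw many independent completions of
# the same textbook prompt and queue all of them for human review.
# `sample_book` is a hypothetical stand-in for the model's real sampling
# interface, not an actual API.

def sample_book(prompt: str, seed: int) -> str:
    """Hypothetical call into the model; returns one sampled draft."""
    return f"[draft sampled with seed {seed} for prompt {prompt!r}]"

prompt = ("Advanced AI Alignment, by Eliezer Yudkowsky, "
          "release date: March 2067, Chapter 1: ")

# Each draft is a separate draw from the model's distribution over futures;
# the review team compares them against each other and flags drafts that
# disagree or look suspicious.
drafts = [sample_book(prompt, seed=i) for i in range(32)]
review_queue = list(enumerate(drafts))
```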
This general approach of “run GPT-6 in a secret room without internet, patching safety bugs with various heuristics, making it generate AI safety work that is then verified by a team” seems promising to me. You can even do stuff like train GPT-6 on an internal log of the various safety patches the team is working on, then have GPT-6 predict the next patch or possible safety problem. This approach is not safe at extreme levels of AI capability, and some prompts are safer than others, but it doesn’t strike me as “obviously the world ends if someone tries this”.
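As a toy illustration of that last idea, the internal log could be serialized as plain text and handed back to the model as a prompt to continue; the entries and format below are invented:

```python
# Toy illustration: format the team's internal patch log as text and ask the
# model to continue it. The entries, dates, and format are invented.
patch_log = [
    "2041-03-02  PATCH 014: rate-limit queries that mention self-replication",
    "2041-03-05  PATCH 015: strip URLs from all sampled output before review",
    "2041-03-09  PATCH 016: require two reviewers to sign off on code-like output",
]

prompt = (
    "Internal safety patch log (most recent last):\n"
    + "\n".join(patch_log)
    + "\n2041-03-12  PATCH 017:"
)
# Sampling a continuation of `prompt` gives the model's guess at the next
# patch, or at the safety problem the team is about to run into.
```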
If you include something like reviews or quotes praising its accuracy, then you’re moving towards Decision Transformer territory with feedback loops...
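To illustrate the worry: once the prompt asserts an outcome about the book’s quality, you are effectively conditioning on a reward signal, much like a Decision Transformer conditioning on a high return-to-go. The tags and review text below are invented:

```python
# Toy illustration of the Decision Transformer analogy. The tags and the
# review text are invented; the point is the shape of the conditioning.
base_prompt = (
    "<|source=published_book|><|date=2067-03-01|>\n"
    "Advanced AI Alignment, by Eliezer Yudkowsky\n"
)

# Plain conditioning: just provenance and date.
plain_prompt = base_prompt + "Chapter 1: "

# Reward-style conditioning: the prompt now asserts an outcome
# ("reviewers found no errors"), analogous to conditioning on a high return.
reward_conditioned_prompt = (
    base_prompt
    + 'Back-cover review: "Every argument checked out; no reviewer found a single error."\n'
    + "Chapter 1: "
)
```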