I don’t think the issue is the existence of safe prompts, the issue is proving the non-existence of unsafe prompts. And it’s not at all clear that a GPT-6 that can produce chapters from 2067EliezerSafetyTextbook is not already past the danger threshold.
There would clearly be unsafe prompts for such a model, and it would be a complete disaster to release it publicly, but a small safety-oriented team carefully poking at it in secret, in a closed room without internet access, is something different. In general, such a team can place very harsh safety restrictions on a model like this, especially one that isn’t very agentic at all (like GPT), and I think we have a decent shot at throwing enough of these heuristic restrictions at the model producing the safety textbook that it would not automatically destroy the earth if used carefully.
Sure, but you have essentially no guarantee that such a model would remain contained to that group, or that the insights gleaned from it could be applied universally across the world before a “bad”* actor reimplemented the model and started asking it unsafe prompts.
Much of the danger here is that once any single lab on earth can make such a model, state actors probably aren’t more than 5 years behind, and likely aren’t more than 1 year behind, given the economic value that an AGI represents.
*“bad” here doesn’t really mean evil in intent, just an actor who is unconcerned with the safety of their prompts, and thus likely to (in Eliezer’s words) end the world.