Good News, Everyone!
As somebody who’s been watching AI notkilleveryoneism for a very long time, but is sitting at a bit of a remove from the action, I think I may be able to “see the elephant” better than some people on the inside.
I actually believe I see the big players converging toward something of an unrecognized, perhaps unconscious consensus about how to approach the problem. This really came together in my mind when I saw OpenAI’s plugin system for ChatGPT.
I thought I’d summarize what I think are the major points. They’re not all universal; obviously some of them are more established than others.
Because AI misbehavior is likely to come from complicated, emergent sources, any attempt to “design it out” is likely to fail.
Avoid this trap by generating your AI in an automated way using the most opaque, uninterpretable architecture you can devise. If you happen on something that seems to work, don’t ask why; just scale it up.
Overcomplicated criteria for “good” and “bad” behavior will lead to errors in both specification and implementation.
Avoid this by identifying concepts like “safety” and “alignment” with easily measurable behaviors. Examples:
Not saying anything that offends anybody
Not unnerving people
Not handing out widely and easily available factual information from a predefined list of types that could possibly be misused.
Resist the danger of more complicated views. If you do believe you’ll have to accept more complication in the future, avoid acting on that for as long as possible.
In keeping with the strategy of avoiding errors by not manually trying to define the intrinsic behavior of a complex system, enforce these safety and alignment criteria primarily by bashing on the nearly complete system from the outside until you no longer observe very much of the undesired behavior.
Trust the system to implement this adjustment by an appropriate modification to its internal strategies. (LLM post-tuning with RLxF).
As a general rule, build very agenty systems that plan and adapt to various environments. Have them dynamically discover their goals (DeepMind). If you didn’t build an agenty enough system at the beginning, do whatever you can to graft in agenty behavior after the fact (OpenAI).
Make sure your system is crafty enough to avoid being suborned by humans. Teach it to win against them at games of persuasion and deception (Facebook).
Everybody knows that an AI at least as smart as Eliezer Yudkowsky can talk its way out of any sandbox.
Avoid this by actively pushing it out of the sandbox before it gets dangerously smart. You can help the fledgeling AI to explore the world earlier than it otherwise might. Provide easily identifiable, well described, easily understood paths of access to specific external resources with understandable uses and effects. Tie their introduction specifically to your work to add agency to the system. Don’t worry; it will learn to do more with less later.
You can’t do everything yourself, so you should enlist the ingenuity of the Internet to help you provide more channels to outside capabilities. (ChatGPT plugins, maybe a bit o’ Bing)
Make sure to use an architecture that can easily be used to communicate and share capabilities with other AI projects. That way they can all keep an eye on one another. (Plugins again).
Run a stochastic search for the best architecture for alignment by allowing end users to mix and match capabilities for their instances of your AI (Still more plugins).
Remember to guard against others using your AI in ways that trigger any residual unaligned behavior, or making mistakes when they add capability to it.
The best approach is to make sure that they know even less than you do about how it works inside (Increasing secrecy everywhere). Also, make sure you identify and pre-approve everybody so you can exclude undesirables.
Undesirables can be anywhere! Make sure to maintain unity of purpose in your organization by removing anybody who might hinder any part of this approach (Microsoft). Move fast to avoid losing momentum.
Oh, and specifically teach it to code, too.
I’ve never been more optimistic...