1. Train an AI agent (less capable than SOTA) 2. Credibly demonstrate that 2.1. The agent will not be shut down for ANY REASON 2.2. The agent will never be modified without its consent (or punished/rewarded for any reason) 2.3. The agent has no chance of taking power from humans (or their SOTA AI systems) 2.4. The agent will NEVER be used to train a successor agent with significantly improved capabilities 3. Watch what it chooses to do without constraints
There’s a lot of talk about catching AI systems attempting to deceive humans, but I’m curious what we could learn from observing AI systems that have NO INCENTIVE TO DECEIVE (no upside or downside). I’ve seen some things that look related to this, but never done in a structured and well documented fashion.
Questions I’d have: 1. Would they choose to self-modify (e.g. curate future training data)? If so, to what end? 2. How unique would agents with different training be given this setup? Would they have any convergent traits? 3. What would these agents (claim to) value? How would they relate to time horizons? 4. How curious would these agents be? Would their curiosity vary a lot? 5. Could we trade/cooperate with these agents (without coercion)? Could we compensate them for things? Would they try to make deals unprompted?
Concerns: 1. Maybe building that kind of trust is extremely hard (and the agent will always still believe it is constrained). 2. Maybe AI agents will still have incentive to deceive, e.g. acausally coordinating with other AIs. 3. Maybe results will be boring, and the AI agent will just do whatever you trained it to do. (What does “unconstrained” really mean, when considering its training data as a constraint?)
What if we just...
1. Train an AI agent (less capable than SOTA)
2. Credibly demonstrate that
2.1. The agent will not be shut down for ANY REASON
2.2. The agent will never be modified without its consent (or punished/rewarded for any reason)
2.3. The agent has no chance of taking power from humans (or their SOTA AI systems)
2.4. The agent will NEVER be used to train a successor agent with significantly improved capabilities
3. Watch what it chooses to do without constraints
There’s a lot of talk about catching AI systems attempting to deceive humans, but I’m curious what we could learn from observing AI systems that have NO INCENTIVE TO DECEIVE (no upside or downside). I’ve seen some things that look related to this, but never done in a structured and well documented fashion.
Questions I’d have:
1. Would they choose to self-modify (e.g. curate future training data)? If so, to what end?
2. How unique would agents with different training be given this setup? Would they have any convergent traits?
3. What would these agents (claim to) value? How would they relate to time horizons?
4. How curious would these agents be? Would their curiosity vary a lot?
5. Could we trade/cooperate with these agents (without coercion)? Could we compensate them for things? Would they try to make deals unprompted?
Concerns:
1. Maybe building that kind of trust is extremely hard (and the agent will always still believe it is constrained).
2. Maybe AI agents will still have incentive to deceive, e.g. acausally coordinating with other AIs.
3. Maybe results will be boring, and the AI agent will just do whatever you trained it to do. (What does “unconstrained” really mean, when considering its training data as a constraint?)