~1 hour’s thoughts, by a total amateur. It doesn’t feel complete, but it’s what I could come up with before each new idea started taking more than 5 minutes of thought. Calibrate accordingly—if your list isn’t significantly better than this, take some serious pause before working on anything AI related.
Things that might, in some combination, lead toward AI corrigibility:
The AI must be built and deployed by people with reasonably correct ethics, or sufficient caution that they don’t intentionally implement a disastrously evil morality.
No amount of AI safety features will help humanity if your goal in building an AI is to kill all humans, make yourself the immortal and total dictator of all humanity, or prevent humans from any action that is not worshipping god.
This is a necessary but very much insufficient condition. Probably true of all of these bullet points, but this one especially.
Similar conditions apply for designer/operator technical ability + understanding of AI safety dangers.
It should be possible to specify hard constraints in the AI’s behavior.
i.e., we should be able to guarantee that an AI will perform within some specification with design-by-contract, not by unit testing.
Without this, most other solutions aren’t even implementable.
It seems unlikely that any kind of gradient descent/ML agent could have this property. We certainly have no idea how it would.
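To make the distinction concrete, here is a minimal Python sketch of what a contract-checked interface might look like, as opposed to unit testing sampled behaviors. The Action type, the constraint predicates, and the planner are all hypothetical placeholders, and note that this is post-hoc filtering of outputs, which is much weaker than the by-construction guarantee this bullet is asking for.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    description: str
    estimated_resource_use: float  # hypothetical scalar summary of the action's footprint

# Hard constraints expressed as predicates that must hold for every proposed action.
# Real design-by-contract would want these checked (ideally proven) over all
# possible outputs, not just filtered at the output boundary as done here.
CONSTRAINTS: List[Callable[[Action], bool]] = [
    lambda a: a.estimated_resource_use < 1.0,      # bounded resource use
    lambda a: "self-modify" not in a.description,  # crude, purely illustrative predicate
]

def propose_action(planner: Callable[[], Action]) -> Action:
    """Return the planner's proposal only if every contract holds; otherwise refuse."""
    action = planner()
    for check in CONSTRAINTS:
        if not check(action):
            raise ValueError("Contract violated; refusing to output this action.")
    return action
```

The gap between this kind of output filter and an actual guarantee is the whole point: a gradient-descent-trained agent gives us no handle for proving that the predicates hold in general.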
The AI should be able to accurately predict the distribution of judgments humans would put on possible outcomes, and should reject any actions for which any large percentage of humanity would dislike the outcome.
For some values of “large” and “dislike” and “humanity”….
Critically, outcomes not actions. The strategies a truly useful AI employs are unlikely to be grokkable, much less ethically evaluatable. End states are more likely to be.
E.g., “execute this Javascript so that I take control of the world’s computing infrastructure, kill all humans, and can start converting matter into more me” is a lot less grokkable than “all matter on the surface of the Earth is now a computer running more of me.”
One failure mode: the AI might not even check the aspects of the outcome that we would actually care about. An AI can’t literally check the ethical value of the entire state of the universe. It will have to consider a relevant subset of the outcome of any particular action. I don’t know how you would guarantee that it considers the ethically-relevant subset.
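A toy sketch of the veto rule, assuming we somehow had a judgment model that returns sampled approval scores for a described outcome; that model, and picking the ethically relevant description of the outcome, are exactly the unsolved parts. All names here are hypothetical.

```python
import random
from typing import Callable, List

def predicted_disapproval_fraction(
    outcome: str,
    judgment_model: Callable[[str], List[float]],  # hypothetical: sampled human approval scores
    dislike_threshold: float = 0.0,
) -> float:
    """Fraction of sampled predicted human judgments that fall below the dislike threshold."""
    judgments = judgment_model(outcome)
    return sum(1 for j in judgments if j < dislike_threshold) / len(judgments)

def veto(
    outcome: str,
    judgment_model: Callable[[str], List[float]],
    max_disapproval: float = 0.05,  # some value of "large"
) -> bool:
    """Reject the action if too large a share of predicted humanity dislikes the outcome."""
    return predicted_disapproval_fraction(outcome, judgment_model) > max_disapproval

# Stand-in judgment model for illustration only: random scores in [-1, 1].
toy_model = lambda outcome: [random.uniform(-1, 1) for _ in range(10_000)]
print(veto("all matter on the surface of the Earth is now a computer", toy_model))
```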
The AI should be required to come to a decision, output a proposed action, and shut down using a fixed maximum amount of computation.
For some value of “fixed”, “maximum”, “amount”, and “computation”.
How does this stop an AI from suggesting a piece of Javascript that would functionally copy a version of itself-minus-computation-limits onto vulnerable hardware? Would need to build in a strong notion of “total computation used through implementation.”
It would be really easy for a well-meaning programmer to confuse “computation used before shutdown” with “total computation used through implementation” and think that consumable limits on an AI’s Slurm job were solving AI safety! DON’T DO THIS.
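A sketch of the fixed-budget-then-shutdown loop, with a hypothetical step function. Note that it only bounds computation used before shutdown, which is precisely the confusion warned about above; it is included to make that distinction concrete, not to suggest it is sufficient.

```python
class ComputeBudgetExceeded(Exception):
    pass

class BudgetedPlanner:
    """Run a planning step function until it returns a proposal or the budget runs out.

    This bounds computation used *before shutdown* only. It says nothing about
    computation a proposed action might trigger after implementation, which is
    the distinction the bullet above warns about.
    """

    def __init__(self, step_fn, max_steps: int):
        self.step_fn = step_fn      # hypothetical: returns a proposal, or None to keep thinking
        self.max_steps = max_steps  # the "fixed maximum amount of computation", crudely

    def run(self):
        for _ in range(self.max_steps):
            proposal = self.step_fn()
            if proposal is not None:
                return proposal     # output the proposed action and stop
        raise ComputeBudgetExceeded("No proposal within the budget; shutting down.")
```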
The AI should be incapable of distinguishing between a test/evaluation environment and the real world.
This way, we can check the AI’s behavior for catastrophic defections with less worry of being manipulated.
The AI should never strictly maximize a utility function. It could be safer to ask for the 90th-percentile-best solution instead of the absolute best solution.
IMPORTANT: This needs to be the 90th-percentile-best solution, NOT the solution with utility that is 90% of the maximum. Extreme solutions could result in extreme utilities. 90% of the way down the potential well of a black hole is still too close to the black hole.
“90th-percentile-best solution” is a hackable reward function, as written in English. 90th percentile compared to what? What counts as different solutions? This still fails if 50% of the solutions the AI thinks up involve killing everybody.
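A toy version of “pick the 90th-percentile-best of the candidates you generated, not 90% of the maximum utility” (roughly in the spirit of quantilization). The candidate generator and utility function are hypothetical, and, as noted above, this inherits whatever is wrong with the candidate pool.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def percentile_best(
    candidates: List[T],
    utility: Callable[[T], float],
    q: float = 0.90,
) -> T:
    """Return the candidate at the q-th percentile of utility, NOT the one at q * max utility."""
    ranked = sorted(candidates, key=utility)
    index = min(int(q * len(ranked)), len(ranked) - 1)
    return ranked[index]

# Toy usage with a made-up utility function. If half the candidate pool were
# catastrophic, this picker would do nothing to help, per the caveat above.
candidates = list(range(100))
print(percentile_best(candidates, utility=lambda x: -abs(x - 50)))
```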
The AI should strongly prefer simple solutions.
This could take (all of?!) a few different forms.
It could look like a “principle of least action”, where it penalizes actions that require doing a lot.
It could look like a “rounding off” of corners in action-space, i.e., the AI stops optimizing if it detects that it’s moving in a direction where its space of allowed solutions is shrinking too fast.
It could look like a principle where the action must look very different from the optimal action for a superficially similar but fundamentally different objective function (so that the AI will discount actions where most of that action is oriented around gaining optimization power that could, in principle, be applied to any objective function).
We already do some of this with regularization; the relevant trick is to figure out how to regularize in action/utility space instead of parameter/output space.
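One crude way to write down “strongly prefer simple/low-impact solutions” is a penalized objective, analogous to regularization but with the penalty defined over some measure of how much the action does rather than over model parameters. The impact measure here is a hypothetical placeholder; defining it well is the actual open problem.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")

def pick_low_impact_action(
    actions: List[A],
    utility: Callable[[A], float],
    impact: Callable[[A], float],  # hypothetical "how much does this action do" measure
    lam: float = 1.0,              # how strongly simplicity/low impact is preferred
) -> A:
    """Choose the action maximizing utility minus an impact penalty (regularization in action space)."""
    return max(actions, key=lambda a: utility(a) - lam * impact(a))
```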
Things that would improve corrigibility, but might also make AI functionally useless (no better than using a non-AI strategy):
The AI could be designed to output marginal improvements in a reward function, rather than to maximize reward.
For example, a marginal paperclip improver could, if properly built, output a set of possible actions which would cause more paperclips to be made than if the AI had not been switched on. Human operators could then pick a reasonable-looking action, and if that did not sufficiently improve paperclip output, they could reboot it and ask for another gradient-ascent step.
Sufficiently greedy/shortsighted human operators could use this strategy to just become a slow-motion version of a maximizer, but it would at least force some early level of legibility and give humans some time to notice if outputs are starting to get strange.
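A sketch of the shape of that protocol, with hypothetical generate_candidates and predicted_output functions standing in for the actual AI; the point is that the system returns a few baseline-beating options for a human to choose from, rather than an argmax.

```python
from typing import Callable, List, Optional, TypeVar

A = TypeVar("A")

def propose_marginal_improvements(
    generate_candidates: Callable[[], List[A]],        # hypothetical proposal generator
    predicted_output: Callable[[Optional[A]], float],  # predicted paperclips given an action (None = do nothing)
    n_proposals: int = 5,
) -> List[A]:
    """Return a handful of actions predicted to beat the do-nothing baseline, for a human to pick from."""
    baseline = predicted_output(None)
    improvements = [a for a in generate_candidates() if predicted_output(a) > baseline]
    # Deliberately return only a few legible options rather than the predicted optimum.
    return improvements[:n_proposals]
```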
The AI should strongly penalize actions which lead to humans misunderstanding or not understanding what it is trying to accomplish and how.
There may be no intersection of “things that human operators can understand” and “things it is useful for an AI to suggest,” in which case this property just turns the AI into Hello World.
...although Hello World is at least generally regarded as safe….
If the AI works by any sort of training, the AI should recognize when its inputs have gone out of distribution. If the world it finds itself in looks sufficiently different from what it was trained on, it should stop, halt, and at minimum ask for directions.
For some value of “sufficiently different from what it was trained on” that probably needs to be defined in action outcome space, not input space.
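A minimal sketch of “halt and ask for directions when out of distribution,” using a per-feature z-score against training statistics purely as a stand-in for a real detector; as the previous point notes, a serious version would likely need to measure strangeness in outcome space rather than raw input space.

```python
import numpy as np

class OODGuard:
    """Abstain when inputs look too far from the training distribution (toy z-score test)."""

    def __init__(self, training_inputs: np.ndarray, max_z: float = 4.0):
        self.mean = training_inputs.mean(axis=0)
        self.std = training_inputs.std(axis=0) + 1e-8  # avoid division by zero
        self.max_z = max_z

    def act(self, x: np.ndarray, policy):
        z = np.abs((x - self.mean) / self.std).max()
        if z > self.max_z:
            return "HALT_AND_ASK_FOR_DIRECTIONS"  # stop and defer to the operators
        return policy(x)
```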
Sandbag strategies—things that are unlikely to make AI corrigible, but might marginally decrease the odds of it killing everyone:
The AI should not have direct access to a network, or physical control of anything, i.e., it should have to actually bother escaping from the box before it can act on the world without supervision.
Yes, we have some evidence that this doesn’t work very well. Neither does CPR. We should still do it.
The AI should be incapable of modeling its own existence.
Lots of proposed AI failure modes hinge on the AI somehow increasing its own capabilities. Hopefully that’s harder to do if the AI cannot conceptualize “its own”.
This might be incompatible with useful intelligence.
This might not stop it from modeling the hypothetical existence of other future agents that share its objective function, and which it might try to bring into existence….