3 more posts I feel like I need to write at some point:
In defense of dumb value alignment
Solving all of ethics and morality and getting an AI to implement it seems hard. There are possible worlds where we would need to make do with half measures. Some of these paths rely on lower auto-doom densities, but there seem to be enough such potential worlds to make them worth considering.
Example of ‘good enough to avoid x/s-risk’ dumb value alignment. Required assumptions for stability. The shape of questions this implies, which may differ from those of more complete solutions.
What I currently believe, in pictures
Make a bunch of diagrams of things I believe relevant to alignment stuff and how they interact, plus the implications of those things.
The real point of the post is to encourage people to make more explicit and extremely legible models, so that they can actually figure out where they disagree instead of running around in loops for several years.
Preparation for unknown adversaries is regularization
Generalizing the principle from policy regularization.
Adversaries need not be actual agents working against you.
“Sharp” models that aggressively exploit specific features have a fragile dependence on those features. Such models are themselves exploitable.
Uncertainty and chaos are strong regularizers. The amount of capability required to overcome even relatively small quantities of chaos can be extreme. (A toy sketch of this appears after these notes.)
Applications in prediction.
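As a rough illustration of the “sharp models are fragile” point (a minimal sketch of my own, not taken from any of the above; the gains of 50 and 2 and the noise scales are arbitrary choices for the illustration): a maximally confident predictor that leans on a single observed feature beats a hedged one on clean data, but modest input noise reverses the ordering under log loss.

```python
# Minimal sketch, assuming a scalar feature and log-loss scoring:
# input noise acts as a regularizer, punishing "sharp" high-confidence
# predictors that aggressively exploit one feature.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)

def log_loss(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

n = 100_000
signal = rng.normal(size=n)        # underlying state of the world
y = (signal > 0).astype(float)     # the thing being predicted

for noise in (0.0, 0.1, 0.3):
    # No adversary needed: the "attack" is just noise on the observed feature.
    x = signal + rng.normal(scale=noise, size=n)
    p_sharp = expit(50.0 * x)      # near-step function, maximal confidence
    p_smooth = expit(2.0 * x)      # hedged confidence
    print(f"noise={noise:.1f}  sharp={log_loss(p_sharp, y):.3f}  "
          f"smooth={log_loss(p_smooth, y):.3f}")
```

On clean inputs the sharp predictor’s loss is near zero and hedging looks wasteful; with even a little noise, the sharp predictor gets hammered on boundary cases it was maximally confident about, while the hedged one barely moves. That is the sense in which uncertainty does the work of regularization, and why overcoming even small amounts of it demands disproportionate capability.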