the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax
Hmm. I wonder if you’d agree that the above relies on at least the following assumptions being true:
(i) It will actually be possible to (measure and) limit the amount of “optimization pressure” that an advanced A(G)I exerts (towards a given goal).
(ii) It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I’d be curious to read them.
I think realizing (i) would probably be at least nearly as hard as the whole alignment problem. Possibly harder. (I don’t see how one would in actual practice even measure “optimization pressure”.)
(i) It will actually be possible to (measure and) limit the amount of “optimization pressure” that an advanced A(G)I exerts (towards a given goal).

If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I’d be curious to read them.
For (i), it is not clear to me that this is impossible or even extremely difficult, at least to do in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with alignment techniques that can work in practice. We need some way to bound the adversary; otherwise we are essentially doomed by construction.
There is a whole bunch of ideas you can try here, and they work mostly independently and in parallel. Examples include:
1.) Quantilization
2.) Impact regularization
3.) General regularisation against energy use, thinking time, and compute cost
4.) Myopic objectives and reward functions; high discount rates (i.e. heavily discounting future reward)
5.) Limiting the serial compute of the model
6.) Action randomisation / increasing entropy, something like dropout over actions
7.) Satisficing utility/reward functions
8.) Distribution-matching objectives instead of argmaxing
9.) Penalisation of divergence from a ‘prior’ of human behaviour
10.) Maintaining value uncertainty estimates and acting conservatively within the outcome distribution
These are just the examples that came to mind immediately; there are a whole load more if you sit down and brainstorm for a while. To make the first idea concrete, a rough quantilization sketch follows.
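For instance, a q-quantilizer can be written in a few lines: instead of argmaxing an estimated utility, you sample candidate actions from a trusted base distribution (e.g. a model of human behaviour) and choose randomly from the top q fraction by estimated utility. Here is a minimal toy sketch; the base policy, the utility function, and all the numbers are made up purely for illustration and are not meant as a serious proposal:

```python
import numpy as np

def quantilize(base_policy_sampler, utility_estimate, n_samples=1000, q=0.1, rng=None):
    """Toy q-quantilizer: sample actions from a trusted base policy, then pick
    uniformly at random from the top q fraction as ranked by estimated utility.
    This caps how hard the agent optimizes relative to the base distribution."""
    rng = np.random.default_rng() if rng is None else rng
    # Draw candidate actions from the 'safe' base distribution (e.g. imitation of humans).
    candidates = [base_policy_sampler(rng) for _ in range(n_samples)]
    scores = np.array([utility_estimate(a) for a in candidates])
    # Keep the top q fraction by estimated utility...
    k = max(1, int(np.ceil(q * n_samples)))
    top_idx = np.argsort(scores)[-k:]
    # ...and sample uniformly among them, rather than argmaxing.
    return candidates[rng.choice(top_idx)]

if __name__ == "__main__":
    # Hypothetical usage: actions are scalar 'effort levels', base policy is human-like.
    rng = np.random.default_rng(0)
    base = lambda r: r.normal(loc=1.0, scale=0.5)   # stand-in human behaviour prior
    utility = lambda a: -(a - 1.5) ** 2             # stand-in (possibly mis-specified) utility
    print(f"chosen action: {quantilize(base, utility, n_samples=500, q=0.05, rng=rng):.3f}")
```

The single parameter q then acts as a crude knob on optimization pressure: q = 1 just reproduces the base policy, while smaller q pushes harder against a utility function that may be mis-specified.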
In terms of measuring optimization power, I don’t think this is that hard to do roughly. We can define it in terms of outcomes as the KL divergence between the achieved outcome distribution and some kind of prior ‘uncontrolled’ distribution, and we already implement KL penalties in RL in this way. Additionally, rough proxies include serial compute, energy expenditure, compute expenditure, divergence from previous behaviour, and so on.
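To make the measurement idea concrete, here is a toy sketch of the outcome-based definition, plus a KL-style penalty of the shape already used in KL-regularized RL (e.g. fine-tuning against a reference policy). The outcome space, the probabilities, and the beta coefficient are all invented for illustration:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two discrete outcome distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_penalized_reward(reward, logprob_policy, logprob_prior, beta=0.1):
    """Per-step KL-style penalty: the agent pays beta * (log pi(a|s) - log prior(a|s))
    for drifting away from the prior/reference behaviour."""
    return reward - beta * (logprob_policy - logprob_prior)

if __name__ == "__main__":
    # Hypothetical distributions over four coarse world-outcomes:
    # what happens if the system does nothing vs. after it acts.
    uncontrolled = [0.40, 0.30, 0.20, 0.10]
    achieved     = [0.05, 0.05, 0.10, 0.80]
    print(f"optimization power proxy: {kl_divergence(achieved, uncontrolled):.2f} nats")
```

The number this produces is only a proxy, but it illustrates the basic move: the further the achieved distribution is pushed away from the uncontrolled baseline, the more optimization pressure has been applied, and a penalty on exactly that quantity is something we already know how to train against.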
It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
The major issue is what level of alignment tax these solutions impose and whether the result is competitive with other players. This ultimately depends on the amount of slack that is available in the immediately post-AGI world. My feeling is that it is possible there is quite a lot of slack here, at least at first, and that most of the behaviours we really want to penalise for alignment purposes are quite far from the most likely behaviour. I.e., there is very little benefit to us in the AGI having such a low discount rate that it is planning how to tile the universe with paperclips billions of years from now.
I also don’t think of these so much as solutions in themselves but as parts of the solution: we still need to find good, robust ways of encoding human values as goals, detect and prevent inner misalignment, and have some approach to managing Goodharting.