Whether we can build artificial empathy into AI systems also has clear relevance to AI alignment.
I disagree. My tentative guess would be that in the majority of worlds where humanity survives and flourishes, {AGI having empathy} contributed ~nothing to achieving that success. (For most likely interpretations of “empathy”.)
If we can create empathic AIs, then it may become easier to make an AI be receptive to human values, even if humans can no longer completely control it.
I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)
You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it’s a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.
Also, doing a quick bit of Rationalist Taboo on “empathy”, it looks to me like that word is pointing at a rather complicated, messy swath of territory. I think that swath contains many subtly and not-so-subtly different things, most of which would not begin to be sufficient for alignment (albeit that some might be necessary).
I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)
Yep, this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists, it might provide a base signal which we can then further optimize to get the agent to assign some kind of utility to others. The majority of the work will of course be getting that to actually work in a robust way.
You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it’s a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.
Yes. Realistically, I think almost any proxy like this will break down under strong enough optimization pressure, and the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax.
the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax
Hmm. I wonder if you’d agree that the above relies on at least the following assumptions being true:
(i) It will actually be possible to (measure and) limit the amount of “optimization pressure” that an advanced A(G)I exerts (towards a given goal).
(ii) It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I’d be curious to read them.
I think realizing (i) would probably be at least nearly as hard as the whole alignment problem. Possibly harder. (I don’t see how one would in actual practice even measure “optimization pressure”.)
(i) It will actually be possible to (measure and) limit the amount of “optimization pressure” that an advanced A(G)I exerts (towards a given goal).
If so, how likely do you think (i) is to be true?
If you have any ideas (even very rough/vague ones) for how to realize (i), I’d be curious to read them.
On this: it is not clear to me that it is impossible, or even extremely difficult, to do this at least in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with alignment techniques that can work in practice. We need some way to bound the adversary; otherwise we are essentially doomed by construction.
There are a whole bunch of ideas you can try here, which work mostly independently and in parallel. Examples of this are:
1.) Quantilization (a toy sketch follows this list)
2.) Impact regularization
3.) General regularization against energy use, thinking time, and compute cost
4.) Myopic objectives and reward functions; high discount rates
5.) Limiting the serial compute of the model
6.) Action randomization / increasing entropy (something like dropout over actions)
7.) Satisficing utility/reward functions
8.) Distribution-matching objectives instead of argmaxing
9.) Penalization of divergence from a ‘prior’ of human behaviour
10.) Maintaining value uncertainty estimates and acting conservatively within the outcome distribution
These are just examples off the top of my head; there are a whole load more if you sit down and brainstorm for a while.
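To make the first item a bit more concrete, here is a minimal toy sketch of a q-quantilizer over a discrete action set (pure numpy; the action names, utilities, and base distribution are all made up for illustration, not a worked-out proposal): instead of argmaxing the utility estimate, you sample from a base/prior distribution restricted to the top-q fraction of its probability mass as ranked by utility.

```python
import numpy as np

def quantilize(actions, utility, base_probs, q=0.1, rng=None):
    """Toy q-quantilizer: sample from the base distribution restricted to the
    top-q fraction of its probability mass, ranked by estimated utility,
    instead of taking the argmax action outright."""
    rng = rng or np.random.default_rng()
    order = np.argsort(utility)[::-1]          # candidate actions, best first
    cum_mass = np.cumsum(base_probs[order])    # base probability mass as we walk down the ranking
    keep = order[cum_mass <= q]                # the top-q slice of the base distribution
    if keep.size == 0:                         # degenerate case: q is smaller than the best action's mass
        keep = order[:1]
    probs = base_probs[keep] / base_probs[keep].sum()
    return actions[rng.choice(keep, p=probs)]

# Illustrative use: a 'human-like' base policy that rarely takes drastic actions.
actions = ["wait", "small_nudge", "drastic_intervention"]
utility = np.array([0.1, 0.5, 5.0])
base = np.array([0.60, 0.35, 0.05])
print(quantilize(actions, utility, base, q=0.4))  # usually "small_nudge", occasionally "drastic_intervention"
```

The knob q is the point: shrinking it limits how far the agent is allowed to stray from the base distribution in pursuit of utility.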
In terms of measuring optimization power, I don’t think this is that hard to do roughly. We can define it in terms of outcomes as the KL divergence of the achieved distribution from some kind of prior ‘uncontrolled’ distribution; we already implement KL penalties like this in RL. Additionally, rough proxies include serial compute, energy expenditure, compute expenditure, divergence from previous behaviour, etc.
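As a toy illustration of that outcome-based definition (again pure numpy, with made-up distributions; the penalties actually used in RL are typically KL to a reference policy rather than over outcomes):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) over a discrete outcome space: a crude proxy for how far the
    agent has pushed the world away from the 'uncontrolled' distribution q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def penalized_return(reward, achieved_dist, prior_dist, beta=1.0):
    """Reward minus a KL penalty; beta trades achieved reward against how much
    the agent is allowed to move the outcome distribution."""
    return reward - beta * kl_divergence(achieved_dist, prior_dist)

prior    = [0.70, 0.25, 0.05]   # outcome distribution if the world is left alone (illustrative)
achieved = [0.10, 0.20, 0.70]   # outcome distribution after the agent interferes
print(kl_divergence(achieved, prior))             # larger => more optimization pressure exerted
print(penalized_return(3.0, achieved, prior, beta=2.0))
```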
It will be possible to end the acute risk period using an A(G)I that is limited in the above way.
The major issue is what level of alignment tax these solutions impose and whether the result is competitive with other players. This ultimately depends on the amount of slack that is available in the immediately post-AGI world. My feeling is that it is possible there is quite a lot of slack here, at least at first, and that most of the behaviours we really want to penalise for alignment purposes are quite far from the most likely behaviour; i.e., there is very little benefit to us in the AGI having such a low discount rate that it is planning about tiling the universe with paperclips billions of years from now.
I also don’t think of these so much as solutions but as parts of the solution; i.e., we still need to find good, robust ways of encoding human values as goals, detect and prevent inner misalignment, and have some approach to managing Goodharting.