Unfortunately, it’s basically impossible for an RL algorithm to learn to avoid this, because the negative consequences only appear over a very long timescale. In fact, the timescale for the negative consequences is longer than the timescale over which the RL agent adjusts its policy, which is too long for a traditional RL system to possibly do the credit assignment.
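To make the timescale point concrete, here is a minimal numerical sketch of my own (not from the post; it assumes a standard discounted objective and the specific numbers are hypothetical): a penalty that only arrives far beyond the effective horizon ~1/(1−γ) is down-weighted by γ^T and contributes essentially nothing to the return the learner optimizes, so the short-term gain from complying dominates.

```python
# Minimal illustration (my own sketch, not from the post; all numbers are hypothetical):
# under a standard discounted objective, a cost arriving far beyond the agent's
# effective horizon ~1/(1-gamma) is weighted by gamma**delay and barely registers.

def discounted_weight(delay_steps: int, gamma: float) -> float:
    """Weight given to a reward arriving `delay_steps` in the future
    in a discounted return sum_t gamma**t * r_t."""
    return gamma ** delay_steps

gamma = 0.99              # effective horizon ~ 1 / (1 - 0.99) = 100 steps
immediate_gain = 1.0      # short-term payoff from complying with the extortionist
delayed_penalty = -50.0   # long-run cost of becoming a known target for extortion
delay = 10_000            # the cost lands long after the effective horizon

perceived_value = immediate_gain + discounted_weight(delay, gamma) * delayed_penalty
print(perceived_value)    # ~1.0: the large delayed cost is effectively invisible
```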
Engaging in implicit extortion seems to require thinking about long-term consequences on a time scale similar to that needed to avoid implicit extortion, so if RL can’t handle long-term consequences, are you assuming there are other kinds of agents in the environment?
In particular, every agent can engage in implicit extortion and so it doesn’t seem to shift the relative balance of influence amongst competing agents.
I can think of a couple of ways this might be false:
1. If RL can’t handle long-term consequences, and these are the only kinds of agents we can build, that seems to favor short-term values over long-term values. (I guess this is a more general observation not directly related to implicit extortion per se.)
2. If alignment can only be done through RL-like agents that can’t handle long-term consequences, but there are other ways to build unaligned AIs which can handle long-term consequences, that would shift the relative balance of influence to those kinds of unaligned AIs.
It seems to me that 1 and 2 are potentially serious problems that we should keep in mind, and it’s too early to conclude that these problems should “wash out in the long run.” (If you had instead framed the conclusion as something like “It’s not currently clear whether implicit extortion will shift the relative balance of influence towards unaligned AIs, so we should prioritize other problems that we know definitely are problems,” I think I’d find that less objectionable.)
are you assuming there are other kinds of agents in the environment
Yes, e.g. humans, AIs trained to imitate humans, AIs trained by amplification, and RL agents with reward functions that encourage implicit extortion (e.g. approval-directed agents whose overseers endorse implicit extortion).
If RL can’t handle long-term consequences, and these are the only kinds of agents we can build, that seems to favor short-term values over long-term values. (I guess this is a more general observation not directly related to implicit extortion per se.)
I agree this can shift our values (and indeed that justified my work on alternatives to RL), but it doesn’t seem related to implicit extortion in particular.
If alignment can only be done through RL-like agents that can’t handle long-term consequences, but there are other ways to build unaligned AIs which can handle long-term consequences, that would shift the relative balance of influence to those kinds of unaligned AIs.
I agree with this. I’m happy to say that implicit extortion affects long-term values by changing which skills are important, or by changing which types of AI are most economically important.
This effect seems less important to me than the direct negative impact of introducing new surface area for conflict, which probably decreases our ability to solve problems like AI alignment. My best guess is that this effect is actually positive, since RL seems relatively hard to align, so shifting influence away from RL agents isn’t obviously bad.
If you had instead framed the conclusion as something like “It’s not currently clear whether implicit extortion will shift the relative balance of influence towards unaligned AIs, so we should prioritize other problems that we know definitely are problems,” I think I’d find that less objectionable.
“Doesn’t seem to” feels like a fair expression of my current epistemic state. I can adjust “should wash out” to “doesn’t seem to have a big effect.”