Are you assuming there are other kinds of agents in the environment?
Yes, e.g. humans, AIs trained to imitate humans, AIs trained by amplification, and RL agents with reward functions that encourage implicit extortion (e.g. approval-directed agents whose overseers endorse implicit extortion).
If RL can’t handle long-term consequences, and these are the only kinds of agents we can build, that seems to favor short-term values over long-term values. (I guess this is a more general observation not directly related to implicit extortion per se.)
I agree this can shift our values (and indeed that justified my work on alternatives to RL), but doesn’t seem related to implicit extortion in particular.
If alignment can only be achieved with RL-like agents that can't handle long-term consequences, while unaligned AIs can be built in other ways that do handle long-term consequences, that would shift the relative balance of influence toward those kinds of unaligned AIs.
I agree with this. I’m happy to say that implicit extortion affects long-term values by changing which skills are important, or by changing which types of AI are most economically important.
This effect seems less important to me than the direct negative impact of introducing new surface area for conflict, which probably decreases our ability to solve problems like AI alignment. My best guess is that the balance-shifting effect is actually positive, since RL seems relatively hard to align.
If you had instead framed the conclusion as something like "It's not currently clear whether implicit extortion will shift the relative balance of influence towards unaligned AIs, so we should prioritize other problems that we know are definitely problems," I think I'd find that less objectionable.
“Doesn’t seem to” feels like a fair expression of my current epistemic state. I can adjust “should wash out” to “doesn’t seem to have a big effect.”