You say the right words (self-preservation is merely an instrumental goal), but the agent you describe seems willing to give up what are supposed to be its terminal goals in order to avoid shutdown. How is self-preservation merely instrumental, then?
I recently saw the notion of “reverse alignment” that might provide some wiggle room here (I’ll try and remember to edit in an attribution if I see this person go public). Basically, suppose the agent ranks a universe where an FAI is in control as 75% as good as a universe where it’s in control, with both measured relative to what it expects if it gets shut down. Then it will self-modify into the FAI whenever it expects its own chance of avoiding shutdown to be less than 75% of the FAI’s chance. Of course, the problem is that there might be some third UFAI design that ranks higher according to the original agent’s preferences and also has a low chance of being shut down. So if you have an AI that already has very small reverse-alignment divergence, plus a screening mechanism that’s both informative and loophole-free (so the UFAI route reliably does get caught and shut down), then the AI is incentivized to self-modify into the FAI.
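To spell out the arithmetic (my formalization of the above, not the original author’s; the notation is mine): normalize the shutdown outcome to value $0$ and let $V$ be the agent’s value for a universe it controls, so a FAI-controlled universe is worth $0.75V$. Writing $s_{\text{self}}$ and $s_{\text{FAI}}$ for the respective probabilities of avoiding shutdown, the agent prefers to self-modify exactly when

$$0.75\, s_{\text{FAI}}\, V \;>\; s_{\text{self}}\, V \quad\Longleftrightarrow\quad \frac{s_{\text{self}}}{s_{\text{FAI}}} \;<\; 0.75.$$

For example, if the agent gives itself a 20% chance of passing the screening and the FAI a 50% chance, then $0.2/0.5 = 0.4 < 0.75$, so self-modifying has the higher expected value ($0.5 \times 0.75V = 0.375V$ versus $0.2V$).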