Oops, I do think that’s what I meant. To explain my wording: when I imagined a “system optimizing for X”, I didn’t imagine a system trying its hardest to do X; I imagined “a system such that the variable Z it can best be described as optimizing is X”.
To put it concretely another way: I mean that there are a bunch of different systems S1,S2,...,Sn that, when “trying to optimize for X as hard as possible”, all look to us like they optimize for X successfully, but do so via methods M1,M2,...,Mn that lead to vastly different (and generally undesirable) end states Y1,Y2,...,Yn, like the one described in this post, or one where the operators become ISIS suicide bombers. In this light, it seems more accurate to describe Si as optimizing for Yi rather than X, even though Si is trying to optimize for X and optimizes it fairly successfully. But I really don’t want a superintelligent system optimizing for some Y that is not my values.
As a possibly related general intuition, I think the space of outcomes that can result from having a human follow a sequence of suggestions, each of which they’d enthusiastically endorse, is massive, and that most of these outcomes are undesirable. (It’s possible that one crisp articulation of “sufficient metaphilosophical competence” is that following a sequence of suggestions, each of which you’d enthusiastically endorse, is actually good for you.)
On reflection, I agree that neither future approval nor idealized preferences are particularly likely, and that whatever Y is would actually look very alien.