Max, as you can see from Eliezer’s reply, MIRI people (and other FAI proponents) are largely already aware of the problems you brought up in your paper. (Personally I think they are still underestimating the difficulty of solving those problems. For example, Peter de Blanc and Eliezer both suggest that humans can already solve ontological crises, implying that the problem is merely one of understanding how we do so. However I think humans actually do not already have such an ability, at least not in a general form that would be suitable for implementing in a Friendly AI, so this is really a hard philosophical problem rather than just one of reverse engineering.)
Also, you may have misunderstood why Nick Bostrom talks about “goal retention” in his book. I think it’s not meant as an argument in favor of building FAI (as you suggest in the paper), but rather as an argument that AIs are dangerous in general, since they will resist human attempts to change their goals if we realize we built them with the wrong final goals.
Thanks Wei for these interesting comments. Whether humans can “solve” ontological crises clearly depends on one’s definition of “solve”. Although there’s arguably a clear best solution to de Blanc’s corridor example, it’s far from clear that any behavior deserves to be called a “solution” when the ontological update causes the rational agent’s entire worldview to crumble, revealing the goal to have been fundamentally confused and undefined beyond repair. That’s what I was getting at with my souls example.
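To make the souls example concrete, here is a toy sketch (the ontologies and names are invented for illustration, not taken from the paper): a utility function written against an ontology that contains souls does not become merely hard to evaluate after the update to a soul-free ontology; it has nothing left to refer to.

```python
# Toy illustration (hypothetical ontologies): a goal defined over an old
# ontology containing souls, evaluated after an update to an ontology
# that contains no soul-like objects at all.

OLD_ONTOLOGY = {"souls": ["alice_soul", "bob_soul"], "heaven": set()}
NEW_ONTOLOGY = {"particles": ["p1", "p2", "p3"]}  # no "souls" key anywhere

def utility(world_state):
    # "Maximize the number of souls that reach heaven."
    return sum(1 for s in world_state["souls"] if s in world_state["heaven"])

print(utility(OLD_ONTOLOGY))   # 0 -- well-defined, just not yet satisfied
# utility(NEW_ONTOLOGY)        # KeyError: 'souls' -- the goal is not wrong,
#                              # it is undefined in the new worldview
```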
As to what Nick’s views are, I plan to ask him about this when I see him tomorrow.
In the link you suggest that ontological crises might lead to nihilism, but I think a much more likely prospect is that they lead to relativism with respect to the original utility function. That is, there are solutions to the re-interpretation problem which, for example, allow us to talk of “myself” and “others” despite the underlying particle physics. But there is more than one such solution, and none of them is forced. Thus the original “utility function” fails to be one, strictly speaking: it does not really specify which actions are preferred; it only does so modulo a choice of interpretation.
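To illustrate the “modulo a choice of interpretation” point, here is a minimal sketch (the mappings and state variables are hypothetical, chosen only for illustration): the same old-ontology utility function, composed with two equally defensible re-interpretations of “myself” in a particle-level ontology, ranks the same outcome oppositely.

```python
# Old-ontology utility: "I prefer outcomes in which I survive."
def old_utility(old_state):
    return 1.0 if old_state["myself_survives"] else 0.0

# Two candidate re-interpretations of "myself" in the new ontology.
# Neither is forced by the underlying physics.
def interp_body(new_state):
    # "myself" = this particular continuous physical body
    return {"myself_survives": new_state["body_intact"]}

def interp_pattern(new_state):
    # "myself" = anything instantiating this pattern (e.g. an upload)
    return {"myself_survives": new_state["body_intact"] or new_state["upload_running"]}

# Outcome: the body is destroyed, but a faithful upload keeps running.
outcome = {"body_intact": False, "upload_running": True}

print(old_utility(interp_body(outcome)))     # 0.0 -- bad under one interpretation
print(old_utility(interp_pattern(outcome)))  # 1.0 -- good under the other
# The "utility function" only ranks outcomes modulo the unforced choice
# of interpretation, which is the relativism described above.
```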
So, all we need to do is figure out each of the possible ways physics might develop, and map out utility functions in terms of each possible physics! Or, we could admit that talk of utility functions should be recognized as neither descriptive nor truly normative, but rather as a crude, mathematically simplified approximation to human values. (Which may be congruent with your conclusions; I just arrive there by a different route.)