Thanks for mentioning reflective stability; it’s exactly what I’ve been wondering about recently, and I didn’t know the term.
> However, using the formalism of utility functions, we are able to make decently convincing arguments that this self-improvement process will tend to preserve utility functions.
Can you point me to the canonical proofs/arguments for values being reflectively stable throughout self-improvement/reproduction towards higher intelligence? On the one hand, it seems implausible to me, based on the intuition that it’s incredibly difficult to predict the behaviour of a complex system more intelligent than you from static analysis. On the other hand, if it is true, then it would seem to hold just as much for humans themselves as the first link in the chain.
> Because if it foreseeably changed its utility function from X to Y, then probably it would be calculated by the X-maximizing agent to harm, rather than help, its utility, and so the change would not be made.
Specifically, the assumption that this is foreseeable at all seems to deeply contradict the notion of intelligence itself.
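To make that concrete, here is the argument as I understand it (my own notation, not taken from any particular source): an agent that currently maximizes a utility function X, when considering a self-modification that would leave it maximizing Y instead, evaluates the modification by its expected consequences for X, and so accepts it only if

$$\mathbb{E}\big[\,U_X \mid \text{run the modified, } Y\text{-maximizing agent}\,\big] \;\ge\; \mathbb{E}\big[\,U_X \mid \text{keep the current } X\text{-maximizing agent}\,\big],$$

which should generically fail whenever Y genuinely diverges from X, so the change is rejected. The step I doubt is evaluating the left-hand side at all, since that requires the current agent to predict what a successor more intelligent than itself would actually do.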
Like I said, there is no proof. Back in ancient times the arguments were made here:
http://selfawaresystems.com/2007/11/30/paper-on-the-basic-ai-drives/
and here: Basic AI drives (LessWrong)
For people trying to reason more rigorously and actually prove stuff, we mostly have problems and negative results:
Vingean Reflection: Reliable Reasoning for Self-Improving Agents (LessWrong)
Vingean Reflection: Open Problems (LessWrong)