The part about relying on a model to adjudicate itself seems really off. I have a hard time imagining a thing that performs better at the meta level than at the object level. Do you have a concrete example?
Let me rephrase it: FAI has a part of its utility function that decides how to “aggregate” our values, how to resolve disagreements and contradictions in our values, and how to extrapolate our values.
Is FAI allowed to change that part? Because if not, it is stuck with our initial guess on how to do that, forever. That seems like it could be really bad.
Actual example:
-What if groups of humans self-modify to care a lot about some particular issue, in an attempt to influence FAI?
More far-fetched examples:
-What if a rapidly-spreading mind virus drastically changes the values of most humans?
-What if aliens create trillions of humans that all recognize the alien overlords as their masters?
To be clear about the point of these examples: they are cases where a “naive” aggregation function might allow itself to be influenced, while a “recursive” function would follow the meta-reasoning that we wouldn’t want FAI’s values and behavior to be influenced by adversarial modification of human values, only by genuine changes in them (whatever “genuine” means to us; I’m sure that’s a very complex question, which is kind of the point of needing recursive reasoning. Human values are very complex. Why would human meta-values be any less complex?)
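To make the naive-vs-recursive contrast concrete, here is a toy sketch (everything in it — the names, the one-dimensional “values”, and the `manipulated` flag — is hypothetical, invented purely for illustration; actually detecting “genuine” change is the hard part the comment is pointing at): a naive aggregator averages whatever values the population currently reports, so it can be swamped by adversarially modified agents, while a “recursive” aggregator applies a meta-level rule that discounts such modifications.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    values: float        # toy 1-D stand-in for a whole value profile
    manipulated: bool    # toy flag: were these values adversarially modified?

def naive_aggregate(pop):
    # Averages whatever the population currently reports, so flooding
    # it with modified agents shifts the result toward the attacker.
    return sum(a.values for a in pop) / len(pop)

def recursive_aggregate(pop):
    # Toy meta-level rule: only count agents whose value changes are
    # genuine. (In reality, deciding what counts as "genuine" is the
    # complex, unsolved part.)
    genuine = [a for a in pop if not a.manipulated]
    return sum(a.values for a in genuine) / len(genuine)

# 3 unmodified agents, plus 7 agents modified to care about some issue.
population = [Agent(0.0, False)] * 3 + [Agent(10.0, True)] * 7
print(naive_aggregate(population))      # 7.0 -- swamped by modified agents
print(recursive_aggregate(population))  # 0.0 -- tracks only genuine values
```

The point of the sketch is only that the two aggregators diverge exactly when human values are modified adversarially; it says nothing about how a real FAI would distinguish manipulation from genuine change.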