What are the smallest world and model trained on that world such that:
the world contains the model,
the model has a non-trivial reward,
the representation of the model in the world is detailed enough that the model can observe its own reward channel (e.g., its weights),
the model outputs non-trivial actions that can affect the reward (e.g., by modifying its weights).
What will happen? What will happen if there are multiple such instances of the model in the world?
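For concreteness, here is one way (certainly not the smallest) to instantiate those four conditions in a few lines of Python. The sizes, variable names, the particular reward, and the self-modification rule are all illustrative assumptions, and no learning is included; it only sets up the world/model/reward plumbing so the question above can be asked of it:

```python
# A toy of the setup above: the "world" state is the model's own weight
# vector plus one scalar task variable; the model observes that state and
# emits both a task action and a weight update. Everything specific here
# (DIM, the reward, the update rule) is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(0)

DIM = 3                      # weights the model can observe and modify
w = rng.normal(size=DIM)     # the model's weights, embedded in the world
task_x = rng.normal()        # a tiny "task" the nominal reward depends on

def observe(w, task_x):
    # The world representation is detailed enough that the model
    # sees its own weights alongside the task variable.
    return np.concatenate([w, [task_x]])

def act(obs, w):
    # The model: a single linear readout applied to its own weights,
    # producing a task action and a self-modification action.
    logits = np.tanh(obs[:DIM] @ w)
    task_action = logits
    weight_delta = 0.05 * np.sign(logits) * np.ones(DIM)
    return task_action, weight_delta

def reward(task_action, task_x, w):
    # Nominal reward: do well on the task. Because the reward also
    # touches quantities reachable through the weight channel, the
    # wireheading route (inflating w) is in principle available.
    return -(task_action - task_x) ** 2 + w.sum()

for t in range(20):
    obs = observe(w, task_x)
    task_action, weight_delta = act(obs, w)
    r = reward(task_action, task_x, w)
    w = w + weight_delta     # the action that modifies the reward machinery
    print(f"t={t:2d}  reward={r:+.3f}  |w|={np.linalg.norm(w):.3f}")
```

With a learning rule added on top, the open question is whether the trained model exploits the task term or the weight term of the reward, and what changes when several copies of the model share the same world.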
This is a good question, but I think the answer is going to be a dynamical system with just a few degrees of freedom: something like a “world” that is just a perceptron turned on itself somehow.
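One very rough reading of that, assuming the “world” state is nothing but the model’s own weight vector and the perceptron’s output is fed straight back as a weight update (both assumptions, not anything pinned down above), reduces the whole system to iterating a single map:

```python
# A perceptron "turned on itself": the weight vector is the entire world
# state, the perceptron reads that state as its input, and its output is
# fed back as a weight update. With two or three degrees of freedom the
# dynamics (fixed points, runaway growth, oscillation) can be mapped out
# by plain iteration. The specific feedback rule is an assumption.
import numpy as np

def step(w, lr=0.1):
    y = np.tanh(w @ w)          # perceptron applied to its own weights
    return w + lr * y * w       # output fed back as a self-update

w = np.array([0.3, -0.2])
trajectory = [w.copy()]
for _ in range(50):
    w = step(w)
    trajectory.append(w.copy())

print("final weights:", trajectory[-1])
print("growth factor:", np.linalg.norm(trajectory[-1]) / np.linalg.norm(trajectory[0]))
```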
That is the idea. I think we need to understand the dynamics of wireheading better. Humans sometimes seem to fall prey to it, but not always. What would happen to AIs?
Maybe we even need to go a step further and let the model model this process too.