I imagine a reward function would be implemented such that it takes the current state of the universe as input. Obviously it wouldn’t be the state of the entire universe; it would be limited to some finite observable portion of it, whatever the sensory capabilities of the agent are. So if the “capacity” of the agent were increased, that could mean different things: it could mean that the agent can now observe a greater share of the universe at a given point in time, or it could mean that the agent has more computational power to calculate some function of the current input. Depending on how that increased capacity manifests, it could drastically alter the final output of the reward function. So a properly “aligned” reward function would essentially have to remain aligned under all possible changes to its implementation details. And the difficulty with that is: can an agent reliably “predict” the outcome of its own computations under capacity increases?
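As a rough illustration of the two kinds of capacity increase, here is a minimal toy sketch (the world representation, the parameter names, and the reward form are all made-up assumptions for illustration, not anything specific from the above):

```python
import numpy as np

# Toy stand-in for "the state of the universe": a vector of feature values.
rng = np.random.default_rng(0)
world_state = rng.normal(size=1000)

def reward(state, observable_fraction, compute_budget):
    """Toy reward: a function of only what the agent can observe and process.

    observable_fraction -- share of the world the agent's sensors cover (illustrative)
    compute_budget      -- how many observed values it can actually process (illustrative)
    """
    visible = state[: int(len(state) * observable_fraction)]  # sensory limit
    processed = visible[:compute_budget]                      # compute limit
    return processed.sum()  # some function of what was actually processed

# The same underlying world state, evaluated under different capacities,
# yields different reward values:
print(reward(world_state, observable_fraction=0.1, compute_budget=50))
print(reward(world_state, observable_fraction=0.5, compute_budget=50))    # wider observation
print(reward(world_state, observable_fraction=0.1, compute_budget=5000))  # more compute
```

The point of the sketch is just that either axis of “capacity” changes what the reward function actually outputs on the same world state, so alignment of the function can’t be evaluated independently of those implementation details.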