This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.
There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it’s just the utility function you endorse, and that’s up to you. The other black box is the space of programs that you could be. Maybe it’s limited by memory or by run time; maybe it’s any finite state machine with fewer than 10^20 states, or any Python program under 5000 characters: some limited set of programs that take your sensory data and motor-output history as input and return a motor output. The exact limitations could be whatever; they don’t have to look like this.
Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only choose policies from the limited space of bounded programs you could be. Their expected utility assignments over that space of programs are then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the ideal agent, then that change is an improvement.
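To make the construction concrete, here is a minimal toy sketch of the ranking step. Everything specific in it is a stand-in I’m inventing for illustration: a two-environment bandit prior playing the role of the ideal agent’s “right prior,” a handful of hand-coded policies playing the role of the space of bounded programs, and total reward standing in for the true utility function.

```python
import random

HORIZON = 10  # episode length for the toy environment

def run_episode(policy, env_bias, rng):
    """One episode of a two-armed bandit: each step the policy sees its
    (action, reward) history and picks arm 0 or 1. Arm 1 pays off with
    probability env_bias, arm 0 with probability 1 - env_bias."""
    history = []
    total_reward = 0.0
    for _ in range(HORIZON):
        action = policy(history)
        p = env_bias if action == 1 else 1.0 - env_bias
        reward = 1.0 if rng.random() < p else 0.0
        history.append((action, reward))
        total_reward += reward
    return total_reward

# The "space of programs you could be": a few bounded policies mapping
# history -> action. In the real construction this would be something like
# all programs under a length or state bound.
def always_0(history): return 0
def always_1(history): return 1
def win_stay_lose_shift(history):
    # repeat the last action if it was rewarded, otherwise switch arms
    if not history:
        return 0
    action, reward = history[-1]
    return action if reward > 0 else 1 - action

POLICY_SPACE = {"always_0": always_0, "always_1": always_1,
                "win_stay_lose_shift": win_stay_lose_shift}

# The ideal agent's "right prior" over environments (made up for the toy):
# arm 1 is either good (bias 0.8) or bad (bias 0.2), equally likely.
PRIOR = {0.8: 0.5, 0.2: 0.5}

def expected_utility(policy, n_samples=2000, seed=0):
    """Monte Carlo estimate of expected total reward under the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        bias = rng.choices(list(PRIOR), weights=list(PRIOR.values()))[0]
        total += run_episode(policy, bias, rng)
    return total / n_samples

# The ideal agent's ranking of the bounded programs: higher is better,
# and moving to a higher-ranked program counts as an improvement.
ranking = sorted(((expected_utility(p), name) for name, p in POLICY_SPACE.items()),
                 reverse=True)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy version the win-stay-lose-shift policy comes out above either constant policy, since it adapts to whichever environment the prior sampled; the point is only that the ideal agent’s expected-utility scores induce a ranking over the bounded policy space, which is the proposed measure of rationality.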
The main thing I want to point out is that this is an idealized notion of non-idealized decision theory—in other words, it’s still pretty useless to me as a bounded agent without some advice about how to approximate it. I can’t very well turn into this max-expected-value bounded policy.
But there are other barriers, too. Figuring out which utility function I endorse is a hard problem. And we face the challenges of embedded decision theory: how do we reason about the counterfactual of changing our policy to the better one?
Modulo those concerns, I do think your description is roughly right, and carries some important information about what it means to self-modify in a justified way rather than cargo-culting.