In “What Makes Corrigibility Special”, where you use the metaphor of goals as a two-dimensional energy landscape, it is not clear what type of goals are being considered.
Are these utility functions over world-states? If so, corrigibility cannot AFAIK be easily expressed as one, and so doesn’t really fit into the picture.
If not, it’s not clear to me why most of this space is flat: agents are embedded, and many things we do in service of goals will change us in ways that don’t conflict with our existing goals, including developing new goals. E.g. if I have the goal of graduating college, I will meet people along the way and perhaps gain the goal of being president of the math club, a liberal political bent, etc.
The idea behind the goal space visualization is to have all goals, not necessarily those restricted to world states. (Corrigibility, I think, involves optimizing over histories, not physical states of the world at some time, for example.) I mention in a footnote that we might want to restrict to “unconfused” goals.
The goal space is flat because preserving one’s (terminal) goals (including avoiding adding new ones) is an Omohundro Drive and I’m assuming a certain level of competence/power in these agents. If you gain terminal goals like being president of the math club by going to college, doing so is likely hurting your long-run ability to get what you want. (Note: I am not talking about instrumental goals.)