It seems to me that there are natural ways to implement value loading with UDT agents that have the properties you’re looking for. For example, if the agent values eating cookies in universes where its creator wants it to eat cookies, and values not eating cookies in universes where its creator doesn’t want it to eat cookies (glossing over how to define “creator wants” for now), then I don’t see why such an agent would manipulate its own moral changes or avoid asking whether eating cookies is bad. So I’m not seeing the motivation for coming up with another decision theory framework here...
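To make this concrete, here is a minimal sketch of the kind of utility function I have in mind, with $W(w)$ standing in for the glossed-over “the creator in world $w$ wants the agent to eat cookies” predicate:

$$
U(w, a) \;=\;
\begin{cases}
1 & \text{if } W(w) \text{ and } a = \text{eat} \\
1 & \text{if } \neg W(w) \text{ and } a = \text{don't eat} \\
0 & \text{otherwise.}
\end{cases}
$$

The UDT agent just picks the policy $\pi$ maximizing $\sum_w P(w)\, U(w, \pi(w))$, where $\pi(w)$ is the action that policy ends up taking in world $w$. Assuming $W$ is defined so that the agent’s own behavior doesn’t determine it (which is part of what “creator wants” has to cash out), the agent gains nothing by tampering with the value-loading process, and evidence about $W$ is straightforwardly useful for choosing the right action in each world.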