Why would a hardcoded model-based RL agent probably self-modify or build successors this way, though?
Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
Like, suppose you go to a casino with $100, and there are two buttons you can push—one button does nothing, and the other gives you a 60% chance to win a dollar and a 40% chance to lose a dollar. If you’re risk averse you might choose to only ever press the first button (not gamble).
If there’s some action you could take to enact a policy of pressing the second button 100 times, that’s like a third button, which gives about $20 on average with a standard deviation of about $10. Maybe you’d prefer that button to doing nothing even if you’re risk averse.
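The arithmetic above can be checked directly. A minimal sketch: each play has mean $0.2 and variance 0.96, so 100 independent plays have mean $20 and standard deviation √96 ≈ $9.8. To illustrate how risk aversion over a whole policy can flip the decision, the sketch scores each option as mean minus k standard deviations—that scoring rule and the value of k are assumptions for illustration, not something from the dialogue.

```python
import math

# One play: win $1 with probability 0.6, lose $1 with probability 0.4.
p_win = 0.6
mean_1 = p_win * 1 + (1 - p_win) * (-1)                    # 0.2
var_1 = p_win * 1**2 + (1 - p_win) * (-1)**2 - mean_1**2   # 0.96
sd_1 = math.sqrt(var_1)                                    # ~0.98

# 100 independent plays: mean and variance both scale linearly with n.
n = 100
mean_n = n * mean_1                  # 20.0
sd_n = math.sqrt(n * var_1)          # ~9.8

# Toy risk-averse score: mean minus k standard deviations.
# (k = 1 is an illustrative assumption, not from the text.)
k = 1.0
score_nothing = 0.0                  # the do-nothing button
score_single = mean_1 - k * sd_1     # one press of the gamble button
score_batch = mean_n - k * sd_n      # the "press it 100 times" button

# The single gamble scores below doing nothing, but the batched
# policy scores above it: risk aversion applied per-action and
# per-policy give different answers.
print(score_single, score_nothing, score_batch)
```

The point is that mean grows linearly in n while the standard deviation grows only as √n, so bundling actions into one policy-level choice can make an otherwise-refused gamble attractive under the same risk attitude.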
I was already thinking the AI would be risk averse over whole policies and the aggregate value of its future, not locally/greedily/separately over individual actions and individual unaggregated rewards.
I’m confused about how to do that because I tend to think of self-modification as happening when the agent is limited and can’t foresee all the consequences of a policy, especially policies that involve making itself smarter. But I suspect that even if you figure out a non-confusing way to talk about risk aversion for limited agents that doesn’t look like actions on some level, you’ll get weird behavior under self-modification, like an update rule that privileges the probability distribution you had at the time you decided to self-modify.