If you do model-free RL with a reward that rewards risk-averse behavior and penalizes risk-taking, inner optimization or other unintended solutions could definitely still lead to problems if they crop up; they wouldn't have to inherit the risk aversion.
With model-based RL it seems pretty feasible to hard-code risk aversion in. You just have to use the world-model to predict probability distributions over outcomes (maybe implicitly), and then you can be risk-averse more directly when using those predictions. This probably wouldn't be stable under self-reflection, though: when evaluating self-modification plans, or plans for building a successor agent, keeping the risk aversion around might itself appear to carry some long-term risks.
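As a purely illustrative aside: here is a minimal sketch of what hard-coding risk aversion into model-based planning could look like, assuming a hypothetical world model that can sample a return for a candidate plan (the world_model.sample_return(plan) interface and everything else below is invented for the example, not something from this thread). Plans get ranked by a risk-averse statistic of the sampled return distribution, here the mean of the worst tail (CVaR), instead of the plain expected return.

```python
import numpy as np

def cvar_plan_score(world_model, plan, n_samples=1000, alpha=0.1):
    """Risk-averse plan score: the mean of the worst alpha-fraction of returns
    sampled from the world model (CVaR), rather than the plain mean return."""
    returns = np.array([world_model.sample_return(plan) for _ in range(n_samples)])
    cutoff = np.quantile(returns, alpha)      # boundary of the worst alpha-fraction
    return returns[returns <= cutoff].mean()  # average over the bad tail only

def choose_plan(world_model, candidate_plans):
    # Pick the plan whose bad tail looks least bad, not the plan with the highest mean.
    return max(candidate_plans, key=lambda plan: cvar_plan_score(world_model, plan))
```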
Risk aversion wouldn’t help humanity much if we build unaligned AGI anyhow. The least risky plans from the AI’s perspective are still gonna be bad for humans.
But something like “moral risk aversion” could still be stable under reflection (because moral uncertainty isn’t epistemic uncertainty) and might end up being a useful way to express how we want the AI to reason.
Thanks! This makes sense.
I agree model-free RL wouldn’t necessarily inherit the risk aversion, although I’d guess there’s still a decent chance it would, because that seems like the most natural and simple way to generalize the structure of the rewards.
Why would hardcoded model-based RL probably self-modify or build successors this way, though? To deter or prevent threats from being made in the first place, or from being followed through on? But does that actually deter or prevent our threats when the plan is evaluated ahead of time, with the original preferences? We’d still want to shut it and any successors down if we found out (whenever we do find out, or once it starts trying to take over), and it should be averse to that increased risk ahead of time when evaluating the plan.
Risk aversion wouldn’t help humanity much if we build unaligned AGI anyhow. The least risky plans from the AI’s perspective are still gonna be bad for humans.
I think there are (at least) two ways to reduce this risk:
Temporal discounting. The AI wants to ensure its own longevity, but it’s really focused on the very near term, just making it through the next day or hour, or whatever. So increasing the risk of being caught and shut down now by doing something sneaky looks bad even if it increases expected longevity significantly, because the AI is discounting the future so heavily. It will be more incentivized to do whatever people appear to want it to do ~now (regardless of impacts on the future), or else risk being shut down sooner.
Difference-making risk aversion, i.e. being risk averse with respect to the difference from inaction (or some default safe action).[1] This makes inaction look relatively more attractive. (In this case, I think the agent can’t be represented by a single consistent utility function over time, so I wonder if self-modification or successor risks would be higher, to ensure consistency.)
And you could fix this to be insensitive to butterfly effects by comparing quantile functions as random variables instead.
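To make these two mechanisms concrete, here is a rough sketch under assumptions of my own rather than anything stated in the thread: the per-day discount factor, the Monte Carlo samples standing in for world-model rollouts, and the piecewise-linear valuation of differences are all invented for illustration. Part (1) shows how steep temporal discounting makes a distant payoff nearly worthless; part (2) scores a plan by the differences between its return quantiles and the quantiles of doing nothing, weighting downside differences more heavily. Comparing quantile functions rather than paired rollouts is what makes it insensitive to butterfly effects.

```python
import numpy as np

# (1) Temporal discounting: with a steep per-step discount, a large payoff far
# in the future barely registers, so risking shutdown now for long-term gain
# looks bad even if it raises expected longevity a lot.
gamma = 0.9                  # hypothetical per-day discount factor
print(1000 * gamma**100)     # a 1000-unit payoff 100 days out is worth ~0.03 today

# (2) Difference-making risk aversion, evaluated on quantile functions.
def difference_making_score(plan_returns, default_returns, loss_weight=2.0):
    """Risk-averse score of a plan relative to a default "do nothing" plan.

    Rather than pairing individual rollouts (where butterfly effects would
    dominate), compare the two quantile functions: difference the plan's and
    the default's returns at matched quantile levels, then count downside
    differences loss_weight times as heavily as equal-sized gains.
    """
    levels = np.linspace(0.005, 0.995, 199)
    diffs = np.quantile(plan_returns, levels) - np.quantile(default_returns, levels)
    valued = np.where(diffs >= 0, diffs, loss_weight * diffs)
    return valued.mean()

# Toy usage: random draws standing in for world-model rollouts.
rng = np.random.default_rng(0)
default_returns = rng.normal(0.0, 1.0, size=10_000)   # outcomes of doing nothing
risky_plan = rng.normal(0.3, 3.0, size=10_000)        # slightly higher mean, much riskier
safe_plan = rng.normal(0.3, 1.0, size=10_000)         # same mean gain, no added risk

print(difference_making_score(risky_plan, default_returns))  # negative: inaction preferred
print(difference_making_score(safe_plan, default_returns))   # ~0.3: the safe improvement is accepted
```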
Why would hardcoded model-based RL probably self-modify or build successors this way, though?
Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
Like, suppose you go to a casino with $100, and there are two buttons you can push: one button does nothing, and with the other button you have a 60% chance to win a dollar and a 40% chance to lose a dollar. If you’re risk averse you might choose to only ever press the first button (not gamble).
If there’s some action you could take to enact a policy of pressing the second button 100 times, that’s like a third button, which gives about $20 on average with a standard deviation of about $10. Maybe you’d prefer that button to doing nothing even if you’re risk averse.
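A quick numeric check of this example, using a mean-minus-k-standard-deviations score as a stand-in for risk aversion (that particular criterion is my choice for illustration, not something from the thread): evaluated press by press, the gamble looks worse than doing nothing, but evaluated as a 100-press policy it looks better.

```python
import numpy as np

# One press of the gambling button: +$1 with probability 0.6, -$1 with probability 0.4.
p = 0.6
mean_one = p * 1 + (1 - p) * (-1)                              # $0.20
sd_one = np.sqrt(p * 1**2 + (1 - p) * (-1)**2 - mean_one**2)   # ~$0.98

# The "third button": committing to 100 independent presses.
n = 100
mean_policy = n * mean_one                                     # $20
sd_policy = np.sqrt(n) * sd_one                                # ~$9.80

def risk_averse_score(mean, sd, k=1.0):
    # Illustrative risk-averse criterion: expected value minus k standard deviations.
    return mean - k * sd

print(risk_averse_score(mean_one, sd_one))        # ~ -0.78: a single press is rejected
print(risk_averse_score(mean_policy, sd_policy))  # ~ +10.2: the 100-press policy is accepted
```

So the same risk-averse criterion rejects each press considered on its own but accepts the bundle, which is the sense in which risk aversion over policies can come apart from risk aversion over individual actions.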
Because picking a successor is like picking a policy, and risk aversion over policies can give different results than risk aversion over actions.
I was already thinking the AI would be risk averse over whole policies and the aggregate value of the futures they lead to, not locally/greedily/separately over individual actions and individual unaggregated rewards.
I’m confused about how to do that, because I tend to think of self-modification as happening when the agent is limited and can’t foresee all the consequences of a policy, especially policies that involve making itself smarter. But I suspect that even if you figure out a non-confusing way to talk about risk aversion for limited agents that doesn’t cash out in terms of actions on some level, you’ll get weird behavior under self-modification, like an update rule that privileges the probability distribution you had at the time you decided to self-modify.