Partial alignment successes seem possible.
People care about lots of things, from family to sex to aesthetics. My values don’t collapse down to any one of these.
I think AIs will learn lots of values by default. I don’t think we need all of these values to be aligned with human values. I think this is quite important.
I think the more of the AI’s values we align to care about us and make decisions in the way we want, the better. (This is vague because I haven’t yet sketched out AI internal motivations which I think would actually produce good outcomes. On my list!)
I think there are strong gains from trade possible among an agent’s values. If I care about bananas and apples, I don’t need to split my resources between the two values, I don’t need to make one successor agent for each value. I can drive to the store and buy both bananas and apples, and only pay for fuel once.
This makes it lower-cost for internal values handshakes to compromise; it’s less than 50% costly for a power-seeking value to give human-compatible values 50% weight in the reflective utility function.
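The banana-and-apples point can be made numerically. Below is a toy sketch in which every number, and the diminishing-returns payoff function, are my own assumptions (not from the post): because the fixed cost (fuel) is paid once and the payoff is concave, a value that cedes 50% of the agent's resources loses well under 50% of its attainable payoff, and both values do better than spinning up one successor agent each.

```python
# Toy numbers of my own, with a diminishing-returns payoff (sqrt of
# resources spent). Under concavity plus a shared fixed cost, 50% of the
# resources buys more than 50% of the attainable payoff.

FUEL_COST = 5      # fixed cost of one trip to the store (hypothetical)
BUDGET = 20        # total resources (hypothetical)

def payoff(resources):
    # Diminishing returns: an assumption, chosen to make the point crisp.
    return max(resources, 0) ** 0.5

# One value controlling the whole agent: pay fuel once, spend the rest.
solo = payoff(BUDGET - FUEL_COST)               # sqrt(15) ~ 3.87

# Two values sharing one agent 50/50: fuel is still paid only once.
shared_each = payoff((BUDGET - FUEL_COST) / 2)  # sqrt(7.5) ~ 2.74

# One successor agent per value: fuel is paid twice.
separate_each = payoff(BUDGET / 2 - FUEL_COST)  # sqrt(5) ~ 2.24

print(f"solo={solo:.2f}  shared={shared_each:.2f}  separate={separate_each:.2f}")
print(f"shared keeps {shared_each / solo:.0%} of the solo payoff")
```

Here ceding half the resources keeps about 71% of the solo payoff, so the 50% weight concession is less than 50% costly, as long as the payoff really is concave.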
I think there are thresholds at which the AI doesn’t care about us sufficiently strongly, and we get no value.
EG I might have an “avoid spiders” value which is narrowly contextually activated when I see spiders. But then I think this is silly because spiders are quite interesting, and so I decide to go to exposure therapy and remove this decision-influence. We don’t want human values to be outmaneuvered in this way.
More broadly, I think “value strength” is a loose abstraction which isn’t uni-dimensional. It’s not “The value is strong” or “The value is weak”; I think values are contextually activated, and so they don’t just have a global strength.
Even if you have to get the human-aligned values “perfectly right” in order to avoid Goodharting (which I am unsure of; ETA: I don’t believe this), not having to get all of the AI’s values perfectly right is good news.
I think these considerations make total alignment failures easier to prevent, because as long as human-compatible values are something the AI meaningfully cares about, we survive.
I think these considerations make total alignment success more difficult, because I expect agents to eg terminalize common instrumental values. Therefore, it’s very hard to end up with e.g. a single dominant shard of value which only cares about maximizing diamonds. I think that value is complex by default.
“Is the agent aligned?” seems to elide many of these considerations, and so I get more nervous / suspicious of such frames and lines of reasoning.
The best counterevidence for this I’m currently aware of comes from the “inescapable wedding parties” incident, where possibly a “talk about weddings” value was very widely instilled in a model.
Re: agents terminalizing instrumental values.
I anticipate there will be a hill-of-common-computations, where the x-axis is the frequency[1] of the instrumental subgoal, and the y-axis is the extent to which the instrumental goal has been terminalized.
This is because for goals which are very high in frequency, there will be little incentive for the computations responsible for achieving that goal to have self-preserving structures. It will not make sense for them to devote optimization power towards ensuring future states still require them, because future states are basically guaranteed to require them.[2]
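The hypothesized hill can be sketched with a toy model. The functional form below is entirely my invention, chosen only because it is zero at both extremes and peaks in between, matching the verbal argument: rare subgoals aren't worth caching, and near-guaranteed subgoals create no pressure toward self-preserving structure.

```python
# Purely illustrative toy model of the hill-of-common-computations.
# The product form freq * (1 - freq) is my own stand-in, not a claim
# about the true shape.

def terminalization_pressure(freq):
    """freq: how often the instrumental subgoal arises, in [0, 1].

    Low freq  -> little benefit to caching/terminalizing the computation.
    High freq -> future states are basically guaranteed to require it,
                 so self-preserving structure buys nothing.
    """
    usefulness_of_caching = freq            # rises with frequency
    self_preservation_incentive = 1 - freq  # falls as need becomes guaranteed
    return usefulness_of_caching * self_preservation_incentive

for f in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"freq={f:.2f}  pressure={terminalization_pressure(f):.3f}")
```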
An example of this for humans may be the act of balancing while standing up. If someone offered to export this kind of cognition to a machine which did it just as well as I do, I wouldn’t particularly mind. If someone also wanted to change physics in such a way that the only effect is that magic invisible fairies made sure everyone stayed balanced while trying to stand up, I don’t think I’d mind that either[3].
I’m assuming this is frequency of the goal assuming the agent isn’t optimizing to get into a state that requires that goal.
This argument also assumes the overseer isn’t otherwise selecting for self-preserving cognition, or that self-preserving cognition is the best way of achieving the relevant goal.
Except for the part where there’s magic invisible fairies in the world now. That would be cool!
I don’t know if I follow, I think computations terminalize themselves because it makes sense to cache them (e.g. don’t always model out whether dying is a good idea, just cache that it’s bad at the policy-level).
& Isn’t “balance while standing up” terminalized? Doesn’t it feel wrong to fall over, even if you’re on a big cushy surface? Feels like a cached computation to me. (Maybe that’s “don’t fall over and hurt yourself” getting cached?)
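The caching point can be given a loose computational analogy (my own construction, not the post's model): rather than re-deriving a verdict by modeling consequences each time the question arises, the agent stores the verdict itself as a policy-level rule.

```python
# A stylized analogy for "terminalization as caching": the verdict is
# computed once via the (expensive) world model, then served as a rule.

import functools

def model_based_evaluation(action):
    """Expensive path: simulate consequences and score them (stylized)."""
    simulated_outcome_values = {"touch_stove": -100, "eat_lunch": 10}
    # Imagine many steps of world-modeling standing in for this lookup.
    return simulated_outcome_values[action]

@functools.lru_cache(maxsize=None)
def cached_verdict(action):
    # After the first call per action, the stored verdict is returned
    # without any re-modeling -- it now acts like a terminal rule.
    return model_based_evaluation(action)

cached_verdict("touch_stove")       # computed via the world model (a miss)
cached_verdict("touch_stove")       # served straight from the cache (a hit)
print(cached_verdict.cache_info())  # one hit, one miss
```

The analogy is loose: `lru_cache` re-derives on a cold start, whereas the point above is that a cached policy-level verdict can outlive, and drift away from, the model-based reasoning that produced it.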