Actually, I think I disagree. Why do you think this?
Maybe it’s something like too many natural abstractions. When the number of natural abstractions is small, you can just point in the right general direction, and then regularize your way to the exact natural abstraction that’s closest. When the number of abstractions is large, or you’re trying to point to something very complicated, pointing in the right general direction isn’t enough: there will be a natural abstraction almost wherever you point, and regularization won’t move you towards anything that seems privileged to humans.
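To make that concrete, here’s a toy sketch (the dimensionality, noise level, and the nearest-neighbor stand-in for “regularize toward the closest abstraction” are all invented for illustration, not claims about real training): with a handful of well-separated candidate abstractions, a roughly aimed pointer tends to snap to the intended one, but as the candidates multiply, almost any pointer lands near some abstraction, and which one wins becomes sensitive to the exact aim.

```python
# Toy illustration: "pointing in the right general direction" as a noisy vector,
# "regularizing to the closest natural abstraction" as nearest-neighbor lookup.
import numpy as np

rng = np.random.default_rng(0)
dim, noise = 64, 0.4   # arbitrary choices for the sketch

def recovery_rate(n_abstractions, trials=500):
    # Candidate "natural abstractions" as random unit vectors.
    abstractions = rng.normal(size=(n_abstractions, dim))
    abstractions /= np.linalg.norm(abstractions, axis=1, keepdims=True)
    hits = 0
    for _ in range(trials):
        target = rng.integers(n_abstractions)
        # Point in the right general direction: the target plus sizable noise.
        pointer = abstractions[target] + noise * rng.normal(size=dim)
        # Snap to the closest abstraction (nearest neighbor by dot product).
        hits += int(np.argmax(abstractions @ pointer) == target)
    return hits / trials

for n in (3, 30, 300, 3000):
    print(f"{n:4d} abstractions: intended one recovered {recovery_rate(n):.2f} of the time")
```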
Closeness of natural abstractions also makes it easier for gradient descent to change your goals—shards are now on a continuum, rather than moated off from each other. The typical picture of value change due to stimulus is something like heroin, which hijacks the reward center in a way that we typically picture as “creating new desires related to heroin.” But if shards can be moved around by gradient descent, then you can have a different-looking kind of value change. An example might be updating a political tenet because the culture around you changes: the change is still somewhat resisted by the prior shard, but it’s hard to avoid, because each step is small and the gradient updates are a consequence of a deep part of the environment. And it doesn’t have to lead to internal disagreement; at each point in time, one’s values are just slowly changing in place.
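Here’s a toy picture of that kind of drift (purely illustrative; the one-parameter “shard position,” the resistance term, and the learning rate are made-up stand-ins): a value that is weakly anchored to where it started, but nudged by small gradient steps toward a slowly shifting environment, never experiences anything that looks like a discrete hijack, yet ends up somewhere else.

```python
# Toy sketch: gradual value drift from many tiny gradient updates, no hijack event.
theta = 0.0             # the shard's current "position" on the value continuum
anchor = theta          # where it started; the prior shard resisting change
lr, resistance = 0.05, 0.3

for step in range(401):
    culture = step / 400                      # the surrounding culture slowly moves
    # Gradient of: 0.5*(theta - culture)^2 + 0.5*resistance*(theta - anchor)^2
    grad = (theta - culture) + resistance * (theta - anchor)
    theta -= lr * grad                        # each individual update is tiny
    if step % 100 == 0:
        print(f"step {step:3d}: culture = {culture:.2f}, shard position = {theta:.2f}")
```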
So information leakage that reflects unintended optima of the actual evaluation function is bad for alignment with vanilla RL. E.g. systematic classification errors, or the evaluation not working for a few minutes when some software freezes, or systematic biases in what kinds of diamonds you’re showing it, or accidentally showing it some cubic zirconia. This is going to update its values to something with more unintended optima, although not necessarily exactly the same unintended optima as were in the reward evaluation process.
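As a toy version of that leakage story (my own construction, using supervised fitting as a loose stand-in for value updates under RL; “hardness” and “sparkle” are invented features and the leak rate is arbitrary): if the evaluation process systematically grades sparkly non-diamonds as diamonds some of the time, then the evaluator learned from those realized grades ends up scoring cubic zirconia well above dull rock, an unintended optimum inherited, imperfectly, from the evaluation errors.

```python
# Toy sketch: systematic evaluation errors get baked into what is learned from them.
import numpy as np

rng = np.random.default_rng(1)
n = 3000
hardness = rng.uniform(0, 1, n)
sparkle = rng.uniform(0, 1, n)
is_diamond = hardness > 0.8
# Leaky evaluation: sparkly non-diamonds get graded as diamonds half the time.
graded_as_diamond = is_diamond | ((sparkle > 0.8) & (rng.uniform(0, 1, n) < 0.5))

# Fit a tiny logistic "learned evaluator" to the leaky grades.
X = np.column_stack([hardness, sparkle, np.ones(n)])
y = graded_as_diamond.astype(float)
w = np.zeros(3)
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / n

def learned_score(h, s):
    return float(1 / (1 + np.exp(-(w[0] * h + w[1] * s + w[2]))))

print("real diamond (hard, sparkly):   ", round(learned_score(0.9, 0.9), 2))
print("cubic zirconia (soft, sparkly): ", round(learned_score(0.2, 0.9), 2))
print("dull rock (soft, dull):         ", round(learned_score(0.2, 0.1), 2))
```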
Even given all of this, why should reward function “robustness” be the natural solution to this? Like, what if you get your robust reward function and you’re still screwed? It’s very nonobvious that this is how you fix things.
Even given that we need on-trajectory reward “robustness” (i.e. very carefully reward all in-fact-experienced situations relating to diamonds, until the AI becomes smart enough to steer its own training), this is extremely different from a forall-across-counterfactuals robust grading guarantee.
So even given both points, I would conclude “yup, shard theory reasoning shows I can dodge an enormous robust-grading sized bullet. No dealing with ‘nearest unblocked strategy’, here!” And that was the original point of dispute, AFAICT.
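To gesture at the size of that gap with a toy sketch (illustrative only; the linear “looks like a diamond” reward and the uniform plan space are made up): the on-trajectory signal consists solely of the reward function’s outputs on states that actually get visited, whereas a forall-across-counterfactuals grading guarantee has to survive a search that is free to propose whatever input the reward function happens to score highest, including its unintended optima.

```python
# Toy contrast: on-trajectory reward signal vs. optimizing against the grader.
import numpy as np

rng = np.random.default_rng(2)

def reward_fn(state):
    hardness, sparkle = state
    # Deliberately exploitable off-distribution: it loves sparkle without bound.
    return 2.0 * hardness + 1.0 * sparkle

# 1) On-trajectory "robustness": the only numbers that ever reach training are the
#    reward function's outputs on states the agent actually experiences.
experienced = rng.uniform(0, 1, size=(1000, 2))
on_trajectory_rewards = np.array([reward_fn(s) for s in experienced])
print("rewards actually used in training:",
      on_trajectory_rewards.min().round(2), "to", on_trajectory_rewards.max().round(2))

# 2) Robust grading: the grader must score whatever a strong search proposes, so the
#    search simply finds the reward function's unintended optimum.
candidate_plans = rng.uniform(-100, 100, size=(100_000, 2))
scores = np.array([reward_fn(p) for p in candidate_plans])
best = candidate_plans[np.argmax(scores)]
print("plan found by optimizing the grader:", best.round(1), "graded", round(float(scores.max()), 1))
```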
something with more unintended optima
What do you have in mind with “unintended optima”? This phrasing seems to suggest that alignment is reasonably formulated as a global optimization problem, which I think is probably not true in the currently understood sense. But maybe that’s not what you meant?
Even given all of this, why should reward function “robustness” be the natural solution to this? Like, what if you get your robust reward function and you’re still screwed? It’s very nonobvious that this is how you fix things.
Yeah, I sorta got sucked into playing pretend here. I don’t actually have much hope for trying to pick out a concept we’d want just by pointing into a self-supervised world-model—I expect us to need to use human feedback and the AI’s self-reflectivity, which means that the AI has to want human feedback, and be able to reflect on itself, not just get pointed in the right direction in a single push. In the pretend-world where you can pick out some good “human values”-esque concept from the very start, though, it definitely seems important to defend that concept from getting updated to something else.
What do you have in mind with “unintended optima”?
Sort of like in Goodhart Ethology. In situations where humans have a good grasp on what’s going on, we can pick out some fairly unambiguous properties of good vs. bad ways the world could go. If the AI is doing search over plans, guided by some values that care about the world, then an “unintended optimum” of those values, in the sense I mean, is one that leads its search process to output plans that make the world go badly by these human-obvious standards. (And an unintended optimum of the reward function rewards trajectories that are obviously bad.)
And an unintended optimum of the reward function rewards trajectories that are obviously bad
It doesn’t seem relevant whether it’s an optimum or not. What’s relevant are the scalar reward values output on realized datapoints.
I emphasize this because the “unintended optimum” phrasing seems to reliably trigger cached thoughts around “reward functions need to be robust graders.” (I also don’t like “optimum” of values, because I think that’s really not how values work in detail, as opposed to in gloss, and “optimum” probably evokes similar thoughts around “values must be robust against adversaries.”)
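For concreteness, here’s a minimal policy-gradient sketch of that last point (a toy two-armed bandit of my own, not anyone’s proposed setup): the only way the reward function touches the parameters is through the scalar values it outputs on the actions actually sampled, so whatever it would assign to never-realized inputs simply never appears in the computation.

```python
# Toy REINFORCE update: only realized scalar rewards ever enter the gradient.
import numpy as np

rng = np.random.default_rng(3)

def reward_fn(action):
    # Scalar rewards on the two actions that can actually be realized.
    # (It could assign +1e9 to some never-available input; nothing below would notice.)
    return 1.0 if action == 1 else 0.1

logits = np.zeros(2)
for _ in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(2, p=probs)
    r = reward_fn(action)                    # a scalar on a realized datapoint
    grad_log_pi = np.eye(2)[action] - probs  # d log pi(action) / d logits
    logits += 0.05 * r * grad_log_pi         # the reward enters only here, as a scalar
probs = np.exp(logits) / np.exp(logits).sum()
print("final action probabilities:", probs.round(3))
```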