Even given all of this, why should reward function “robustness” be the natural solution to this? Like, what if you get your robust reward function and you’re still screwed? It’s very nonobvious that this is how you fix things.
Even given that we need on-trajectory reward “robustness” (i.e. very carefully rewarding all in-fact-experienced situations relating to diamonds, until the AI becomes smart enough to steer its own training), that is extremely different from a forall-across-counterfactuals robust grading guarantee.
So even given both points, I would conclude “yup, shard theory reasoning shows I can dodge an enormous robust-grading-sized bullet. No dealing with ‘nearest unblocked strategy’, here!” And that was the original point of dispute, AFAICT.
something with more unintended optima
What do you have in mind with “unintended optima”? This phrasing seems to suggest that alignment is reasonably formulated as a global optimization problem, which I think is probably not true in the currently understood sense. But maybe that’s not what you meant?
Even given all of this, why should reward function “robustness” be the natural solution to this? Like, what if you get your robust reward function and you’re still screwed? It’s very nonobvious that this is how you fix things.
Yeah, I sorta got sucked into playing pretend, here. I don’t actually have much hope for trying to pick out a concept we’d want just by pointing into a self-supervised world-model—I expect us to need to use human feedback and the AI’s self-reflectivity, which means that the AI has to want human feedback, and be able to reflect on itself, not just get pointed in the right direction in a single push. In the pretend-world where you start out able to pick out some good “human values”-esque concept from the very start, though, it definitely seems important to defend that concept from getting updated to something else.
What do you have in mind with “unintended optima”?
Sort of like in Goodhart Ethology. In situations where humans have a good grasp on what’s going on, we can pick out some fairly unambiguous properties of good vs. bad ways the world could go. If the AI is doing search over plans, guided by some values that care about the world, then what I mean by an “unintended optimum” of those values is one that leads its search process to output plans that make the world go badly according to these human-obvious standards. (And an unintended optimum of the reward function rewards trajectories that are obviously bad.)
And an unintended optimum of the reward function rewards trajectories that are obviously bad
It doesn’t seem relevant whether it’s an optimum or not. What’s relevant is the scalar reward values output on realized datapoints.
I emphasize this because “unintended optimum” phrasing seems to reliably trigger cached thoughts around “reward functions need to be robust graders.” (I also don’t like talking about an “optimum” of values, because I think that’s really not how values work in detail, as opposed to in gloss, and “optimum” probably evokes similar thoughts around “values must be robust against adversaries.”)
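To make the “realized datapoints” point concrete, here is a minimal sketch, assuming a toy 4-action bandit with a softmax policy and a plain REINFORCE estimator. The setup, the reward tables, and all names (`reinforce_gradient`, `intended_reward`, `exploitable_reward`) are hypothetical illustrations, not anything from the discussion above. Two reward functions agree on every action that was actually sampled and differ wildly on an action that never was; the resulting update is the same.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, sampled_actions, reward_fn):
    """REINFORCE estimate: mean over samples of r(a) * grad log pi(a)."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in sampled_actions:
        r = reward_fn(a)  # reward is only ever queried on realized actions
        for k in range(len(logits)):
            # d/d logit_k of log softmax(a) = 1[k == a] - pi(k)
            grad[k] += r * ((1.0 if k == a else 0.0) - probs[k])
    return [g / len(sampled_actions) for g in grad]

# Hypothetical setup: 4 actions; action 3 never shows up in the batch.
logits = [0.0, 0.0, 0.0, 0.0]
realized = [0, 1, 1, 2, 0, 1]

def intended_reward(a):
    return {0: 0.0, 1: 1.0, 2: 0.5, 3: 0.0}[a]

def exploitable_reward(a):
    # Identical on realized actions, but with a huge payoff on action 3.
    return {0: 0.0, 1: 1.0, 2: 0.5, 3: 1e6}[a]

grad_a = reinforce_gradient(logits, realized, intended_reward)
grad_b = reinforce_gradient(logits, realized, exploitable_reward)
updates_match = grad_a == grad_b
```

On this batch the two gradients are identical, since the estimator never evaluates the reward at action 3. This only illustrates the on-trajectory dependence; it says nothing about whether later sampling would ever reach that action.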