Even given all of this, why should reward function “robustness” be the natural solution to this? Like, what if you get your robust reward function and you’re still screwed? It’s very nonobvious that this is how you fix things.
Yeah, I sorta got sucked into playing pretend, here. I don’t actually have much hope for trying to pick out a concept we’d want just by pointing into a self-supervised world-model—I expect us to need to use human feedback and the AI’s self-reflectivity, which means that the AI has to want human feedback, and be able to reflect on itself, not just get pointed in the right direction in a single push. In the pretend-world where you start out able to pick out some good “human values”-esque concept from the very start, though, it definitely seems important to defend that concept from getting updated to something else.
What do you have in mind with “unintended optima”?
Sort of like in Goodhart Ethology. In situations where humans have a good grasp on what’s going on, we can pick out some fairly unambiguous properties of good vs. bad ways the world could go. If the AI is doing search over plans, guided by some values that care about the world, then what I mean by an “unintended optimum” of those values is one that leads its search process to output plans that make the world go badly by these human-obvious standards. (And an unintended optimum of the reward function rewards trajectories that are obviously bad.)
“And an unintended optimum of the reward function rewards trajectories that are obviously bad.”
Whether it’s an optimum or not doesn’t seem relevant. What’s relevant is the scalar reward values output on realized datapoints.
I emphasize this because the “unintended optimum” phrasing seems to reliably trigger cached thoughts around “reward functions need to be robust graders.” (I also don’t like “optimum” of values, because I think that’s really not how values work in detail, as opposed to in gloss, and “optimum” probably evokes similar thoughts around “values must be robust against adversaries.”)
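(A minimal sketch of the point above, assuming a REINFORCE-style policy-gradient setup; the reward function, the action space, and the choice of action 777 as the reward function’s “unintended optimum” are all illustrative inventions, not anything from this conversation. The update only ever consumes the scalar reward at actions that actually get sampled, so a high-reward optimum that never shows up as a realized datapoint exerts no influence on training.)

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 1000
theta = np.full(n_actions, -20.0)  # policy logits
theta[:10] = 0.0                   # exploration starts concentrated on actions 0-9

def reward_fn(action: int) -> float:
    # Action 777 is this reward function's global optimum (the "unintended
    # optimum" for the sake of the toy), but that fact only matters if 777
    # ever shows up as a realized datapoint during training.
    if action == 777:
        return 1000.0
    return 1.0 if action == 3 else 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.5
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(n_actions, p=probs)  # realized datapoint
    r = reward_fn(a)                    # scalar reward on that datapoint
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0               # gradient of log pi(a) w.r.t. theta
    theta += lr * r * grad_log_pi       # update sees only r at the sampled a

final = softmax(theta)
print(f"P(action 3)   = {final[3]:.3f}")    # reinforced: rewarded on-distribution
print(f"P(action 777) = {final[777]:.2e}")  # never-sampled optimum stays negligible
```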