Hmmm, I suspect that when most people say things like “the reward function should be a human-aligned objective,” they’re intending something more like “the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives,” or perhaps the far weaker claim that “the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives.”
Maybe! I think this is how Evan explicitly defined it for a time, a few years ago. I think the strong claim isn’t very plausible, and the latter claim… misdirects attention, and may be too weak. Re: attention, I think “does the agent end up aligned?” gets explained by the dataset more than by the reward function over e.g. hypothetical sentences.
I think “reward/reinforcement numbers” and “data points” are inextricably wedded. I think trying to reason about reward functions in isolation is… a warning sign.
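To make “wedded” concrete, here’s a minimal toy sketch (a two-action REINFORCE-style bandit I made up for illustration; none of the names or numbers come from any real setup): hold the reward function fixed, change only the data distribution, and you change which behaviors actually get reinforced, and so what the agent ends up learning.

```python
# Toy sketch (illustrative only): the same reward function, paired with
# two different data distributions, produces different learned policies.
import numpy as np

rng = np.random.default_rng(0)

def reward_fn(context, action):
    # Fixed reward function: action 1 is rewarded only in contexts > 0.5.
    return 1.0 if (action == 1 and context > 0.5) else 0.0

def train(context_sampler, steps=5000, lr=0.1):
    logit = 0.0  # single parameter controlling P(action 1)
    for _ in range(steps):
        context = context_sampler()
        p1 = 1.0 / (1.0 + np.exp(-logit))
        action = int(rng.random() < p1)
        r = reward_fn(context, action)
        # REINFORCE: gradient of log pi(action) w.r.t. the logit is (action - p1)
        logit += lr * r * (action - p1)
    return 1.0 / (1.0 + np.exp(-logit))

# Same reward function, two different data distributions.
p_broad  = train(lambda: rng.uniform(0.0, 1.0))  # contexts cover the rewarded region
p_narrow = train(lambda: rng.uniform(0.0, 0.4))  # rewarded region never sampled

print(f"P(action 1) with broad data:  {p_broad:.2f}")   # pushed toward action 1
print(f"P(action 1) with narrow data: {p_narrow:.2f}")  # stays near 0.5
```

In the second run the reward function is the exact same object, but it never cashes out into a single reinforcement event, so it explains nothing about what the policy becomes; the dataset does the explaining.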