If another person mentions an “outer objective/base objective” (in terms of, e.g., a reward function) to which we should align an AI, that indicates to me that their view on alignment is very different from mine. The mistake is a type error, akin to saying “My physics professor should be an understanding of physical law.” The function of a physics professor is to supply cognitive updates such that you end up understanding physical law. They are not, themselves, that understanding.
Similarly, “The reward function should be a human-aligned objective” has the same problem: the function of the reward function is to supply cognitive updates such that the agent ends up with human-aligned objectives. The reward function is not, itself, a human-aligned objective.
Hmmm, I suspect that when most people say things like “the reward function should be a human-aligned objective,” they’re intending something more like “the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives,” or perhaps the far weaker claim that “the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives.”
Maybe! I think this is how Evan explicitly defined it for a time, a few years ago. I think the strong claim isn’t very plausible, and the weaker claim… misdirects attention, and is maybe too weak. Re: attention, I think that “does the agent end up aligned?” gets explained more by the dataset than by the reward function over, e.g., hypothetical sentences.
I think “reward/reinforcement numbers” and “data points” are inextricably wedded. I think trying to reason about reward functions in isolation is… a caution sign? A warning sign?
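To make the “inextricably wedded” point concrete: in a minimal policy-gradient sketch (assuming a toy one-state bandit trained with vanilla REINFORCE, with arbitrary illustrative reward values), the reward function only ever touches the policy as a scalar multiplying the gradient of the log-probability of whichever data point the agent happened to sample. The “reinforcement numbers” and the “data points” arrive in training as a single package.

```python
# Minimal REINFORCE sketch (hypothetical toy one-state bandit; the reward
# values are arbitrary stand-ins for whatever R the designer writes down).
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
logits = np.zeros(n_actions)            # tabular softmax policy over one state


def reward(action: int) -> float:
    # Hypothetical reward function: just a lookup table here.
    return [0.0, 1.0, 0.2][action]


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = rng.choice(n_actions, p=probs)  # the "data point": sampled from the current policy
    r = reward(a)                       # the "reinforcement number": computed only on that sample
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0               # d/d(logits) of log pi(a)
    logits += lr * r * grad_log_pi      # update = reward x gradient, at the sampled data point

print("final policy:", softmax(logits).round(3))
```

The update never sees the reward function as an object, only `r = reward(a)` for the particular `a` that was sampled; change the data distribution and the “same” reward function produces a different stream of cognitive updates.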