Hmmm, I suspect that when most people say things like “the reward function should be a human-aligned objective,” they’re intending something more like “the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives,” or perhaps the far weaker claim that “the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives.”
Maybe! I think this is how Evan explicitly defined it for a time, a few years ago. I think the strong claim isn’t very plausible, and the latter claim… misdirects attention, and may be too weak. Re: attention, I think “does the agent end up aligned?” gets explained by the dataset more than by the reward function over e.g. hypothetical sentences.
I think “reward/reinforcement numbers” and “data points” are inextricably wedded. I think trying to reason about reward functions in isolation is… a warning sign.
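To make “wedded” concrete, here’s a minimal toy sketch (a two-action REINFORCE-style bandit I made up for illustration; none of the names or numbers come from any real setup): hold the reward function fixed, change only the data distribution, and you change which behaviors actually get reinforced, and so what the agent ends up learning.

```python
# Toy sketch (illustrative only): the same reward function, paired with
# two different data distributions, produces different learned policies.
import numpy as np

rng = np.random.default_rng(0)

def reward_fn(context, action):
    # Fixed reward function: action 1 is rewarded only in contexts > 0.5.
    return 1.0 if (action == 1 and context > 0.5) else 0.0

def train(context_sampler, steps=5000, lr=0.1):
    logit = 0.0  # single parameter controlling P(action 1)
    for _ in range(steps):
        context = context_sampler()
        p1 = 1.0 / (1.0 + np.exp(-logit))
        action = int(rng.random() < p1)
        r = reward_fn(context, action)
        # REINFORCE: gradient of log pi(action) w.r.t. the logit is (action - p1)
        logit += lr * r * (action - p1)
    return 1.0 / (1.0 + np.exp(-logit))

# Same reward function, two different data distributions.
p_broad  = train(lambda: rng.uniform(0.0, 1.0))  # contexts cover the rewarded region
p_narrow = train(lambda: rng.uniform(0.0, 0.4))  # rewarded region never sampled

print(f"P(action 1) with broad data:  {p_broad:.2f}")   # pushed toward action 1
print(f"P(action 1) with narrow data: {p_narrow:.2f}")  # stays near 0.5
```

In the second run the reward function is the exact same object, but it never cashes out into a single reinforcement event, so it explains nothing about what the policy becomes; the dataset does the explaining.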