I mean, I don’t understand what you mean by “previous reward functions”. RLHF is just having a “reward button” that a human can press; when to actually press it is left totally unspecified and differs between RLHF setups. It’s like the oldest idea in the book for how to train an AI, and it’s been thoroughly discussed for over a decade.
Yes, in terms of bad outcomes it’s probably better than literally hard-coding a reward function based on the inputs, but like, that’s been analyzed and discussed for a long time, and RLHF has also been feasible for a long time. (There was some engineering and ML work to be done to make reinforcement learning work well enough to make RLHF feasible for the largest modern ML systems, and I do think that work was in some sense an advance, but I don’t think it changes any of the overall dynamics of the system, and the negative effects of that work are substantial and obvious.)
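To make the contrast concrete, here’s a toy sketch (everything in it is made up for illustration: a fake environment, a tabular update, not any actual lab’s setup). The only thing that changes between “hard-coding a reward function based on the inputs” and a “reward button” is where the scalar comes from; the RL machinery around it is the same either way.

```python
import random

def hardcoded_reward(state, action):
    """'Previous'-style reward: a fixed, scripted function of the inputs
    (a toy stand-in for something like a game score)."""
    return 1.0 if action == state else 0.0

def reward_button(state, action):
    """RLHF-style reward: whatever signal a human chooses to give.
    When to press the button is left entirely unspecified by the method itself."""
    answer = input(f"state={state}, action={action} -- reward? [y/n] ")
    return 1.0 if answer.strip().lower() == "y" else 0.0

def rl_loop(reward_fn, steps=5):
    """The surrounding RL machinery is identical no matter which reward is plugged in."""
    q = {}  # toy action-value table
    for _ in range(steps):
        state, action = random.randint(0, 2), random.randint(0, 2)
        r = reward_fn(state, action)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + 0.1 * (r - old)
    return q

# rl_loop(hardcoded_reward)  # scripted reward, no human in the loop
# rl_loop(reward_button)     # "reward button": a human supplies the signal
```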
Contrast this with debate, which I think one could count as progress and which feels like a real thing to me. I think it’s not solving a huge part of the problem, but I have much less of a “what the hell are you talking about when you call this an advance” reaction to debate than I do to RLHF.
I don’t understand what you mean by “previous reward functions”.
I can’t tell if you’re being uncharitable or if there’s a way bigger inferential gap than I think, but I do literally just mean… reward functions used previously. Like, people did reinforcement learning before RLHF. They used reward functions for StarCraft and for Go and for Atari and for all sorts of random other things. In more complex environments, they used curiosity and empowerment reward functions. And none of these are the type of reward function that would withstand much optimization pressure (except insofar as they only applied to domains simple enough that it’s hard to actually achieve “bad outcomes”).
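For concreteness, here are toy versions of the two kinds of reward function I’m pointing at (both entirely made up for illustration): a scripted score reward of the Atari/Go/StarCraft flavour, and a curiosity-style bonus that pays out prediction error. Neither involves a human judging outputs, which is the sense in which they’re “previous” relative to RLHF.

```python
def score_reward(prev_score, new_score):
    """Game-style reward: the change in a scripted score."""
    return new_score - prev_score

class CuriosityReward:
    """Curiosity-style reward: a bonus proportional to how badly the agent
    predicted what it just observed. The 'predictor' here is just a running
    mean per state, which is enough to show the shape of the idea:
    reward = prediction error, shrinking as states become familiar."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.predictions = {}

    def __call__(self, state, observation):
        predicted = self.predictions.get(state, 0.0)
        bonus = abs(observation - predicted)
        # update the predictor so familiar states stop paying out
        self.predictions[state] = predicted + self.lr * (observation - predicted)
        return bonus

curiosity = CuriosityReward()
print(curiosity("room_a", 1.0))  # novel state -> large bonus (1.0)
print(curiosity("room_a", 1.0))  # already seen -> bonus shrinks (0.5)
```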
But I mean, people have used handcrafted rewards since forever. The human-feedback part of RLHF is nothing new. It’s as old as all the handcrafted reward functions you mentioned (as evidenced by Eliezer referencing a reward button in this 10-year-old comment, and even back then the idea of a human-feedback-driven reward was nothing new), so I don’t understand what you mean by “previous”.
If you said “other” I would understand, since there are definitely many different ways to structure reward functions, but I do feel kind of aggressively gaslit by a bunch of people who keep trying to frame RLHF as some kind of novel advance when it’s literally just the most straightforward application of reinforcement learning I can imagine (like, I think it really is more obvious, and was explored earlier, than basically any other way of training an AI system I can think of, since it is the standard way we do animal training).
The term “reinforcement learning” literally has its origin in animal training, where approximately all we do is what you would call modern RLHF (having a reward button, or a punishment button, usually in the form of food or prior operant conditioning). It’s literally the oldest idea in the book of reinforcement learning. There are no “previous” reward functions; it’s literally one of the very first classes of reward functions we considered.