When people started discussing the outer alignment problem 10 years ago (though under slightly different names), reinforcement learning was the classic example used to demonstrate why the outer alignment problem is a problem in the first place.
Got any sources for this? Feels pretty different if the problem was framed as “we can’t write down a reward function which captures human values” versus “we can’t specify rewards correctly in any way”. And in general it’s surprisingly tough to track down the places where Yudkowsky (or others?) said all these things.
The complex value paper is the obvious one, which, as the name suggests, talks about the complexity of value as one of the primary drivers of the outer alignment problem:
Suppose an AI with a video camera is trained to classify its sensory percepts into positive and negative instances of a certain concept, a concept which the unwary might label “HAPPINESS” but which we would be much wiser to give a neutral name like G0034 (McDermott 1976). The AI is presented with a smiling man, a cat, a frowning woman, a smiling woman, and a snow-topped mountain; of these instances 1 and 4 are classified positive, and instances 2, 3, and 5 are classified negative. Even given a million training cases of this type, if the test case of a tiny molecular smiley-face does not appear in the training data, it is by no means trivial to assume that the inductively simplest boundary around all the training cases classified “positive” will exclude every possible tiny molecular smiley-face that the AI can potentially engineer to satisfy its utility function. And of course, even if all tiny molecular smiley-faces and nanometer-scale dolls of brightly smiling humans were somehow excluded, the end result of such a utility function is for the AI to tile the galaxy with as many “smiling human faces” as a given amount of matter can be processed to yield.
Eliezer isn’t talking strictly about a reinforcement learning setup (it’s more of a classification setup), but I think it comes out to the same thing. Hibbard was suggesting that you learn human values by basically doing reinforcement learning on classifications of smiling humans (an approach that strikes me as approximately as robust as RLHF), with Eliezer responding that in the limit this really doesn’t give you the thing you want.
In Robby’s follow-up post “The Genie knows but doesn’t care”, Eliezer says in the top comment:
Remark: A very great cause for concern is the number of flawed design proposals which appear to operate well while the AI is in subhuman mode, especially if you don’t think it a cause for concern that the AI’s ‘mistakes’ occasionally need to be ‘corrected’, while giving the AI an instrumental motive to conceal its divergence from you in the close-to-human domain and causing the AI to kill you in the superhuman domain. E.g. the reward button which works pretty well so long as the AI can’t outwit you, later gives the AI an instrumental motive to claim that, yes, your pressing the button in association with moral actions reinforced it to be moral and had it grow up to be human just like your theory claimed, and still later the SI transforms all available matter into reward-button circuitry.
Cool, makes sense.
I don’t see any principled distinction between RLHF and other standard reinforcement-learning setups.
I think we disagree on how “principled” a method needs to be in order to constitute progress. RLHF gives rewards which can withstand more optimization before producing unintended outcomes than previous reward functions. Insofar as that’s a key metric we care about, it counts as progress. I’d guess we’d both agree that better RLHF and also techniques like debate will further increase the amount of optimization our reward functions can withstand, and then the main crux is whether that’s anywhere near the ballpark of the amount of optimization they’ll need to withstand in order to automate most alignment research.
I mean, I don’t understand what you mean by “previous reward functions”. RLHF is just having a “reward button” that a human can press, with when to actually press the reward button being left totally unspecified and differing between different RLHF setups. It’s like the oldest idea in the book for how to train an AI, and it’s been thoroughly discussed for over a decade.
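To spell out what I mean by “reward button”: in the most stripped-down form, the reward signal is just whatever a human hands back at each step, and when to hand it back is left entirely to the human. A minimal toy sketch (plain Python; `env` and `agent` are hypothetical stand-ins, and real RLHF pipelines typically train a reward model from human comparisons rather than querying a human every step):

```python
# Toy sketch of "reward button" RL: the reward is whatever a human says it is.
# `env` and `agent` are hypothetical stand-ins for some environment and policy.

def human_reward_button(observation, action) -> float:
    """The 'reward function' is just a human judgment, entered by hand."""
    print(f"Agent saw {observation!r} and did {action!r}.")
    return float(input("Reward for this step (e.g. -1, 0, 1): "))

def run_episode(env, agent, max_steps=100):
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        next_obs, done = env.step(action)          # note: no environment-provided reward
        reward = human_reward_button(obs, action)  # the human is the reward function
        agent.update(obs, action, reward, next_obs)
        obs = next_obs
        if done:
            break
```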
Yes, it’s probably better than literally hard-coding a reward function based on the inputs in terms of bad outcomes, but like, that’s been analyzed and discussed for a long time, and RLHF has also been feasible for a long time. (There was some engineering and ML work to be done to make reinforcement learning work well enough for RLHF to be feasible in the context of the largest modern systems, and I do think that work was in some sense an advance, but I don’t think it changes any of the overall dynamics of the system, and the negative effects of that work are substantial and obvious.)
This is in contrast to debate, which I think one could count as progress, and which feels like a real thing to me. I think it’s not solving a huge part of the problem, but I have much less of a “what the hell are you talking about when you say this is ‘an advance’” reaction when it comes to debate than I do with RLHF.
I don’t understand what you mean by “previous reward functions”.
I can’t tell if you’re being uncharitable or if there’s a way bigger inferential gap than I think, but I do literally just mean… reward functions used previously. Like, people did reinforcement learning before RLHF. They used reward functions for StarCraft and for Go and for Atari and for all sorts of random other things. In more complex environments, they used curiosity and empowerment reward functions. And none of these are the type of reward function that would withstand much optimization pressure (except insofar as they only applied to domains simple enough that it’s hard to actually achieve “bad outcomes”).
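Concretely, by “previous reward functions” I mean things of roughly this shape (a toy Python sketch; the function names are made up, but these are the standard hard-coded score reward and curiosity-style prediction-error bonus):

```python
import numpy as np

# Hand-coded game reward of the kind used for Atari/Go/StarCraft-style setups:
# just the change in score (or win/loss at the end of the game).
def score_reward(prev_score: float, new_score: float) -> float:
    return new_score - prev_score

# Curiosity-style intrinsic reward: a bonus proportional to how badly a learned
# dynamics model predicted the next observation (prediction error as "novelty").
def curiosity_reward(predicted_next_obs: np.ndarray, actual_next_obs: np.ndarray) -> float:
    return float(np.mean((predicted_next_obs - actual_next_obs) ** 2))
```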
But I mean, people have used handcrafted rewards since forever. The human-feedback part of RLHF is nothing new. It’s as old as all the handcrafted reward functions you mentioned (as evidenced by Eliezer referencing a reward button in this 10-year-old comment, and even back then the idea of a human-feedback-driven reward was nothing new), so I don’t understand what you mean by “previous”.
If you say “other” I would understand this, since there are definitely many different ways to structure reward functions, but I do feel kind of aggressively gaslit by a bunch of people who keep trying to frame RLHF as some kind of novel advance when it’s literally just the most straightforward application of reinforcement learning that I can imagine (like, I think it really is more obvious, and was explored earlier, than basically any other way of training an AI system I can think of, since it is the standard way we do animal training).
The term “reinforcement learning” literally has its origin in animal training, where approximately all we do is whatever you would call modern RLHF (having a reward button, or a punishment button, usually in the form of food or via prior operant conditioning). It’s literally the oldest idea in the book of reinforcement learning. There are no “previous” reward functions; it’s one of the very first classes of reward functions we considered.
Do you have a link for that please?