I’d guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research in which people pay lip service to alignment. In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as “alignment”.
Calling work you disagree with “lip service” seems wrong and unhelpful.
There are plenty of ML researchers who think that they are doing real work on alignment and that your research is useless. They could choose to describe the situation by saying that you aren’t actually doing alignment research. But I think it would be more accurate and helpful if they were to instead say that you are both working on alignment but have big disagreements about what kind of research is likely to be useful.
(To be clear, plenty of folks also think that my work is useless.)
I definitely do not use “lip service” as a generic term for alignment research I disagree with. I think you-two-years-ago were on a wrong track with HCH, but you were clearly aiming to solve alignment. Same with lots of other researchers today—I disagree with the approaches of most people in the field, but I do not accuse them of not actually doing alignment research.
No, this accusation is specifically for things like RLHF (which is very obviously not even trying to solve any of the problems which could plausibly kill us), and for things like “AI ethics” work (which is very obviously not even attempting to solve the extinction problem). In general, the work has to be not even trying to solve a problem which could kill us in order for me to make that sort of accusation.
If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service. I’d think they were pretty stupid about their strategy, but hey, it’s alignment, lots of us think each other are being stupid about strategy.
What I actually think is that they saw something that would let them do cool high-status high-paying ML work, while being nominally vaguely related to alignment, and decided to do that without actually stopping to think about questions like “Is this actually going to decrease humanity’s chance of extinction?”. And then later on they made up a story about how the work was helpful for alignment, because that’s the sort of rationalization humans do all the time. Standard Bottom Line failure.
I take this comment as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is. I think this is an ITT which a lot of people in the broader LW cluster would fail. I think the basic mistake that’s being made here is failing to recognize that reality doesn’t grade on a curve when it comes to understanding the world—your arguments can be false even if nobody has refuted them. That’s particularly true when it comes to very high-level abstractions, like the ones this field is built around (and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment).
Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that’s fine; this might be necessary, and so it’s good to have some people pushing in this direction. But it seems like a bunch of people around here don’t just ignore the skulls: they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.
I think it’s possible to criticise work on RLHF while taking seriously the possibility that empirical work on our biggest models is necessary for solving alignment. But criticisms like this one seem to showcase a kind of blind spot. I’d be more charitable if people in the LW cluster had actually tried to write up the arguments for things like “why inner misalignment is so inevitable”. But in general people have put shockingly little effort into doing so, with almost nobody trying to tackle this rigorously. E.g. I was surprised when my debates with Eliezer involved him still using all the same intuition pumps as he did in the sequences, because to me the obvious thing to do over the next decade is to flesh out the underlying mental models of the key issue, which would then allow you to find high-level intuition pumps that are both more persuasive and more trustworthy.
I’m more careful than John about throwing around aspersions on which people are “actually trying” to solve problems. But it sure seems to me that blithely trusting your own intuitions because you personally can’t imagine how they might be wrong is one way of not actually trying to solve hard problems.
Comments on parts of this other than the ITT thing (response to the ITT part is here)...
(and in particular the abstraction which it seems John is using, where making progress on outer alignment makes almost no difference to inner alignment)
I don’t usually focus much on the outer/inner abstraction, and when I do I usually worry about outer alignment. I consider RLHF to have been negative progress on outer alignment, same as inner alignment; I wasn’t relying on that particular abstraction at all.
Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). [...]
I think your model here completely fails to predict Descartes, Laplace, Von Neumann & Morgenstern, Shannon, Jaynes, Pearl, and probably many others. Basically all of the people who’ve successfully made exactly the sort of conceptual advances we aim for in agent foundations.
But it is a model under which one could try to make a case for RLHF.
I still do not think that the team doing RLHF work at OpenAI actually thought about whether this model makes RLHF decrease the chance of human extinction, and deliberated on that in a way which could plausibly have resulted in the project not happening. But I have made that claim maximally easy to falsify if I’m wrong.
I’d be more charitable if people in the LW cluster had actually tried to write up the arguments for things like “why inner misalignment is so inevitable”.
Speaking for myself, I don’t think inner misalignment is clearly inevitable. I do think outer misalignment is much more clearly inevitable, and I do think inner misalignment is not plausibly so unlikely that we can afford to ignore the possibility. Similar to this comment: I’m pretty sympathetic to the view that powerful deceptive inner agents are unlikely, but charging ahead assuming that they will not happen is idiotic given the stakes.
A piece which I think is missing from this thread thus far: in order for RLHF to decrease the chance of human extinction, there has to first be some world in which humans go extinct from AI. By and large, it seems like people who think RLHF is useful are mostly also people who think we’re unlikely to die of AI, and that’s not a coincidence: worlds in which the iterative-incremental-empiricism approach suffices for alignment are worlds where we’re unlikely to die in the first place. Humans are good at iterative incremental empiricism. The worlds in which we die are worlds in which that approach is fundamentally flawed for some reason (usually because we are unable to see the problems).
Thus the wording of this claim I made upthread:
If someone on the OpenAI team which worked on RLHF thought humanity had a decent (not necessarily large) chance of going extinct from AI, and they honestly thought implementing and popularizing RLHF made that chance go down, and they chose to work on RLHF because of that, then I would say I was wrong to accuse them of merely paying lip service.
In order for work on RLHF to reduce the chance of humanity going extinct from AI, it has to help in one of the worlds where we otherwise go extinct, not in one of the worlds where alignment by default kicks in and we would probably have been fine anyway.
(In case it was not obvious: I am definitely not saying that one must assign high P(doom) to do actual alignment work. I am saying that one must have some idea of worlds in which we’re actually likely to die.)
I don’t want to pour a ton of effort into this, but here’s my 5-paragraph ITT attempt.
“As an analogy for alignment, consider processor manufacturing. We didn’t get to gigahertz clock speed and ten nanometer feature size by trying to tackle all the problems of 10 nm manufacturing processes right out the gate. That would never have worked; too many things independently go wrong to isolate and solve them all without iteration. We can’t get many useful bits out of empirical feedback if the result is always failure, and always for a long list of reasons.
And of course, if you know anything about modern fabs, you know there’d have been no hope whatsoever of identifying all the key problems in advance just based on theory. (Side note: I remember a good post or thread from the past year on crazy shit fabs need to do, but can’t find it; anyone remember that and have a link?)
The way we actually did it was to start with gigantic millimeter-size features, which were relatively easy to manufacture. And then we scaled down slowly. At each new size, new problems came up, but those problems came up just a few at a time as we only scaled down a little bit at each step. We could carry over most of our insights from earlier stages, and isolate new problems empirically.
The analogy, in AI, is to slowly ramp up the capabilities/size/optimization pressure of our systems. Start with low capability, and use whatever simple tricks will help in that regime. Then slowly ramp up, see what new problems come up at each stage, just like we did for chip manufacturing. And to complete the analogy: just like with chips, at each step we can use the products of the previous step to help design the next step.
That’s the sort of plan which has a track record of actually handling the messiness of reality, even when scaling things over many orders of magnitude.”
There, let me know how plausible that was as an ITT attempt for “people who have different views [than I do] about how valuable incremental empiricism is”.
Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there’s probably some additional argument that people would make about why this isn’t just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)
I think people who value empirical alignment work now probably think that (to some extent) we can predict at a high level what future problems we might face (contrasting with “there’d have been no hope whatsoever of identifying all the key problems in advance just based on theory”). Obviously this is a spectrum, but I think the chip fab analogy is further towards people believing there are unknown unknowns in the problem space than people at OpenAI are (e.g. OpenAI people possibly think outer alignment and inner alignment capture all of the kinds of problems we’ll face).
However, they probably don’t believe you can work on solutions to those problems without being able to empirically demonstrate those problems and hence iterate on them (and again one could probably appeal to a track record here of most proposed solutions to problems not working unless they were developed by iterating on the actual problem). We can maybe vaguely postulate what the solutions could look like (they would say), but it’s going to be much better to try and actually implement solutions on versions of the problem we can demonstrate, and iterate from there. (Note that they probably also try to produce demonstrations of the problems so that they can then work on those solutions, but this is still all empirical.)
Otherwise I do think your ITT does seem reasonable to me, although I don’t think I’d put myself in the class of people you’re trying to ITT, so that’s not much evidence.
I am confused. How does RLHF help with outer alignment? Isn’t optimizing for human approval the classic outer-alignment problem? (e.g. tiling the universe with smiling faces)
I don’t think the argument for RLHF runs through outer alignment. I think it has to run through using it as a lens to study how models generalize, and eliciting misalignment (i.e. the points about empirical data that you mentioned; I just don’t understand where the inner/outer alignment distinction comes from in this context).
RLHF helps with outer alignment because it leads to rewards which more accurately reflect human preferences than the hard-coded reward functions (including the classic specification gaming examples, but also intrinsic motivation functions like curiosity and empowerment) which are used to train agents in the absence of RLHF.
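As a concrete illustration of the contrast being drawn here, below is a minimal sketch (not anyone’s actual training code; the toy data, model, and hyperparameters are all invented for the example) of fitting an RLHF-style reward model from pairwise human preference labels, next to a hand-coded proxy reward.

```python
# Minimal sketch: a learned reward model fit to pairwise human preference
# labels (the RLHF setup), contrasted with a hand-coded proxy reward.
# Everything here -- data, shapes, hyperparameters -- is invented for illustration.
import torch
import torch.nn as nn


def hand_coded_reward(features: torch.Tensor) -> torch.Tensor:
    # A hand-written proxy, e.g. "more of feature 0 is better": easy to specify,
    # but it encodes the designer's guess rather than observed human preferences.
    return features[:, 0]


class RewardModel(nn.Module):
    # Maps a (toy) feature vector describing a response to a scalar score.
    # In real RLHF this would be a fine-tuned language model, not a tiny MLP.
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry-style objective: the response the labeler chose should
    # score higher than the one they rejected.
    return -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()


torch.manual_seed(0)
dim = 4
# Toy "dataset": feature vectors for preferred vs. rejected responses.
chosen = torch.randn(256, dim) + 0.5
rejected = torch.randn(256, dim) - 0.5

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned reward now reflects whatever the labels rewarded -- including any
# systematic labeler error -- rather than an explicit hand-written criterion.
print("hand-coded reward on a few samples:", hand_coded_reward(chosen[:3]))
print("learned reward on the same samples:", model(chosen[:3]).detach())
```

The contrast only shows that the learned scores track whatever the labels rewarded, including systematic labeler error, rather than a designer’s hand-written guess; whether those labels track anything safety-relevant is exactly what is disputed in this thread.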
The smiley faces example feels confusing as a “classic” outer alignment problem because AGIs won’t be trained on a reward function anywhere near as limited as smiley faces. An alternative like “AGIs are trained on a reward function in which all behavior on a wide range of tasks is classified by humans as good or bad” feels more realistic, but also lacks the intuitive force of the smiley face example—it’s much less clear in this example why generalization will go badly, given the breadth of the data collected.
I think the smiling example is much more analogous than you are making it out here. I think the basic argument for “this just encourages taking control of the reward” or “this just encourages deception” goes through the same way.
Like, RLHF is not some magical “we have definitely figured out whether a behavior is really good or bad” signal; it’s historically been just some contractors thinking for like a minute about whether a thing is fine. I don’t think there is less Bayesian evidence conveyed by people smiling (like, the variance in smiling is greater than the variance in RLHF approval, and so the amount of information conveyed is actually more), so I don’t buy that RLHF conveys more about human preferences in any meaningful way.
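As an aside on the “amount of Bayesian evidence” framing: it can be made quantitative with a toy calculation (the error rates below are invented, and nothing here adjudicates whether smiles or contractor ratings are noisier). Modeling a hasty approve/disapprove judgment as a binary symmetric channel, each label carries at most one bit, and far less once labeler error is nontrivial.

```python
# Toy calculation: how much information a noisy binary approval label carries
# about the underlying "is this behavior actually fine?" variable.
# The error rates are invented for illustration; nothing here is measured data.
import math


def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def bits_per_label(error_rate: float) -> float:
    # Mutual information between true quality and the label, modeling the
    # labeler as a binary symmetric channel with a uniform prior over quality.
    return 1.0 - binary_entropy(error_rate)


for err in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"labeler error rate {err:.0%}: about {bits_per_label(err):.2f} bits per label")
```

On this toy model a labeler who is wrong 30% of the time conveys only about 0.12 bits per judgment; the disagreement above is, in effect, about which approval signal sits where on this curve.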
RLHF (which is very obviously not even trying to solve any of the problems which could plausibly kill us)
Sorry for being dumb, but I thought the naïve case for RLHF is that it helps solve the problem of “people are very bad at manually writing down an explicit utility or reward function that does what they intuitively want”? Does that not count as one of the lethal problems (even if RLHF alone would still get us killed because of the other problems)? If one of the other problems is Goodharting/unforeseen maxima, it seems like RLHF could be helpful insofar as, if RLHF rewards are quantitatively less misaligned than hand-coded rewards, you can get away with optimizing them harder before they kill you?
That is a reasonable case, with the obvious catch that you don’t know how hard you can optimize before it goes wrong, and when it does go wrong you’re less likely to notice than with a hand-coded utility/reward.
But I expect the people who work on RLHF do not expect an explicit utility/reward to be a problem which actually kills us, because they’d expect visible failures before it gets to the capability level of killing us. RLHF makes those visible failures less likely. Under that frame, it’s the lack of a warning shot which kills us.
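The “you don’t know how hard you can optimize before it goes wrong” point is essentially a Goodhart’s-law claim, and a toy simulation can make it concrete (the utility shapes and misalignment sizes below are invented for illustration, not taken from anyone’s actual argument): optimizing a slightly-off proxy helps for a while and then starts hurting, and a better proxy mainly moves the turning point rather than removing it.

```python
# Toy Goodhart illustration: optimizing a misaligned proxy raises true utility
# up to a point, then starts reducing it. All shapes and numbers are invented.
import numpy as np


def true_utility(theta: np.ndarray) -> float:
    # True objective: peaked at the origin.
    return float(-np.sum(theta ** 2))


def optimize_proxy(offset: np.ndarray, steps: int, lr: float = 0.05):
    # Proxy objective: same shape as the true one, but peaked at `offset`.
    # A larger offset = a more misaligned proxy (say, a crude hand-coded reward);
    # a smaller offset = a better proxy (say, a decent learned reward model).
    theta = np.full_like(offset, -3.0)   # start far from both optima
    history = []
    for _ in range(steps):
        grad = -2 * (theta - offset)     # gradient of the proxy
        theta = theta + lr * grad        # more steps = more optimization pressure
        history.append(true_utility(theta))
    return history


for name, offset in [("badly misaligned proxy ", np.array([1.5, 1.5])),
                     ("mildly misaligned proxy", np.array([0.3, 0.3]))]:
    hist = optimize_proxy(offset, steps=200)
    peak = int(np.argmax(hist))
    print(f"{name}: true utility peaks at step {peak:3d} "
          f"(value {hist[peak]:6.2f}), ends at {hist[-1]:6.2f}")
```

On this picture a less-misaligned reward really does buy more safe optimization pressure, which is the naïve case above; the catch in the parent comment is that nothing inside the proxy tells you when you have passed the peak.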
when it does go wrong you’re less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely
Because it incentivizes learning human models which can then be used to be more competently deceptive, or just because once you’ve fixed the problems you know how to notice, what’s left are the ones you don’t know how to notice? The latter doesn’t seem specific to RLHF (you’d have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.
The problem isn’t just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. It applies even at low capabilities.
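The “selects for whatever hides bad behavior, even by accident” claim can likewise be illustrated with a toy selection model (all numbers invented): if the training signal is “a human approved”, and approval can only depend on problems the human actually sees, then selecting on that signal favors low visibility just as readily as low badness.

```python
# Toy model: selecting policies on "a human approved" when approval can only
# respond to *visible* problems. All numbers are invented for illustration.
import random

random.seed(0)


def make_policy():
    # Each candidate "policy" has a true rate of bad behavior, and a probability
    # that any given instance of bad behavior is visible to the human rater.
    return {"bad_rate": random.random(), "visibility": random.random()}


def approval(policy, n_episodes: int = 50) -> float:
    # The rater disapproves only when bad behavior happens AND is visible.
    approved = 0
    for _ in range(n_episodes):
        bad = random.random() < policy["bad_rate"]
        seen = bad and random.random() < policy["visibility"]
        approved += 0 if seen else 1
    return approved / n_episodes


def mean(values):
    values = list(values)
    return sum(values) / len(values)


population = [make_policy() for _ in range(2000)]
selected = sorted(population, key=approval, reverse=True)[:100]

print("mean bad_rate:   all =", round(mean(p["bad_rate"] for p in population), 2),
      " selected =", round(mean(p["bad_rate"] for p in selected), 2))
print("mean visibility: all =", round(mean(p["visibility"] for p in population), 2),
      " selected =", round(mean(p["visibility"] for p in selected), 2))
```

No deliberate deception appears anywhere in this toy model; any trait that keeps problems out of the rater’s view gets reinforced right alongside genuinely better behavior, which is the sense in which the selection pressure applies even at low capabilities.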
This is testable by asking someone from OpenAI things like:
how the decision to work on RLHF was made: how many hours were spent on it, who was in charge
their models under which RLHF is good and bad for humanity
FWIW, I personally know some of the people involved pretty well since ~2015, and I think you are wrong about their motivations.
That is plausible; I have made my position here very easy to falsify if I’m wrong.
How? E.g. Jacob left a comment here about his motivations, does that count as a falsification? Or, if you’d say that this is an example of rationalization, then what would the comment need to look like in order to falsify your claim? Does Paul’s comment here mentioning the discussions that took place before launching the GPT-3 work count as a falsification? if not, why not?
Jacob’s comment does not count, since it’s not addressing the “actually consider whether the project will net decrease chance of extinction” or the “could the answer have plausibly been ‘no’ and then the project would not have happened” part.
Paul’s comment does address both of those, especially this part at the end:
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn’t have happened.
That does indeed falsify my position, and I have updated the top-level comment accordingly. Thank you for the information.
I think Jacob (OP) said “OpenAI is trying to directly build safe AGI” and cited the charter and other statements as evidence of this claim. Then John replied that the charter and other statements are “not much evidence” either for or against this claim, because talk is cheap. I think that’s a reasonable point.
Separately, maybe John in fact believes that the charter and other statements are insincere lip service. If so, I would agree with you (Paul) that John’s belief is probably incorrect, based on my very limited knowledge. [Where I disagree with OpenAI, I presume that top leadership is acting sincerely to make a good future with safe AGI, but that they have mistaken beliefs about the hardness of alignment and other topics.]
I was replying to:
Thanks, sorry for misunderstanding.