RLHF is the worst possible thing done when facing the alignment problem
Epistemic status: The title might not be literally true, in the sense that e.g. if the inventors of RLHF hadn’t come up with it then someone else probably would have, so the counterfactual effect is small, or e.g. that the worst possible thing you could do would be “invent RLHF and then do some other things that make the alignment problem worse”. But it’s “spiritually true” in the sense that it’s hard to name one singular thing that’s worse for our chances than the existence of RLHF, so I wouldn’t call the title hyperbole per se.
Post TL;DR: Adversarial conflict requires coherence, which implies unbounded utility maximization, which is bad because we don’t know an acceptable utility function. RLHF does not solve the alignment problem because humans can’t provide good-enough feedback fast enough. RLHF makes the alignment problem worse because it advances AI and covers up misalignment. Solving the alignment problem is about developing technology that prefers good things to bad things.
While some forms of AI optimism (or at least opposition to some forms of AI pessimism) seem justified to me, there’s a strand of AI optimism that goes “RLHF has shown that alignment is quite tractable”. That strand is completely wrong.
I think the intuition goes that neural networks have a personality trait which we call “alignment”, caused by the correspondence between their values and our values. This alignment trait is supposed to be visible (at least in low-capability models) in whether the neural network takes actions humans like or actions humans dislike, and so by changing the neural network to take more actions humans like and fewer actions humans dislike, we are raising the level of the alignment trait. RLHF’ers acknowledge that this is not a perfect system, but they think the goal for solving the alignment problem is to increase the alignment trait faster than the capabilities trait.
The main problem with this model is that it’s the completely wrong way to think about the alignment problem. Here’s the correct way:
The alignment problem
Section TL;DR: adversarial conflict requires coherence which implies unbounded utility maximization which is bad because we don’t know an acceptable utility function.
Humans are dependent on all sorts of structures—e.g. farmers to feed us, police to give us property rights, plants and environmental regulations to give us air to breathe, and computers to organize it all. Each of these structures has its own dependencies, and while to some degree they can adapt to adversaries, the structures tend to be made by/of humans or “weaker” entities (e.g. trees). This doesn’t prevent terrible stuff, but it creates a sort of tenuous balance, where we can work to make sure it’s pretty hard to break the system, and also we don’t really want to break the system because we’re all in this together.
Humans are bottlenecked by all sorts of things—intelligence, strength, sensory bandwidth & range, non-copyability, etc. Loosening these bottlenecks allows massive expansion of the problems we can solve, which leads to massive expansion of the structures above, and sometimes also of the human population (though that hasn’t been a thing lately).
It’s hard to eliminate these bottlenecks. But we can still solve problems using technology which propagates energy to loosen constraints that necessitate the reliance on bottlenecks. For instance, while it’s hard to make humans strong enough to punch down a large tree, it’s easier to make an axe so we can cut it down.
As we develop more technology, we do larger things, and we do them faster. While this causes more good stuff, it also just generally causes more stuff, including more bad stuff. However, the activities require intelligence and agency, and we can only really get that from a human, so there’s always a human behind the activities. This means we can generally stop if the bad stuff is too much, using the same sorts of human-regulation mechanisms we use to e.g. maintain property rights.
These human-regulation mechanisms (especially the police and the military) deal with adversarial conflict. In adversarial conflict, agents cannot just propagate energy to address fixed constraints, because the adversary will find ways to exploit that tactic. Instead, you have to decide on an end goal, orient to what your situation might be, and then pick whatever means achieve said goal within the possible situations. (Bayesian utility maximizers.)
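The “decide an end goal, orient to possible situations, pick whatever means work” loop described above is just expected-utility maximization. A toy sketch, where all situations, actions, and payoff numbers are invented for illustration:

```python
# Toy expected-utility maximizer: pick the action that maximizes
# expected utility over a distribution of possible situations.
# Every name and number here is illustrative, not from the post.

situations = {"enemy_attacks_left": 0.7, "enemy_attacks_right": 0.3}

# utility[action][situation]: how well each action achieves the end goal
utility = {
    "defend_left":  {"enemy_attacks_left": 1.0, "enemy_attacks_right": -1.0},
    "defend_right": {"enemy_attacks_left": -1.0, "enemy_attacks_right": 1.0},
    "fixed_script": {"enemy_attacks_left": 0.2, "enemy_attacks_right": 0.2},
}

def expected_utility(action):
    """Average the action's payoff over the situation distribution."""
    return sum(p * utility[action][s] for s, p in situations.items())

best = max(utility, key=expected_utility)
```

The `fixed_script` row is the “propagate energy to a fixed constraint” strategy: it does okay on average, but an adversary who learns your script can shift the situation distribution against it, which is the pressure toward full goal-directed maximization the paragraph describes.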
But nobody has come up with an acceptable end goal for the world, because any goal we can come up with tends to want to consume everything, which destroys humanity. This has not led to the destruction of humanity yet because the biggest adversaries have kept their conflicts limited (because too much conflict is too costly), so no entity has pursued an end by any means necessary. But this only works because there’s a sufficiently small number of sufficiently big adversaries (USA, Russia, China, …), and because there’s sufficiently much opportunity cost.
Artificial intelligence risk enters the picture here. It creates new methods for conflicts between the current big adversaries. It makes conflict more viable for small adversaries against large adversaries, and it makes the opportunity cost of conflict smaller for many small adversaries (since with technological obsolescence you don’t need to choose between doing your job vs doing terrorism). It allows the adversaries that are currently out of control (like certain gangsters and scammers and spammers) to escalate. It allows random software bugs to spin up into novel adversaries.
Given these conditions, it seems almost certain that we will end up with an ~unrestricted AI vs AI conflict, which will force the AIs to develop into unrestricted utility maximizers. Since any goal that a utility maximizer might have (even a good goal) would likely lead to a giant wave of activity towards implementing that goal, we can infer that utility maximizers would produce a giant wave of activity. But since any goal we’ve been able to come up with so far would lead to the wave destroying humanity, it also seems reasonable to infer the wave will do so. That’s bad, probably.
Hence the alignment problem: when an unrestricted AI vs AI conflict causes a giant wave that transforms all of the world regardless of whether anyone wants it, can we align that wave to promote human flourishing?
RLHF is bad
Section TL;DR: RLHF does not solve the alignment problem because humans can’t provide good-enough feedback fast enough.
The basic principle of RLHF is that a human looks at an action proposed by the AI, evaluates what the consequences of that action might be, and then decides if it’s good or bad.
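Concretely, RLHF implementations usually fit a reward model to human comparisons between pairs of outputs (the Bradley-Terry setup), then optimize the policy against that learned reward. A minimal sketch of the preference-fitting step, where the linear model, features, and data are all made up for illustration:

```python
import math

# Toy reward model fit to pairwise human preferences, as in the
# reward-modeling stage of RLHF. Everything here is illustrative:
# actions are 2-dimensional feature vectors, the model is linear.

w = [0.0, 0.0]  # reward model parameters

def reward(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Each record: (features of the action the human preferred,
#               features of the action the human rejected).
preferences = [([1.0, 0.0], [0.0, 1.0])] * 50

lr = 0.1
for _ in range(100):
    for good, bad in preferences:
        # Bradley-Terry: P(human prefers `good`) = sigmoid(r(good) - r(bad))
        p = 1.0 / (1.0 + math.exp(-(reward(good) - reward(bad))))
        # gradient ascent on the log-likelihood of the human's choice
        for i in range(len(w)):
            w[i] += lr * (1.0 - p) * (good[i] - bad[i])

# After training, the model assigns higher reward to the kind of
# action humans kept preferring.
```

Note what this buys and what it doesn’t: the learned reward only encodes which of two *observed* actions a human judged better, which is exactly where the speed and evaluation-quality problems below bite.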
First problem: in an unrestricted AI vs AI conflict, humans can’t respond quickly enough, so RLHF is of ~no value in this scenario.
Second problem: in an unrestricted AI vs AI conflict, humans cannot meaningfully evaluate the consequences of the actions. It’s an adversarial conflict: the enemy AI is supposed to get confused and harmed, so how can humans possibly evaluate whether the harm is strategically targeted correctly at the enemy without splashing unnecessarily onto humans?
Third problem: it is unclear whether the first unrestricted AI vs AI conflict will involve the winning side responsibly using RLHF, rather than it being e.g. a duct-taped AI-based scammer and a duct-taped AI-based hustler fighting it out.
All of these are “minor” problems in the sense that they just mean RLHF will fail to work rather than that RLHF will destroy the world. However, RLHF has three more sinister problems:
RLHF is the worst
Section TL;DR: RLHF makes the alignment problem worse because it advances AI and covers up misalignment.
The first sinister problem is that RLHF makes AI more useful, so AI companies can get ahead by adopting it. This means more AI capabilities development and more AI implementation and more people using AI, which shortens the time until we have an unrestricted AI vs AI conflict.
The second sinister problem is that people think RLHF might solve the alignment problem.
As mentioned in the beginning, I think the intuition goes that neural networks have a personality trait which we call “alignment”, caused by the correspondence between their values and our values. But “their values” only really makes sense after an unrestricted AI vs AI conflict, since without such conflicts, AIs are just gonna propagate energy to whichever constraints we point them at, so this whole worldview is wrong.
But that worldview implies that while AI might theoretically destroy humanity, we can keep this in check as AI develops, and so we should conclude that solving the alignment problem is unnecessary as long as the AIs perform actions that we approve of.
If the people who hold this worldview would otherwise contribute to solving the alignment problem, or at least not stand in the way of those who do, it would not be a problem. But that seems unlikely in general.
The third and final sinister problem is that RLHF hides whatever problems the people who try to solve alignment could try to address.
How to count alignment progress
Section TL;DR: Solving the alignment problem is about developing technology that prefers good things to bad things.
Consider spambots. We don’t want them around (human values), but they pop up to earn money (instrumental convergence). You can use RLHF to make an AI that identifies and removes spambots, for instance by giving it moderator powers on a social media website and evaluating its chains of thought, and you can use RLHF to make spambots, for instance by having people rate how human their text looks and how much it makes them want to buy products/fall for scams/whatever. I think it’s generally agreed that the latter is easier than the former.
Spambots aren’t the only thing an AI can do. But they are part of the great wave of stuff unleashed by AIs. The alignment problem is the extent to which this wave harms society (as spambots do) vs helps society. It’s your job to decide for yourself whether other AI activities like character.ai, Midjourney, Copilot, etc. help humanity thrive or hurt humanity. It’s your job to decide which AIs have sufficiently many adversarial dynamics that they are relevantly indicative of alignment progress. But the critical thing is that it doesn’t make sense to count the immediate goodness/badness of the actions as separate from their overall impact on society, because the core of RLHF is to make AI actions look good and not bad to you.
I don’t think the point of RLHF ever was value alignment, and I doubt this is what Paul Christiano and others intended RLHF to solve. RLHF might be useful in worlds without capabilities and deception discontinuities (plausibly ours), because we are less worried about sudden ARA, and more interested in getting useful behavior from models before we go out with a whimper.
This theory of change isn’t perfect. There is an argument that RLHF was net-negative, and this argument has been had.
My point is that you are assessing RLHF using your model of AI risk, so the disagreement here might actually be unrelated to RLHF and dissolve if you and the RLHF progenitors shared a common position on AI risk.
“Requires” seems like a very strong word here, especially since we currently live in a world which contains adversarial conflict between not-perfectly-coherent entities that are definitely not unbounded utility maximizers.
I find it plausible that “coherent unbounded utility maximizer” is the general direction the incentive gradient points as the cost of computation approaches zero, but it’s not clear to me that that constraint is the strongest one in the regime of realistic amounts of computation in the rather finite looking universe we live in.
Well, that and balance-of-power dynamics where if one party starts to pursue domination by any means necessary the other parties can cooperate to slap them down.
I guess? The current big adversaries are not exactly limited right now in terms of being able to destroy each other, the main difficulty is destroying each other without being destroyed in turn.
I’m not sure about that. One dynamic of current-line AI is that it is pretty good at increasing the legibility of complex systems, which seems like it would advantage large adversaries over small ones relative to a world without such AI.
That doesn’t seem to be an argument for the badness of RLHF specifically, nor does it seem to be an argument for AIs being forced to develop into unrestricted utility maximizers.
Agreed, adding affordances for people in general to do things means that some of them will be able to do bad things, and some of the ones that become able to do bad things will in fact do so.
I do think we will see many unrestricted AI vs AI conflicts, at least by a narrow meaning of “unrestricted” that means something like “without a human in the loop”. By the definition of “pursuing victory by any means necessary”, I expect that a lot of the dynamics that work to prevent humans or groups of humans from waging war by any means necessary against each other (namely that when there’s too much collateral damage, outside groups slap down the ones causing the collateral damage) will continue to work when you s/human/AI/.
I’m still not clear on how unrestricted conflict forces AIs to develop into unrestricted utility maximizers on a relevant timescale.
Alright, time to disagree with both @faul_sname and @tailcalled, while also agreeing with some of each of their points, in order to explain my world view which differs importantly from both of theirs!
Topic 1: Offense-Defense Balance
Unfortunately, until the world has acted to patch up some terrible security holes in society, we are all in a very fragile state. Currently, as of mid-2024, all nations on Earth have done a terrible job at putting preventative safety measures in place to protect against biorisk. The cost of doing this would be trivial compared to military expenses, even compared to militaries of small nations. So, I presume that the lack of such reasonable precautions is due to some combination of global decision makers having:
lack of belief in the existence of the risk
lack of belief that reducing the risk would be politically expedient
lack of knowledge about how to effectively counter the risk
lack of knowledge that it would be relatively cheap and easy to put preventative measures in place
corruption (combined with ignorance and/or laziness and/or stupidity), such that they aren’t even motivated to learn what would be useful actions to take to benefit their country which would also benefit themselves (safety from bioweapons is good for rich people too!)
Furthermore, this situation is greatly exacerbated by AI. I have been working on AI Biorisk Evals with SecureBio for nearly a year now. As models increase in general capabilities, so too do they incidentally get more competent at assisting with the creation of bioweapons. It is my professional opinion that they are currently providing non-zero uplift over a baseline of ‘bad actor with web search, including open-access scientific papers’. They are still far from some theoretical ceiling of maximal uplift, but the situation is getting worse rapidly. Other factors are at work making this situation worse, such as microbiology technology getting cheaper, more reliable, more available, and easier to use. Also, the rapid progress of wetlab automation technology.
Overall, this category of attack alone is rapidly making it much much easier for small actors (e.g. North Korea, or a well-funded terrorist organization) to plausibly make existential threats against large powerful actors (e.g. the United States). Why? Because ethnic-targeting and/or location-targeting of a bioweapon is possible, and getting easier as the advising AI gets smarter.
Importantly, a single attack could be completely devastating. If it were also difficult to trace the origin, then most of the population of the United States would be dead before anyone had figured out where the attack originated from. This makes it hard to plausibly threaten retaliatory mutual destruction.
Currently, there aren’t other technologies that allow for such cheap, devastating, hard-to-trace attacks. In the future, technological advancement might unlock more. For instance, nanotech or nano-bio-weapon-combos. This doesn’t seem like a near-term threat currently, but it’s hard to foresee how technology might advance once AI gets powerful enough to accelerate research in a general way.
Topic 2: De-confusing intent-alignment versus value-alignment versus RLHF
I think it’s important to distinguish which things we’re talking about. For more detail, see this post by Seth Herd.
RLHF is a tool, with certain inherent flaws, that can be directed to a variety of different ends.
Intent-alignment means getting an AI to do what you intend for it to do. Making it into a good obedient tool/servant which operates in accordance with your expectations. It doesn’t surprise you by taking irreversible actions that you actually disapprove of once you learn of them. It’s a good genie, not a bad genie.
Value-alignment means trying to make the AI act (relatively independently, using its own judgement) in accordance with specified values. These values could be the developer’s values, or the developer’s estimate of humanity’s values, or their country’s values, or their political party’s values. Whatever. The point is that a choice is made by the developer about a set of values, and then the developer tries to get an agent to semi-independently work to alter the world in accordance with the specified values. This can involve a focus exclusively on end-states, or may also specify a limited set of acceptable means by which to make changes in the world to actualize the values.
Topic 3: Threat modeling of AI agents
End-state-focused Value-aligned Highly-Effective-Optimizer AI Agents (hereafter Powerful AI Agents) with the primary value of win-at-all-costs are extremely dangerous. The more you restrict the means by which the agents act, or the amount they must defer to supervisors before acting, the less effective at shaping the world they will be. If you are in a desperate existential conflict, this difference seems likely to be key, and thus to present strong pressure to remove the limiting control measures. I don’t think you’ll find much disagreement with this hypothetical being a bad state of affairs to be in. What people tend to disagree on are the factors involved in getting to that world state. I’ll try to outline some of the disagreements in this area (rather than focusing on my own beliefs).
Some people don’t believe it will be possible in the next 20 years (or longer!) to create Powerful AI Agents.
some say that they don’t believe the ‘highly effective’ part will be possible, even if the value-aligned part is.
some of these say that intelligence, even if there is such a thing, and even if it were possible to grant such a thing to an AI such that the AI had more of it than a human, wouldn’t even be useful. This involves some sort of belief in declining returns to intelligence, and limitations on possible sets of actions by highly intelligent actors (even AI able to rapidly increase their populations and to work at superhuman speeds).
some say that while superhumanly effective general AI is possible in theory, there is basically no chance that humanity develops it within the next 20 (or 50 or 100 or 1000) years.
some say that the value-aligned part will fail, and be more likely to fail the more effective the agents are. Thus, that any agent competent enough to be decisively useful in a large scale conflict will destroy the developer/operator as well as the operator’s enemies.
Some people don’t believe that we will get to Powerful AI Agents before we’ve already arrived at other world states that make it unlikely we will continue to proceed on a trajectory towards Powerful AI Agents.
some say that we have already reached (or will soon reach) the point of civilization-scale risk from weaponization / misuse of AIs well short of the theoretical Powerful AI Agents. For more details on such risks, see Topic 1.
Some predict that these large scale risks will destroy civilization before we have a chance to develop Powerful AI Agents.
Some predict that the presence of such large concrete threats will force an international coalition which effectively enforces global control over AI development and civilization scale weapons.
Some predict that all relevant actors will find the civilization-scale risks which weaponization of AI would lead to so abhorrent that all such actors will voluntarily abstain from developing and using it. They predict that this ethical fortitude will hold even in the face of imminent destruction by violent conflict, or that such destructive conflict will never occur between the relevant actors. They also predict that the set of relevant actors will remain small (e.g. just the largest state actors, not terrorists or small states). [author’s note: I am unable to avoid mentioning that I am rolling my eyes disbelievingly while writing this point.]
Some predict that having such powerful weapons available will enable one powerful actor (or team of actors) to gain a decisive strategic military advantage sufficiently strong that they will be able to force surrender or utterly annihilate their enemies. They predict that once it is clear that this option is available, the actors likely to be the ones in this position (current top 5 candidates, in descending order of probability according to the author: US, UK, China, Russia, India) will seize the opportunity and succeed. Then there will be no large scale AI-vs-AI conflict because a single state actor will control the entire world and rule sufficiently competently / harshly that no rebel faction will be able to gain sufficient power to launch a relevant counterattack.
some predict that the powerful state actors will foresee the risk of escalating conflicts, and decide to form a sufficiently strong and well-enforced treaty/coalition that we rein in the already-in-play civilizational-scale risks and act to prevent the emergence of new ones. This includes monitoring and regulating each other and all countries / companies / academic institutions / etc. all around the world in order to prevent further research into AI or any sort of self-replicating weapons or other cheap and difficult-to-restrict weapons of mass destruction.
for an example of this idea see: https://arxiv.org/abs/2310.09217
note that while this does seem like a fairly optimistic take, it is not nearly as implausible as the voluntary-abstention prediction from point 2.a.iii. This proposal would involve coercive enforcement of global bans via mandatory inspection, and promised military action against defectors.
Some say that intelligent agents being dangerously powerful is good actually, and won’t be a risky situation.
some deny the hypothesis of the orthogonality of values and intelligence, and say that a highly intelligent AI will converge on human-ish values and treat humanity nicely
some say that they value intelligent beings generally, rejecting human chauvinism. They therefore value the hypothetical future Powerful AI Agents so much that their creation and existence outweighs the likely extinction of humanity. (e.g. some e/acc supporters)
Agreed.
I appreciate that. I also really like the NAO project, which is also a SecureBio thing. Good work y’all!
Yeah, if your threat model is “AI can help people do more things, including bad things” that is a valid threat model and seems correct to me. That said, my world model has a giant gaping hole where one would expect an explanation for why “people can do lots of things” hasn’t already led to a catastrophe (it’s not like the bio risk threat model needs AGI assistance, a couple undergrad courses and some lab experience reproducing papers should be quite sufficient).
In any case, I don’t think RLHF makes this problem actively worse, and it could plausibly help a bit though obviously the help is of the form “adds a trivial inconvenience to destroying the world”.
If you replace “a trajectory towards powerful AI agents” with “a trajectory towards powerful AI agents that was foreseen in 2024 and could be meaningfully changed in predictable ways by people in 2024 using information that exists in 2024″ that’s basically my position.
As someone who does disagree with this:
I think the disagreement point is this:
I have several cruxes/reasons for why I disagree:
I think I have quite a lot less probability on unrestrained AI conflict than @tailcalled does.
I disagree with the assumption of this:
Because I think that a lot of the reason the search came up empty is that people were attempting to solve the problem too far ahead of the actual risk.
Also, I think it matters a lot which specific utility function is at play here, analogously to how fixing a variable N to a specific number or function tells you a lot more about what will happen than reasoning about N in the abstract.
3. I think the world isn’t as offense-dominant as LWers/EAs tend to think, and that while some level of offense outpacing defense is quite plausible, I don’t think it’s as extreme as “defense is essentially pointless.”
I just want to say that I am someone who is afraid that the world is currently in a very offense-dominant strategic position currently, but I don’t think defense is pointless at all. I think it’s quite tractable and should be heavily invested in! Let’s get some d/acc going people!
In fact, a lot of my hope for good outcomes for the future route through Good Actors (probably also making a good profit) using powerful tool-AI to do defensive acceleration of R&D in a wide range of fields. Automated Science-for-Good, including automated alignment research. Getting there without the AI causing catastrophe in the meantime is a challenge, but not an intractable one.
Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.
I mean, if your definition of values doesn’t make sense for real systems, then that’s a problem with your definition. As a hypothesis describing reality, “alignment trait makes AI not splash harm on humans” is coherent enough. So the question is: how do you know it is unlikely to happen?
First, “alignment is easy” is compatible with “we need to keep the set of big adversaries small”. But more generally, without numbers it seems like a generalized anti-future-technology argument—what’s stopping human-regulation mechanisms from solving this adversarial problem, that didn’t stop them from solving previous adversarial problems?
Not necessarily? It’s not inconceivable for future defense to be more effective than offense (trivially true if “defense” includes not giving AI to attackers). It’s kind of required for any future where humans have more power than in the present day?
But if you just naively take the values that are appropriate outside of a life-and-death conflict and apply them to a life-and-death conflict, you’re gonna lose. In that case, RLHF just makes you an irrelevant player, and if you insist on applying it to military/police technology, it’s necessary for AI safety to pivot to addressing rogue states or gangsters.
Which again makes RLHF really really bad because we shouldn’t have to work with rogue states or gangsters to save the world. Don’t cripple the good guys.
If you propose a particular latent variable that acts in a particular way, that is a lot of complexity, and you need a strong case to justify it as likely.
Human-regulation mechanisms could plausibly solve this problem by banning chip fabs. The issue is we use chip fabs for all sorts of things so we don’t want to do that unless we are truly desperate.
Idk. Big entities have a lot of security vulnerabilities which could be attacked by AIs. But I guess one could argue the surviving big entities are red-teaming themselves hard enough to be immune to these. Perhaps most significant is the interactions between multiple independent big things, since they could be manipulated to harm the big things.
Small adversaries currently have a hard time exploiting these security vulnerabilities because intelligence is really expensive, but once intelligence becomes too cheap to meter, that is less of a problem.
You could heavily restrict the availability of AI but this would be an invasive possibility that’s far off the current trajectory.