WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)
(Edit: My current guess for the number of full-time equivalents doing safety work at OpenAI (e.g. if someone spends 50% of their time on work that a researcher fully focused on capabilities would do and 50% on alignment work, then we count them as 0.5 full-time equivalents) is around 10, maybe a bit less, though I might be wrong here.)
I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).
The real question for Habryka is why does he think that it’s bad for WebGPT to be built in order to get truthful AI? Like, isn’t solving that problem quite a significant thing already for alignment?
WebGPT is approximately “reinforcement learning on the internet”.
There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think “reinforcement learning on the internet” is approximately the worst direction for modern AI to go in terms of immediate risks.
I don’t think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions for AI to go. And my guess is the majority of the effect of this research will be to cause more people to pursue this direction in the future (Adept.AI seems to be pursuing a somewhat similar approach).
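(To make “reinforcement learning on the internet” concrete, here is a minimal illustrative sketch of the kind of browsing loop being discussed; the helper names `policy`, `search_api`, `fetch_page`, and `reward_model` are hypothetical stand-ins, not OpenAI's actual code. The point is that the model emits open-ended commands, including following arbitrary links, and the only training signal is approval of the final answer.)

```python
# Illustrative sketch only, not OpenAI's implementation. All helpers are
# hypothetical: `policy` proposes text commands, `search_api` wraps a search
# engine, `fetch_page` downloads a URL, `reward_model` scores final answers.

def run_browsing_episode(question, policy, search_api, fetch_page, reward_model,
                         max_steps=20):
    """One rollout: the model issues free-form browsing commands until it answers."""
    context = [f"Question: {question}"]
    links = []  # URLs surfaced by the most recent search
    for _ in range(max_steps):
        command = policy.next_command(context)  # e.g. "Search: ...", "Click: 2", "Answer: ..."
        if command.startswith("Search:"):
            results = search_api(command[len("Search:"):].strip())
            links = [r["url"] for r in results]
            context.append("Results: " + "; ".join(r["title"] for r in results))
        elif command.startswith("Click:"):
            # The model can follow arbitrary links it finds, not just curated documents.
            index = int(command[len("Click:"):].strip())
            context.append("Page: " + fetch_page(links[index])[:2000])
        elif command.startswith("Answer:"):
            answer = command[len("Answer:"):].strip()
            # The reward is approval of the final answer (human raters or a reward
            # model trained on them), not a check on side effects of the browsing.
            return answer, reward_model.score(question, answer)
    return None, 0.0  # no answer within the step budget
```

Training then amounts to optimizing `policy` against the returned reward over many such rollouts, which is the sense in which the setup rewards agentic behavior in an environment with real side effects.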
Edit: Jacob does talk about this a bit in a section I had forgotten about in the truthful LM post:
Another concern is that working on truthful LMs may lead to AI being “let out of the box” by encouraging research in which models interact with the external world agentically, in the manner of WebGPT.
I think this concern is worth taking seriously, but that the case for it is weak:
As AI capabilities improve, the level of access to the external world required for unintended model behavior to cause harm goes down. Hence access to the external world needs to be heavily restricted in order to have a meaningful safety benefit, which imposes large costs on research that are hard to justify.
I am in favor of carefully and conservatively evaluating the risks of unintended model behavior before conducting research, and putting in place appropriate monitoring. But in the short term, this seems like an advantage of the research direction rather than a disadvantage, since it helps surface risks while the stakes are still low, build institutional capacity for evaluating and taking into account these risks, and set good precedents.
In case this does turn out to be more of a concern upon reflection, there are other approaches to truthful AI that involve less agentic interaction with the external world than continuing in the style of WebGPT.
There is still an argument that there will be a period during which AI is capable enough to cause serious damage, but not capable enough to escape from sandboxed environments, and that setting precedents could worsen the risks posed during this interval. I don’t currently find this argument persuasive, but would be interested to hear if there is a more persuasive version of it. That said, one bright line that stands out is training models to perform tasks that actually require real-world side effects, and I think it makes sense to think carefully before crossing that line.
I don’t think I would phrase the problem as “letting the AI out of the box” so much as “training an AI in a context where agency is strongly rewarded and where there are a ton of permanent side effects”.
I find the point about “let’s try to discover the risky behavior as early as possible” generally reasonable, and am in-favor of doing this kind of work now instead of later, but I think in that case we need to put in quite strong safeguards and make it very clear that quite soon we don’t want to see more research like this, and I don’t think the WebGPT work got that across.
I don’t understand this point at all:
As AI capabilities improve, the level of access to the external world required for unintended model behavior to cause harm goes down. Hence access to the external world needs to be heavily restricted in order to have a meaningful safety benefit, which imposes large costs on research that are hard to justify.
This says to me “even very little access to the external world will be sufficient for capable models to cause great harm, so we have to restrict access to the external world a lot, therefore… we won’t do that because that sounds really inconvenient for my research”. Like, yes, more direct access to the external world is one of the most obvious ways AIs can cause more harm and learn more agentic behavior. Boxing is costly.
The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying “well, seems like the cost is kinda high so we won’t do it” seems like exactly the kind of attitude that I am worried will cause humanity to go extinct.
Separately, there is also no other research I am aware of that is training AI as directly on the internet (except maybe at Adept.ai), so I don’t really buy that currently at the margin the cost of avoiding research like this would be very high, either for capabilities or safety.
But I might also be completely misunderstanding this section. I also don’t really understand why you only get a safety benefit when you restrict access a lot. Seems like you also get a safety benefit earlier, by just making it harder for the AI to build a good model of the external world and to learn heuristics for manipulating people, etc.
There is still an argument that there will be a period during which AI is capable enough to cause serious damage, but not capable enough to escape from sandboxed environments, and that setting precedents could worsen the risks posed during this interval.
I mean, isn’t this the mainline scenario of most prosaic AI Alignment research? A lot of the current plans for AI Alignment consist of taking unaligned AIs, boxing them, and then trying to use them to do better AI Alignment research despite them being somewhat clearly unaligned, but unable to break out of the box.
The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying “well, seems like the cost is kinda high so we won’t do it” seems like exactly the kind of attitude that I am worried will cause humanity to go extinct.
When you say “good things to keep an AI safe” I think you are referring to a goal like “maximize capability while minimizing catastrophic alignment risk.” But in my opinion “don’t give your models access to the internet or anything equally risky” is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this “boxing benefit” (as claimed by the quote you are objecting to).
I assume the harms you are pointing to here are about setting expectations+norms about whether AI should interact with the world in a way that can have effects. But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can’t access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic. So the possible norms you are gesturing at preserving seem like they are probably net negative to me, because the main effect of these norms is on how strong an AI has to be before people consider it dangerous, not the relevance to an alignment strategy that (to me) doesn’t seem very workable.
I think the argument “don’t do things that could lead to low-stakes failures because then people will get in the habit of allowing failure” is sometimes right but often wrong. I think examples of reward hacking in a successor to WebGPT would have a large effect on reducing risk via learning and example, while having essentially zero direct costs. You say this requires “strong protections” presumably to avoid the extrapolation out to catastrophe, but you don’t really get into any quantitative detail and when I think about the numbers on this it looks like the net effect is positive. I don’t think this is close to the primary benefit of this work, but I think it’s already large enough to swamp the costs. I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI’s activities).
I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don’t understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it’s an important disagreement.
There is a lot of disagreement about what alignment research is useful. For example, much of the work I consider useful you consider ~useless, and much of the work you consider useful I consider ~useless. But I think the more interesting disagreement is whether the work helped, and focusing on whether it was net negative seems rhetorically relevant but not very relevant to the cost-benefit analysis. This is related to the last point. If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs. (This is also related to the argument here which I disagree with very strongly, but might be the kind of intuition you are drawing on.)
I don’t think “your AI wants to kill you but it can’t get out of the box so it helps you with alignment instead” is the mainline scenario. You should be building an AI that wouldn’t stab you if your back was turned and it was holding a knife, and if you can’t do that then you should not build the AI. I believe all the reasonable approaches to prosaic AI alignment involve avoiding that situation, and the question is how well you succeed. I agree you want defense in depth and so you should also not give your AI a knife while you are looking away, but again (i) web access is really weaksauce compared to the opportunities you want to give your AI in order for it to be useful, and “defense in depth” doesn’t mean compromising on protections that actually matter like intelligence in order to buy a tiny bit of additional security, (ii) “make sure all your knives are plastic” is a pretty lame norm that is more likely to make it harder to establish clarity about risks than to actually help once you have AI systems who would stab you if they got the chance.
It’s very plausible the core disagreement here may be something like “how useful is it for safety if people try to avoid giving their AI access to the internet.” It’s possible that after thinking more about the argument that this is useful I might change my mind. I don’t know if you have any links to someone making this argument. I think there are more and less useful forms of boxing and “your AI can’t browse the internet” is one of the forms that makes relatively little sense. (I think the versions that make most sense are way more in the weeds about details of the training setup.) I think that many kinds of improvements in security and thoughtfulness about training setup make much more sense (though are mostly still lower-order terms).
I think the best argument for work like WebGPT having harms in the same order of magnitude as opportunity cost is that it’s a cool thing to do with AI that might further accelerate interest in the area. I’m much less sure how to think about these effects and I could imagine it carrying the day that publication of WebGPT is net negative. (Though this is not my view.)
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn’t have happened.
If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.
Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I’ve seen (as has most of OpenAI’s work). In-general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researchers at other AGI companies like DeepMind, and probably ~100x more efficient than researchers at most AI research organizations (like Facebook AI).
WebGPT strikes me as on the worse side of OpenAI capabilities research in terms of accelerating timelines (since I think it pushes us into a more dangerous paradigm that will become dangerous earlier, and because I expect it to be the kind of thing that could very drastically increase economic returns from AI). And then it also has the additional side-effect of pushing us into a paradigm of AIs that are much harder to align, and so doing alignment work in that paradigm will be slower (as has I think a bunch of the RLHF work, though there I think there is a more reasonable case for a commensurate benefit in terms of the technology also being useful for AI Alignment).
I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think “We have powerful AI systems but haven’t deployed them to do stuff they are capable of” is a very short-term kind of situation and not particularly desirable besides.
I’m not sure what you are comparing RLHF or WebGPT to when you say “paradigm of AIs that are much harder to align.” I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling but I think that’s the wrong comparison point barring a degree of coordination that is much larger than what is needed to avoid scaling up models past dangerous thresholds, (ii) I think you are wrong about the dynamics of deceptive alignment under existing mitigation strategies and that scaling up generative modeling to the point where it is transformative is considerably more likely to lead to deceptive alignment than using RLHF (primarily via involving much more intelligent models).
Something I learned today that might be relevant: OpenAI was not the first organization to train transformer language models with search engine access to the internet. Facebook AI Research released their own paper on the topic six months before WebGPT came out, though the paper is surprisingly uncited by the WebGPT paper.
Generally I agree that hooking language models up to the internet is terrifying, despite the potential improvements for factual accuracy. Paul’s arguments seem more detailed on this and I’m not sure what I would think if I thought about them more. But the fact that OpenAI was following rather than leading the field would be some evidence against WebGPT accelerating timelines.
I did not know!
However, I don’t think this is really the same kind of reference class in terms of risk. It looks like the search engine access for the Facebook case is much more limited and basically just consisted of them appending a number of relevant documents to the query, instead of the model itself being able to send various commands that include starting new searches and clicking on links.
It does generate the query itself, though:
A search query generator: an encoder-decoder Transformer that takes in the dialogue context as input, and generates a search query. This is given to the black-box search engine API, and N documents are returned.
Does it itself generate the query, or is it a separate trained system? I was a bit confused about this in the paper.
You’d think they’d train the same model weights and just make it multi-task with the appropriate prompting, but no, that phrasing implies that it’s a separate finetuned model, to the extent that that matters. (I don’t particularly think it does matter because whether it’s one model or multiple, the system as a whole still has most of the same behaviors and feedback loops once it gets more access to data or starts being trained on previous dialogues/sessions—how many systems are in your system? Probably a lot, depending on your level of analysis. Nevertheless...)
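(For contrast, a rough sketch of the more constrained “retrieve then respond” pipeline being described in the quote above; the function and model names here are hypothetical, not FAIR's actual code. A separate query-generation model produces one search string, the top N documents come back from a black-box search API, and they are simply appended to the context before the dialogue model responds; there is no command loop and no link-following.)

```python
# Illustrative sketch of a retrieve-then-respond pipeline; names are hypothetical.

def respond_with_retrieval(dialogue_context, query_model, search_api,
                           response_model, n_docs=5):
    """Single fixed pass: generate one query, fetch N documents, condition on them."""
    query = query_model.generate(dialogue_context)      # separate finetuned model
    documents = search_api(query)[:n_docs]              # black-box search engine API
    augmented = dialogue_context + ["Document: " + doc for doc in documents]
    # The response model never issues further searches and never clicks links;
    # the retrieved text is just appended to its input.
    return response_model.generate(augmented)
```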
But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can’t access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.
I do think we are likely to be in a bad spot, and talking to people at OpenAI, DeepMind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I do sure feel unhappy that their plan seems to be banking on this kind of terrifying situation, which is part of why I am so pessimistic about the likelihood of doom.
If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn’t rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren’t ensuring that, and by-default will hope that they can get far enough ahead with boxing-like strategies.
talking to people at OpenAI, DeepMind and Anthropic [...]
If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn’t rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren’t ensuring that, and by-default will hope that they can get far enough ahead with boxing-like strategies.
… Who are you talking to? I’m having trouble naming a single person at either of OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don’t know them that well). At DeepMind there’s a small minority who think about boxing, but I think even they wouldn’t think of this as a major aspect of their plan.
I agree that they aren’t aiming for a “much more comprehensive AI alignment solution” in the sense you probably mean it but saying “they rely on boxing” seems wildly off.
My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent aligned systems and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent aligned systems.
Here is an example quote from the latest OpenAI blogpost on AI Alignment:
Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.
This sounds super straightforwardly to me like the plan of “we are going to train non-agentic AIs that will help us with AI Alignment research, and will limit their ability to influence the world, by e.g. not giving them access to the internet”. I don’t know whether “boxing” is the exact right word here, but it’s the strategy I was pointing to here.
The immediately preceding paragraph is:
Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems are easier to align than general-purpose systems or systems much smarter than humans.
I would have guessed the claim is “boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned”, rather than “after training, the AI system might be trying to pursue its own goals, but we’ll ensure it can’t accomplish them via boxing”. But I can see your interpretation as well.
Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
I agree that “train a system with internet access, but then remove it, then hope that it’s safe” doesn’t really make much sense. In-general, I expect bad things to happen during training, and separately, a lot of the problem that I have with training things on the internet is that it’s an environment that seems like it would incentivize a lot of agency and make supervision really hard because you have a ton of permanent side effects.
Oh you’re making a claim directly about other people’s approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree).
Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
I agree that “train a system with internet access, but then remove it, then hope that it’s safe”, doesn’t really make much sense.
I was suggesting that the plan was “train a system without Internet access, then add it at deployment time” (aka “box the AI system during training”). I wasn’t at any point talking about WebGPT.
I don’t think “your AI wants to kill you but it can’t get out of the box so it helps you with alignment instead” is the mainline scenario. You should be building an AI that wouldn’t stab you if your back was turned and it was holding a knife, and if you can’t do that then you should not build the AI.
That’s interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talk to Carl Shulman he (if I recall correctly) said things like “we’ll just have AIs competing against each other and box them and make sure they don’t have long-lasting memory and then use those competing AIs to help us make progress on AI Alignment”. Buck’s post on “The prototypical catastrophic AI action is getting root access to its datacenter” also suggests to me that the “AI gets access to the internet” scenario is a thing that he is pretty concerned about.
More broadly, I remember that Carl Shulman said that he thinks that the reference class of “violent revolutions” is generally one of the best reference classes for forecasting whether an AI takeover will happen, and that a lot of his hope comes from just being much better at preventing that kind of revolution, by making it harder by e.g. having AIs rat out each other, not giving them access to resources, resetting them periodically, etc.
I also think that many AI Alignment schemes I have heard about rely quite a bit on preventing an AI from having long-term memory or generally being able to persist over multiple instantiations, which becomes approximately impossible if an AI just has direct access to the internet.
I think we both agree that in the long-run we want to have an AI that we can scale up much more and won’t stab us in the back even when much more powerful, but my sense is outside of your research in-particular, I haven’t actually seen anyone work on that in a prosaic context, and my model of e.g. OpenAI’s safety team is that they are indeed planning to rely on having a lot of very smart and not-fully-aligned AIs do a lot of work for us, with a lot of that work happening just at the edge of where the systems are really capable, but not able to overthrow all of us.
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like “you are in a secure box and can’t get out,” they are mostly facts about all the other AI systems you are dealing with.
That said, I think you are overestimating how representative these are of the “mainline” hope most places, I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet—just over human evaluations of the quality of answers or browsing behavior).
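(To spell out the parenthetical about what the RL signal actually is, here is a minimal sketch of the standard pairwise-preference loss used to fit a reward model to human comparisons of final answers; the code is illustrative and uses hypothetical names, but the loss itself is the usual Bradley-Terry-style objective. The thing being optimized is human approval of the output, not any long-term outcome measured on the internet.)

```python
# Illustrative sketch, hypothetical names; `reward_model` maps a (question, answer)
# pair to a scalar score (a torch tensor).
import torch.nn.functional as F

def preference_loss(reward_model, question, preferred_answer, rejected_answer):
    """Pairwise-preference loss: push the score of the human-preferred answer
    above the score of the rejected one. RL (or rejection sampling) then
    optimizes the policy against the learned scores."""
    score_preferred = reward_model(question, preferred_answer)
    score_rejected = reward_model(question, rejected_answer)
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```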
I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don’t understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it’s an important disagreement.
I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action-space while interfacing with humans and getting direct reinforcement on approval is exactly the kind of work that will likely make AIs more strategic and smarter, create deceptive alignment, and produce models that humans don’t understand.
I do indeed think the WebGPT work is relevant to both increasing capabilities and increasing the likelihood of deceptive alignment (as is most reinforcement learning that directly pushes on human approval, especially in a large action space with permanent side effects).
I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI’s activities).
Huh, I definitely expect it to drive >0.1% of the risk of OpenAI’s activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI’s research staff, while probably substantially increasing OpenAI’s ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to me as a baseline estimate, even if you don’t think it’s a particularly risky direction of research (given that it’s consuming about 4-5% of OpenAI’s research staff).
I think the direct risk of OpenAI’s activities is overwhelmingly dominated by training new smarter models and by deploying the public API that could potentially be used in unanticipated ways.
I agree that if we consider indirect risks broadly (including e.g. “this helps OpenAI succeed or raise money and OpenAI’s success is dangerous”) then I’d probably move back towards “what % of OpenAI’s activities is it.”
When you say “good things to keep an AI safe” I think you are referring to a goal like “maximize capability while minimizing catastrophic alignment risk.” But in my opinion “don’t give your models access to the internet or anything equally risky” is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this “boxing benefit” (as claimed by the quote you are objecting to).
I don’t think the choice is between “smart and boxed” or “less smart and less boxed”. Intelligence (e.g. especially domain knowledge) is not 1-dimensional, and boxing is largely a means of controlling what kind of knowledge the AI has. We might prefer AI savants that are super smart about some task-relevant aspects of the world and ignorant about a lot of other strategically-relevant aspects of the world.
I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn’t have happened.
Just to make sure I follow: You told them at the time that it was overdetermined that the risks weren’t significant? And if you had instead told them that the risks were significant, they wouldn’t have done it?
As in: there seem to have generally been informal discussions about how serious this risk was, and I participated in some of those discussions (though I don’t remember which discussions were early on vs prior to paper release vs later). In those discussions I said that I thought the case for risk seemed very weak.
If the case for risk had been strong, I think there are a bunch of channels by which the project would have been less likely. Some involve me—I would have said so, and I would have discouraged rather than encouraged the project in general since I certainly was aware of it. But most of the channels would have been through other people—those on the team who thought about it would have come to different conclusions, internal discussions on the team would have gone differently, etc.
Obviously I have only indirect knowledge about decision-making at OpenAI so those are just guesses (hence “I believe that it likely wouldn’t have happened”). I think the decision to train WebGPT would be unusually responsive to arguments that it is bad (e.g. via Jacob’s involvement) and indeed I’m afraid that OpenAI is fairly likely to do risky things in other cases where there are quite good arguments against.
Glad to know at least that “Reinforcement Learning but in a highly dynamic and hard-to-measure and uncontrollable environment” is as unsafe as my intuition says it is.
like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons
This also seems like an odd statement—it seems reasonable to say “I think the net effect of InstructGPT is to boost capabilities” or even “If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT”. But it feels like you’re assuming some deep insight into the intentions of the people working on it, and making a much stronger statement than “I think OpenAI’s alignment team is making bad prioritisation decisions”.
Like, reading the author list of InstructGPT, there are obviously a bunch of people on there who care a bunch about safety including I believe the first two authors—it seems pretty uncharitable and hostile to say that they were motivated by a desire to boost capabilities, even if you think that was a net result of their work.
(Note: My personal take is to be somewhat confused, but to speculate that InstructGPT was mildly good for the world? And that a lot of the goodness comes from field building of getting more people investing in good quality RLHF.)
Yeah, I agree that I am doing reasoning on people’s motivations here, which is iffy and given the pushback I will be a bit more hesitant to do, but also like, in this case reasoning about people’s motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.
I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by safety concerns, since the case for it strikes me as so weak, and the risk as somewhat obviously high, so I am still trying to process that and will probably make some kind of underlying update.
I do think overall I’ve had much better success at predicting the actions of the vast majority of people at OpenAI, including a lot of safety work, by thinking of them as being motivated by doing cool capability things, sometimes with a thin safety veneer on top, instead of being motivated primarily by safety. For example, I currently think that the release strategy for the GPT models of OpenAI is much better explained by OpenAI wanting a moat around their language model product than by safety concerns. I spent many hours trying to puzzle over the reasons why they chose this release strategy, and ultimately concluded that the motivation was primarily financial/competitive-advantage related, and not related to safety (despite people at OpenAI claiming otherwise).
I also overall agree that trying to analyze motivations of people is kind of fraught and difficult, but I also feel pretty strongly that it’s now been many years where people have been trying to tell a story of OpenAI leadership being motivated by safety stuff, with very little action to actually back that up (and a massive amount of harm in terms of capability gains), and I do want to be transparent that I no longer really believe the stated intentions of many people working there.
For people viewing on the Alignment Forum, there is a separate thread on this question here. (Edit: my link to LessWrong is automatically converted to an Alignment Forum link, you will have to navigate there yourself.)
InstructGPT also seems to be almost fully capabilities research
I don’t understand this at all. I see InstructGPT as an attempt to make a badly misaligned AI (GPT-3) corrigible. GPT-3 was never at a dangerous capability level, but it was badly misaligned; InstructGPT made a lot of progress.
I think the primary point of InstructGPT is to make the GPT-API more useful to end users (like, it just straightforwardly makes OpenAI more money, and I don’t think the metric being optimized is something particularly close to corrigibility).
I don’t think InstructGPT has made the AI more corrigible in any obvious way (unless you are using the word corrigible very very broadly). In-general, I think we should expect reinforcement learning to make AIs more agentic and less corrigible, though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility (but I don’t think we’ve done that yet).
I think it’s the primary reason why OpenAI leadership cares about InstructGPT and is willing to dedicate substantial personnel and financial resources to it. I expect that when OpenAI leadership is making tradeoffs of different types of training, the primary question is commercial viability, not safety.
Similarly, if InstructGPT would hurt commercial viability, I expect it would not get deployed (I think individual researchers would likely still be able to work on it, though I think they would be unlikely to be able to hire others to work on it, or get substantial financial resources to scale it).
though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility
Any particular research directions you’re optimistic about?
WebGPT seemed like one of the most in-expectation harmful projects that OpenAI has worked on, with no (to me) obvious safety relevance, so my guess is I would still mostly categorize the things you list under the first misconception as capabilities research. InstructGPT also seems to be almost fully capabilities research (like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons)
(Edit: My current guess for full-time equivalents who are doing safety work at OpenAI (e.g. if someone is doing 50% work that a researcher fully focused on capabilities would do and 50% on alignment work, then we count them as 0.5 full-time equivalents) is around 10, maybe a bit less, though I might be wrong here.)
I was the project lead on WebGPT and my motivation was to explore ideas for scalable oversight and truthfulness (some further explanation is given here).
The real question for Habryka is why does he think that it’s bad for WebGPT to be built in order to get truthful AI? Like, isn’t solving that problem quite a significant thing already for alignment?
WebGPT is approximately “reinforcement learning on the internet”.
There are some very minimal safeguards implemented (search via Bing API, but the AI can click on arbitrary links), but I do indeed think “reinforcement learning on the internet” is approximately the worst direction for modern AI to go in terms of immediate risks.
I don’t think connecting GPT-3 to the internet is risky at current capability levels, but pushing AI in the direction of just hooking up language models with reinforcement learning to a browser seems like one of the worst directions for AI to go. And my guess is the majority of the effect of this research will be to cause more people to pursue this direction in the future (Adept.AI seems to be pursuing a somewhat similar approach).
Edit: Jacob does talk about this a bit in a section I had forgotten about in the truthful LM post:
I don’t think I would phrase the problem as “letting the AI out of the box” and more “training an AI in a context where agency is strongly rewarded and where there are a ton of permanent side effects”.
I find the point about “let’s try to discover the risky behavior as early as possible” generally reasonable, and am in-favor of doing this kind of work now instead of later, but I think in that case we need to put in quite strong safeguards and make it very clear that quite soon we don’t want to see more research like this, and I don’t think the WebGPT work got that across.
I don’t understand this point at all:
This says to me “even very little access to the external world will be sufficient for capable models to cause great harm, so we have to restrict access to the external world a lot, therefore… we won’t do that because that sounds really inconvenient for my research”. Like, yes, more direct access to the external world is one of the most obvious ways AIs can cause more harm and learn more agentic behavior. Boxing is costly.
The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying “well, seems like the cost is kinda high so we won’t do it” seems like exactly the kind of attitude that I am worried will cause humanity to go extinct.
Separately, there is also no other research I am aware of that is training AI as directly on access to the internet (except maybe at Adept.ai), so I don’t really buy that currently at the margin the cost of avoiding research like this would be very high, either for capabilities or safety.
But I might also be completely misunderstanding this section. I also don’t really understand why you only get a safety benefit when you restrict access a lot. Seems like you also get a safety benefit earlier, by just making it harder for the AI to build a good model of the external world and not learning heuristics for manipulating people, etc.
I mean, isn’t this the mainline scenario of most prosaic AI Alignment research? A lot of the current plans for AI Alignment consist of taking unaligned AIs, boxing them, and then trying to use them to do better AI Alignment research despite them being somewhat clearly unaligned, but unable to break out of the box.
When you say “good things to keep an AI safe” I think you are referring to a goal like “maximize capability while minimizing catastrophic alignment risk.” But in my opinion “don’t give your models access to the internet or anything equally risky” is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this “boxing benefit” (as claimed by the quote you are objecting to).
I assume the harms you are pointing to here are about setting expectations+norms about whether AI should interact with the world in a way that can have effects. But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can’t access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic. So the possible norms you are gesturing at preserving seem like they are probably net negative to me, because the main effect of these norms is on how strong an AI has to be before people consider it dangerous, not the relevance to an alignment strategy that (to me) doesn’t seem very workable.
I think the argument “don’t do things that could lead to low-stakes failures because then people will get in the habit of allowing failure” is sometimes right but often wrong. I think examples of reward hacking in a successor to WebGPT would have a large effect on reducing risk via learning and example, while having essentially zero direct costs. You say this requires “strong protections” presumably to avoid the extrapolation out to catastrophe, but you don’t really get into any quantitative detail and when I think about the numbers on this it looks like the net effect is positive. I don’t think this is close to the primary benefit of this work, but I think it’s already large enough to swamp the costs. I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI’s activities).
I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don’t understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it’s an important disagreement.
There is a lot of disagreement about what alignment research is useful. For example, much of the work I consider useful you consider ~useless, and much of the work you consider useful I consider ~useless. But I think the more interesting disagreement is did the work help, and focusing on net negativeness seems rhetorically relevant but not very relevant to the cost-benefit analysis. This is related to the last point. If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs. (This is also related to the argument here which I disagree with very strongly, but might be the kind of intuition you are drawing on.)
I don’t think “your AI wants to kill you but it can’t get out of the box so it helps you with alignment instead” is the mainline scenario. You should be building an AI that wouldn’t stab you if your back was turned and it was holding a knife, and if you can’t do that then you should not build the AI. I believe all the reasonable approaches to prosaic AI alignment involve avoiding that situation, and the question is how well you succeed. I agree you want defense in depth and so you should also not give your AI a knife while you are looking away, but again (i) web access is really weaksauce compared to the opportunities you want to give your AI in order for it to be useful, and “defense in depth” doesn’t mean compromising on protections that actually matter like intelligence in order to buy a tiny bit of additional security, (ii) “make sure all your knives are plastic” is a pretty lame norm that is more likely to make it harder to establish clarity about risks than to actually help once you have AI systems who would stab you if they got the chance.
It’s very plausible the core disagreement here may be something like “how useful is it for safety if if people try to avoid giving their AI access to the internet.” It’s possible that after thinking more about the argument that this is useful I might change my mind. I don’t know if you have any links to someone making this argument. I think there are more and less useful forms of boxing and “your AI can’t browse the internet” is one of the forms that makes relatively little sense. (I think the versions that make most sense are way more in the weeds about details of the training setup.) I think that many kinds of improvements in security and thoughtfulness about training setup make much more sense (though are mostly still lower-order terms).
I think the best argument for work like WebGPT having harms in the same order of magnitude as opportunity cost is that it’s a cool thing to do with AI that might further accelerate interest in the area. I’m much less sure how to think about these effects and I could imagine it carrying the day that publication of WebGPT is net negative. (Though this is not my view.)
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn’t have happened.
Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I’ve seen (as has most of OpenAIs work). In-general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researchers at other AGI companies like Deepmind, and probably ~100x more efficient than researchers at most AI research organizations (like Facebook AI).
WebGPT strikes me on the worse side of OpenAI capabilities research in terms of accelerating timelines (since I think it pushes us into a more dangerous paradigm that will become dangerous earlier, and because I expect it to be the kind of thing that could very drastically increase economical returns from AI). And then it also has the additional side-effect of pushing us into a paradigm of AIs that are much harder to align and so doing alignment work in that paradigm will be slower (as has I think a bunch of the RLHF work, though there I think there is a more reasonable case for a commensurate benefit there in terms of the technology also being useful for AI Alignment).
I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think “We have powerful AI systems but haven’t deployed them to do stuff they are capable of” is a very short-term kind of situation and not particularly desirable besides.
I’m not sure what you are comparing RLHF or WebGPT to when you say “paradigm of AIs that are much harder to align.” I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling but I think that’s the wrong comparison point barring a degree of coordination that is much larger than what is needed to avoid scaling up models past dangerous thresholds, (ii) I think you are wrong about the dynamics of deceptive alignment under existing mitigation strategies and that scaling up generative modeling to the point where it is transformative is considerably more likely to lead to deceptive alignment than using RLHF (primarily via involving much more intelligent models).
Something I learned today that might be relevant: OpenAI was not the first organization to train transformer language models with search engine access to the internet. Facebook AI Research released their own paper on the topic six months before WebGPT came out, though the paper is surprisingly uncited by the WebGPT paper.
Generally I agree that hooking language models up to the internet is terrifying, despite the potential improvements for factual accuracy. Paul’s arguments seem more detailed on this and I’m not sure what I would think if I thought about them more. But the fact that OpenAI was following rather than leading the field would be some evidence against WebGPT accelerating timelines.
I did not know!
However, I don’t think this is really the same kind of reference class in terms of risk. It looks like the search engine access for the Facebook case is much more limited and basically just consisted of them appending a number of relevant documents to the query, instead of the model itself being able to send various commands that include starting new searches and clicking on links.
It does generate the query itself, though:
Does it itself generate the query, or is it a separate trained system? I was a bit confused about this in the paper.
You’d think they’d train the same model weights and just make it multi-task with the appropriate prompting, but no, that phrasing implies that it’s a separate finetuned model, to the extent that that matters. (I don’t particularly think it does matter because whether it’s one model or multiple, the system as a whole still has most of the same behaviors and feedback loops once it gets more access to data or starts being trained on previous dialogues/sessions—how many systems are in your system? Probably a lot, depending on your level of analysis. Nevertheless...)
I do think we are likely to be in a bad spot, and talking to people at OpenAI, Deepmind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I do sure feel unhappy that their plan seems to be to be banking on this kind of terrifying situation, which is part of why I am so pessimistic about the likelihood of doom.
If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn’t rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren’t ensuring that, and by-default will hope that they can get far enough ahead with boxing-like strategies.
… Who are you talking to? I’m having trouble naming a single person at either of OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don’t know them that well). At DeepMind there’s a small minority who think about boxing, but I think even they wouldn’t think of this as a major aspect of their plan.
I agree that they aren’t aiming for a “much more comprehensive AI alignment solution” in the sense you probably mean it but saying “they rely on boxing” seems wildly off.
My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent aligned systems and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent aligned systems.
Here is an example quote from the latest OpenAI blogpost on AI Alignment:
This sounds super straightforwardly to me like the plan of “we are going to train non-agentic AIs that will help us with AI Alignment research, and will limit their ability to influence the world, by e.g. not giving them access to the internet”. I don’t know whether “boxing” is the exact right word here, but it’s the strategy I was pointing to here.
The immediately preceding paragraph is:
I would have guessed the claim is “boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned”, rather than “after training, the AI system might be trying to pursue its own goals, but we’ll ensure it can’t accomplish them via boxing”. But I can see your interpretation as well.
Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
I agree that “train a system with internet access, but then remove it, then hope that it’s safe”, doesn’t really make much sense. In-general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet is that it’s an environment that seems like it would incentivize a lot of agency and make supervision really hard because you have a ton of permanent side effects.
Oh you’re making a claim directly about other people’s approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree).
I was suggesting that the plan was “train a system without Internet access, then add it at deployment time” (aka “box the AI system during training”). I wasn’t at any point talking about WebGPT.
That’s interesting. I do think this is true about your current research direction (which I really like about your research and I do really hope we can get there), but when I e.g. talk to Carl Shulman he (if I recall correctly) said things like “we’ll just have AIs competing against each other and box them and make sure they don’t have long-lasting memory and then use those competing AIs to help us make progress on AI Alignment”. Buck’s post on “The prototypical catastrophic AI action is getting root access to its datacenter” also suggests to me that the “AI gets access to the internet” scenario is a thing that he is pretty concerned about.
More broadly, I remember that Carl Shulman said that he thinks that the reference class of “violent revolutions” is generally one of the best reference classes for forecasting whether an AI takeover will happen, and that a lot of his hope comes from just being much better at preventing that kind of revolution, by making it harder by e.g. having AIs rat out each other, not giving them access to resources, resetting them periodically, etc.
I also think that many AI Alignment schemes I have heard about rely quite a bit on preventing an AI from having long-term memory or generally be able to persist over multiple instantiations, which becomes approximately impossible if an AI just has direct access to the internet.
I think we both agree that in the long-run we want to have an AI that we can scale up much more and won’t stab us in the back even when much more powerful, but my sense is outside of your research in-particular, I haven’t actually seen anyone work on that in a prosaic context, and my model of e.g. OpenAI’s safety team is indeed planning to rely on having a lot of very smart and not-fully-aligned AIs do a lot of work for us, with a lot of that work happening just at the edge of where the systems are really capable, but not able to overthrow all of us.
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like “you are in a secure box and can’t get out,” they are mostly facts about all the other AI systems you are dealing with.
That said, I think you are overestimating how representative these are of the “mainline” hope in most places; I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet, just over human evaluations of the quality of answers or browsing behavior).
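To spell out that parenthetical with a toy sketch (the names here, such as Episode and human_prefers, are illustrative stand-ins, not the actual WebGPT code): as I understand the paper, the reward signal is computed from a human’s (or a reward model’s) comparison of final answers, not from the downstream consequences of the browsing actions themselves.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    final_answer: str                             # the sourced answer the model produced
    actions: list = field(default_factory=list)   # searches, clicks, quotes along the way

# Toy stand-in for a human labeler (or a reward model trained on human comparisons).
def human_prefers(answer_a: str, answer_b: str) -> bool:
    return len(answer_a) >= len(answer_b)  # crude proxy; a real labeler judges quality and citations

def reward_from_human_evaluation(episode: Episode, other_answer: str) -> float:
    # WebGPT-style signal: the episode is scored by comparing its final answer to
    # another answer; browsing actions only matter via the answer they produce.
    return 1.0 if human_prefers(episode.final_answer, other_answer) else 0.0

def reward_from_long_term_consequences(world_state_after: dict) -> float:
    # The contrasting setup: reward computed from lasting effects of the actions
    # on the world, which is much harder to supervise and rewards agency.
    return float(world_state_after.get("goal_achieved", False))

ep = Episode(final_answer="A sourced, two-paragraph answer.", actions=["search", "click", "quote"])
print(reward_from_human_evaluation(ep, other_answer="A short answer."))  # 1.0
```

The point of the contrast is just that in the first case the supervision signal is attached to an artifact a human can inspect, while in the second it is attached to side effects out in the world.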
I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action space, while interfacing with humans and getting direct reinforcement on approval, is exactly the kind of work that will likely make AIs more strategic and smarter, create deceptive alignment, and produce models that humans don’t understand.
I do indeed think the WebGPT work is relevant both to increasing capabilities and to increasing the likelihood of deceptive alignment (as is most reinforcement learning that directly pushes on human approval, especially in a large action space with permanent side effects).
Huh, I definitely expect it to drive >0.1% of OpenAI’s activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI’s research staff, while probably substantially increasing OpenAI’s ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to me as a baseline estimate, even if you don’t think it’s a particularly risky direction of research (given that it’s consuming about 4-5% of OpenAI’s research staff).
I think the direct risk of OpenAI’s activities is overwhelmingly dominated by training new smarter models and by deploying AI publicly, where it could potentially be used in unanticipated ways.
I agree that if we consider indirect risks broadly (including e.g. “this helps OpenAI succeed or raise money and OpenAI’s success is dangerous”) then I’d probably move back towards “what % of OpenAI’s activities is it.”
I don’t think the choice is between “smart and boxed” and “less smart and less boxed”. Intelligence (especially domain knowledge) is not 1-dimensional, and boxing is largely a means of controlling what kind of knowledge the AI has. We might prefer AI savants that are super smart about some task-relevant aspects of the world and ignorant about a lot of other strategically relevant aspects of the world.
Just to make sure I follow: You told them at the time that it was overdetermined that the risks weren’t significant? And if you had instead told them that the risks were significant, they wouldn’t have done it?
As in: there seem to have generally been informal discussions about how serious this risk was, and I participated in some of those discussions (though I don’t remember which discussions were early on vs prior to paper release vs later). In those discussions I said that I thought the case for risk seemed very weak.
If the case for risk had been strong, I think there are a bunch of channels by which the project would have been less likely. Some involve me: I would have said so, and I would have discouraged rather than encouraged the project in general, since I certainly was aware of it. But most of the channels would have been through other people—those on the team who thought about it would have come to different conclusions, internal discussions on the team would have gone differently, etc.
Obviously I have only indirect knowledge about decision-making at OpenAI, so those are just guesses (hence “I believe that it likely wouldn’t have happened”). I think the decision to train WebGPT would be unusually responsive to arguments that it is bad (e.g. via Jacob’s involvement), and indeed I’m afraid that OpenAI is fairly likely to do risky things in other cases where there are quite good arguments against them.
Glad to know at least that “Reinforcement Learning but in a highly dynamic and hard-to-measure and uncontrollable environment” is as unsafe as my intuition says it is.
Letting GPT-3 interact with the internet seems pretty bad to me
This also seems like an odd statement—it seems reasonable to say “I think the net effect of InstructGPT is to boost capabilities” or even “If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT”. But it feels like you’re assuming some deep insight into the intention behind the people working on it, and making a much stronger statement than “I think OpenAI’s alignment team is making bad prioritisation decisions”.
Like, reading the author list of InstructGPT, there are obviously a bunch of people on there who care a bunch about safety, including (I believe) the first two authors. It seems pretty uncharitable and hostile to say that they were motivated by a desire to boost capabilities, even if you think that was the net result of their work.
(Note: My personal take is to be somewhat confused, but to speculate that InstructGPT was mildly good for the world? And that a lot of the goodness comes from field building of getting more people investing in good quality RLHF.)
Yeah, I agree that I am doing reasoning about people’s motivations here, which is iffy, and given the pushback I will be a bit more hesitant to do it. But also, in this case reasoning about people’s motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a bunch on their motivations.
I am honestly a bit surprised to see that WebGPT was as much driven by people who I do know reasonably well and who seem to be driven primarily by safety concerns, since the case for it strikes me as so weak and the risk as somewhat obviously high. I am still trying to process that and will probably make some kind of underlying update.
I do think overall I’ve had much better success at predicting the actions of the vast majority of people at OpenAI, including a lot of safety work, by thinking of them as being motivated by doing cool capability things, sometimes with a thin safety veneer on top, rather than as being motivated primarily by safety. For example, I currently think that the release strategy for OpenAI’s GPT models is much better explained by OpenAI wanting a moat around their language model product than by safety concerns. I spent many hours trying to puzzle over the reasons why they chose this release strategy, and ultimately concluded that the motivation was primarily financial/competitive-advantage related, and not related to safety (despite people at OpenAI claiming otherwise).
I also overall agree that trying to analyze people’s motivations is kind of fraught and difficult, but I also feel pretty strongly that it’s now been many years of people trying to tell a story of OpenAI leadership being motivated by safety, with very little action to actually back that up (and a massive amount of harm in terms of capability gains), and I do want to be transparent that I no longer really believe the stated intentions of many people working there.
That seems weirdly strong. Why do you think that?
For people viewing on the Alignment Forum, there is a separate thread on this question here. (Edit: my link to LessWrong is automatically converted to an Alignment Forum link; you will have to navigate there yourself.)
I moved that thread over to the AIAF as well!
I don’t understand this at all. I see InstructGPT as an attempt to make a badly misaligned AI (GPT-3) corrigible. GPT-3 was never at a dangerous capability level, but it was badly misaligned; InstructGPT made a lot of progress.
I think the primary point of InstructGPT is to make the GPT API more useful to end users (like, it just straightforwardly makes OpenAI more money, and the metric being optimized is not, I think, anything particularly close to corrigibility).
I don’t think InstructGPT has made the AI more corrigible in any obvious way (unless you are using the word corrigible very, very broadly). In general, I think we should expect reinforcement learning to make AIs more agentic and less corrigible, though there is some hope we can come up with clever things in the future that will allow us to use reinforcement learning to also increase corrigibility (but I don’t think we’ve done that yet).
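To be concrete about what metric is actually being optimized here (going off the published InstructGPT description; the function names and the beta value below are my own toy choices, not OpenAI’s code): a reward model is fit to human comparisons of outputs, and the policy is then RL-finetuned against that reward model with a KL penalty back toward the base model. Roughly:

```python
import math

def reward_model_pairwise_loss(r_preferred: float, r_other: float) -> float:
    # Reward-model objective as described in the InstructGPT paper: given a pair
    # of outputs ranked by a labeler, push the scalar reward of the preferred
    # output above the reward of the other one (negative log-sigmoid of the gap).
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_other))))

def rl_objective(learned_reward: float, kl_to_base_model: float, beta: float = 0.02) -> float:
    # The policy is then trained to maximize the learned reward minus a KL penalty
    # that keeps it close to the original language model.
    # beta here is an illustrative value, not the coefficient OpenAI used.
    return learned_reward - beta * kl_to_base_model

print(round(reward_model_pairwise_loss(1.3, 0.4), 3))  # ~0.341: ranking respected, small loss
print(round(reward_model_pairwise_loss(0.4, 1.3), 3))  # ~1.241: ranking violated, larger loss
```

Nothing in that objective references corrigibility directly; it optimizes a learned model of labeler approval.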
See also a previous discussion between me and Paul about whether it makes sense to say that InstructGPT is more “aligned” than GPT-3, which maybe explored some related disagreements: https://www.lesswrong.com/posts/auKWgpdiBwreB62Kh/sam-marks-s-shortform?commentId=ktxyWjAaQXGBwvitf
Could you clarify what you mean by “the primary point” here? As in: the primary actual effect? Or the primary intended effect? From whose perspective?
I think it’s the primary reason why OpenAI leadership cares about InstructGPT and is willing to dedicate substantial personnel and financial resources to it. I expect that when OpenAI leadership is making tradeoffs between different types of training, the primary question is commercial viability, not safety.
Similarly, if InstructGPT would hurt commercial viability, I expect it would not get deployed (I think individual researchers would likely still be able to work on it, though I think they would be unlikely to be able to hire others to work on it, or get substantial financial resources to scale it).
Any particular research directions you’re optimistic about?