The primary job of OpenAI is to be a clear leader here and do the obvious good things to keep an AI safe, which will hopefully include boxing it. Saying “well, seems like the cost is kinda high so we won’t do it” seems like exactly the kind of attitude that I am worried will cause humanity to go extinct.
When you say “good things to keep an AI safe” I think you are referring to a goal like “maximize capability while minimizing catastrophic alignment risk.” But in my opinion “don’t give your models access to the internet or anything equally risky” is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this “boxing benefit” (as claimed by the quote you are objecting to).
I assume the harms you are pointing to here are about setting expectations+norms about whether AI should interact with the world in a way that can have effects. But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can’t access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic. So the possible norms you are gesturing at preserving seem like they are probably net negative to me, because the main effect of these norms is on how strong an AI has to be before people consider it dangerous, not on their relevance to an alignment strategy that (to me) doesn’t seem very workable.
I think the argument “don’t do things that could lead to low-stakes failures because then people will get in the habit of allowing failure” is sometimes right but often wrong. I think examples of reward hacking in a successor to WebGPT would have a large effect on reducing risk via learning and example, while having essentially zero direct costs. You say this requires “strong protections” presumably to avoid the extrapolation out to catastrophe, but you don’t really get into any quantitative detail and when I think about the numbers on this it looks like the net effect is positive. I don’t think this is close to the primary benefit of this work, but I think it’s already large enough to swamp the costs. I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI’s activities).
I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don’t understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it’s an important disagreement.
There is a lot of disagreement about what alignment research is useful. For example, much of the work I consider useful you consider ~useless, and much of the work you consider useful I consider ~useless. But I think the more interesting disagreement is whether the work helped, and focusing on whether it was net negative seems rhetorically relevant but not very relevant to the cost-benefit analysis. This is related to the last point. If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs. (This is also related to the argument here, which I disagree with very strongly but which might be the kind of intuition you are drawing on.)
I don’t think “your AI wants to kill you but it can’t get out of the box so it helps you with alignment instead” is the mainline scenario. You should be building an AI that wouldn’t stab you if your back was turned and it was holding a knife, and if you can’t do that then you should not build the AI. I believe all the reasonable approaches to prosaic AI alignment involve avoiding that situation, and the question is how well you succeed. I agree you want defense in depth and so you should also not give your AI a knife while you are looking away, but again (i) web access is really weaksauce compared to the opportunities you want to give your AI in order for it to be useful, and “defense in depth” doesn’t mean compromising on protections that actually matter, like intelligence, in order to buy a tiny bit of additional security; (ii) “make sure all your knives are plastic” is a pretty lame norm that is more likely to make it harder to establish clarity about risks than to actually help once you have AI systems that would stab you if they got the chance.
It’s very plausible the core disagreement here may be something like “how useful is it for safety if people try to avoid giving their AI access to the internet.” It’s possible that after thinking more about the argument that this is useful I might change my mind. I don’t know if you have any links to someone making this argument. I think there are more and less useful forms of boxing, and “your AI can’t browse the internet” is one of the forms that makes relatively little sense. (I think the versions that make most sense are way more in the weeds about details of the training setup.) I think that many kinds of improvements in security and thoughtfulness about training setup make much more sense (though are mostly still lower-order terms).
I think the best argument for work like WebGPT having harms in the same order of magnitude as opportunity cost is that it’s a cool thing to do with AI that might further accelerate interest in the area. I’m much less sure how to think about these effects and I could imagine it carrying the day that publication of WebGPT is net negative. (Though this is not my view.)
To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn’t have happened.
If you thought that researchers working on WebGPT were shortening timelines significantly more efficiently than the average AI researcher, then the direct harm starts to become relevant compared to opportunity costs.
Yeah, my current model is that WebGPT feels like some of the most timelines-reducing work that I’ve seen (as does most of OpenAI’s work). In general, OpenAI seems to have been the organization that has most shortened timelines in the last 5 years, with the average researcher seeming ~10x more efficient at shortening timelines than even researchers at other AGI companies like DeepMind, and probably ~100x more efficient than researchers at most AI research organizations (like Facebook AI).
WebGPT strikes me as being on the worse side of OpenAI capabilities research in terms of accelerating timelines (since I think it pushes us into a more dangerous paradigm that will become dangerous earlier, and because I expect it to be the kind of thing that could very drastically increase the economic returns from AI). And then it also has the additional side effect of pushing us into a paradigm of AIs that are much harder to align, so doing alignment work in that paradigm will be slower (as has, I think, a bunch of the RLHF work, though there I think there is a more reasonable case for a commensurate benefit in terms of the technology also being useful for AI Alignment).
I think almost all of the acceleration comes from either products that generate $ and hype and further investment, or more directly from scaleup to more powerful models. I think “We have powerful AI systems but haven’t deployed them to do stuff they are capable of” is a very short-term kind of situation and not particularly desirable besides.
I’m not sure what you are comparing RLHF or WebGPT to when you say “paradigm of AIs that are much harder to align.” I think I probably just think this is wrong, in that (i) you are comparing to pure generative modeling but I think that’s the wrong comparison point barring a degree of coordination that is much larger than what is needed to avoid scaling up models past dangerous thresholds, (ii) I think you are wrong about the dynamics of deceptive alignment under existing mitigation strategies and that scaling up generative modeling to the point where it is transformative is considerably more likely to lead to deceptive alignment than using RLHF (primarily via involving much more intelligent models).
Something I learned today that might be relevant: OpenAI was not the first organization to train transformer language models with search engine access to the internet. Facebook AI Research released their own paper on the topic six months before WebGPT came out, though, surprisingly, that paper is not cited by the WebGPT paper.
Generally I agree that hooking language models up to the internet is terrifying, despite the potential improvements for factual accuracy. Paul’s arguments seem more detailed on this and I’m not sure what I would think if I thought about them more. But the fact that OpenAI was following rather than leading the field would be some evidence against WebGPT accelerating timelines.
I did not know!

However, I don’t think this is really the same kind of reference class in terms of risk. It looks like the search engine access in the Facebook case is much more limited, and basically just consisted of appending a number of relevant documents to the query, instead of the model itself being able to send various commands, including starting new searches and clicking on links. It does generate the query itself, though:
A search query generator: an encoder-decoder Transformer that takes in the dialogue context as input, and generates a search query. This is given to the black-box search engine API, and N documents are returned.
Does it itself generate the query, or is it a separate trained system? I was a bit confused about this in the paper.

You’d think they’d train the same model weights and just make it multi-task with the appropriate prompting, but no, that phrasing implies that it’s a separate finetuned model, to the extent that that matters. (I don’t particularly think it does matter because whether it’s one model or multiple, the system as a whole still has most of the same behaviors and feedback loops once it gets more access to data or starts being trained on previous dialogues/sessions—how many systems are in your system? Probably a lot, depending on your level of analysis. Nevertheless...)
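To make the structural difference concrete, here is a minimal sketch of the two setups as I understand them from this thread. All object and method names (query_model, search_api, browser, etc.) are hypothetical placeholders, not the actual interfaces of either paper.

```python
# Minimal sketch of the two setups discussed above. All names are hypothetical;
# neither function reproduces the real APIs of the Facebook or WebGPT papers.

def fair_style_reply(dialogue_context, query_model, dialogue_model, search_api, n_docs=5):
    """Separate query generator -> one search -> retrieved docs appended to the prompt."""
    query = query_model.generate(dialogue_context)          # separate finetuned encoder-decoder
    docs = search_api.search(query)[:n_docs]                # black-box search engine, top-N documents
    prompt = dialogue_context + "\n\n" + "\n\n".join(docs)  # documents are just extra conditioning text
    return dialogue_model.generate(prompt)

def webgpt_style_reply(question, policy_model, browser, max_steps=20):
    """The policy itself emits browsing commands (search, click, quote, ...) in a loop."""
    observation = browser.reset(question)
    for _ in range(max_steps):
        command = policy_model.generate(observation)        # e.g. "Search ...", "Click 3", "Answer: ..."
        if command.startswith("Answer:"):
            return command                                  # the model decides when to stop browsing
        observation = browser.execute(command)              # the model drives the environment directly
    return policy_model.generate(observation)
```

On this reading, the relevant difference for the risk discussion is the action space: in the first setup the model only produces text that a fixed pipeline turns into a single search, while in the second the model’s outputs are themselves commands executed against the live web.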
But people attempting to box smart unaligned AIs, or believing that boxed AIs are significantly safer because they can’t access the internet, seems to me like a bad situation. An AI smart enough to cause risk with internet access is very likely to be able to cause risk anyway, and at best you are creating a super unstable situation where a lab leak is catastrophic.
I do think we are likely to be in a bad spot, and talking to people at OpenAI, Deepmind and Anthropic (e.g. the places where most of the heavily-applied prosaic alignment work is happening), I do sure feel unhappy that their plan seems to be banking on this kind of terrifying situation, which is part of why I am so pessimistic about the likelihood of doom.
If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn’t rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren’t ensuring that, and by default will hope that they can get far enough ahead with boxing-like strategies.
talking to people at OpenAI, Deepmind and Anthropic [...]
If I had a sense that these organizations are aiming for a much more comprehensive AI Alignment solution that doesn’t rely on extensive boxing I would agree with you more, but I am currently pretty sure they aren’t ensuring that, and by default will hope that they can get far enough ahead with boxing-like strategies.
… Who are you talking to? I’m having trouble naming a single person at either OpenAI or Anthropic who seems to me to be interested in extensive boxing (though admittedly I don’t know them that well). At DeepMind there’s a small minority who think about boxing, but I think even they wouldn’t think of this as a major aspect of their plan.
I agree that they aren’t aiming for a “much more comprehensive AI alignment solution” in the sense you probably mean it, but saying “they rely on boxing” seems wildly off.
My best-but-still-probably-incorrect guess is that you hear people proposing schemes that seem to you like they will obviously not work in producing intent aligned systems and so you assume that the people proposing them also believe that and are putting their trust in boxing, rather than noticing that they have different empirical predictions about how likely those schemes are to produce intent aligned systems.
Here is an example quote from the latest OpenAI blogpost on AI Alignment:
Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.
This sounds super straightforwardly to me like the plan of “we are going to train non-agentic AIs that will help us with AI Alignment research, and will limit their ability to influence the world, by e.g. not giving them access to the internet”. I don’t know whether “boxing” is the exact right word here, but it’s the strategy I was pointing to here.
The immediately preceding paragraph is:

Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems are easier to align than general-purpose systems or systems much smarter than humans.
I would have guessed the claim is “boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned”, rather than “after training, the AI system might be trying to pursue its own goals, but we’ll ensure it can’t accomplish them via boxing”. But I can see your interpretation as well.
Oh, I do think a bunch of my problems with WebGPT stem from the fact that we are training the system with direct internet access.
I agree that “train a system with internet access, but then remove it, then hope that it’s safe” doesn’t really make much sense. In general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet come from it being an environment that seems like it would incentivize a lot of agency and make supervision really hard, because you have a ton of permanent side effects.
Oh you’re making a claim directly about other people’s approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree).
Oh, I do think a bunch of my problems with WebGPT stem from the fact that we are training the system with direct internet access.
I agree that “train a system with internet access, but then remove it, then hope that it’s safe” doesn’t really make much sense.
I was suggesting that the plan was “train a system without Internet access, then add it at deployment time” (aka “box the AI system during training”). I wasn’t at any point talking about WebGPT.
I don’t think “your AI wants to kill you but it can’t get out of the box so it helps you with alignment instead” is the mainline scenario. You should be building an AI that wouldn’t stab you if your back was turned and it was holding a knife, and if you can’t do that then you should not build the AI.
That’s interesting. I do think this is true about your current research direction (which I really like about your research, and I do really hope we can get there), but when I e.g. talked to Carl Shulman he (if I recall correctly) said things like “we’ll just have AIs competing against each other and box them and make sure they don’t have long-lasting memory and then use those competing AIs to help us make progress on AI Alignment”. Buck’s post on “The prototypical catastrophic AI action is getting root access to its datacenter” also suggests to me that the “AI gets access to the internet” scenario is a thing that he is pretty concerned about.
More broadly, I remember that Carl Shulman said that he thinks that the reference class of “violent revolutions” is generally one of the best reference classes for forecasting whether an AI takeover will happen, and that a lot of his hope comes from just being much better at preventing that kind of revolution, by making it harder by e.g. having AIs rat out each other, not giving them access to resources, resetting them periodically, etc.
I also think that many AI Alignment schemes I have heard about rely quite a bit on preventing an AI from having long-term memory or from generally being able to persist across multiple instantiations, which becomes approximately impossible if an AI just has direct access to the internet.
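As a toy illustration of why internet access cuts against this kind of scheme (the interface below is made up for the sketch, not anyone’s actual proposal): reset-based approaches assume nothing survives from one instantiation to the next, and an unrestricted write channel to the web is exactly the kind of thing that lets state survive anyway.

```python
# Toy sketch of the "no persistence across instantiations" assumption.
# make_fresh_model and the action dicts are hypothetical, purely for illustration.

def run_reset_episodes(make_fresh_model, tasks, max_steps=100):
    for task in tasks:
        model = make_fresh_model()   # re-instantiated from fixed weights: no carried-over memory
        scratchpad = []              # episode-local working memory only
        for _ in range(max_steps):
            action = model.act(task, scratchpad)
            scratchpad.append(action)
            if action.get("done"):
                break
        # The scratchpad is dropped here, so by construction nothing persists to
        # the next episode. But if actions can write to the open internet (a
        # pastebin, an email draft, an edit to some page), a later instantiation
        # can read that state back, and the reset no longer enforces statelessness.
```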
I think we both agree that in the long run we want an AI that we can scale up much more and that won’t stab us in the back even when much more powerful. But my sense is that, outside of your research in particular, I haven’t actually seen anyone work on that in a prosaic context, and my model of e.g. OpenAI’s safety team is that they are indeed planning to rely on having a lot of very smart and not-fully-aligned AIs do a lot of work for us, with a lot of that work happening just at the edge of where the systems are really capable, but not able to overthrow all of us.
Even in those schemes, I think the AI systems in question will have much better levers for causing trouble than access to the internet, including all sorts of internal access and their involvement in the process of improving your AI (and that trying to constrain them so severely would mean increasing their intelligence far enough that you come out behind). The mechanisms making AI uprising difficult are not mostly things like “you are in a secure box and can’t get out,” they are mostly facts about all the other AI systems you are dealing with.
That said, I think you are overestimating how representative these are of the “mainline” hope in most places; I think the goal is primarily that AI systems powerful enough to beat all of us combined come after AI systems powerful enough to greatly improve the situation. I also think there are a lot of subtle distinctions about how AI systems are trained that are very relevant to a lot of these stories (e.g. WebGPT is not doing RL over inscrutable long-term consequences on the internet—just over human evaluations of the quality of answers or browsing behavior).
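To spell out the distinction in that parenthetical, here is a sketch with hypothetical names (not the actual WebGPT training code): in a WebGPT-like setup the reward is computed by a learned model of human preferences over the final answer and the visible browsing transcript, rather than by measuring what the episode’s actions later caused out in the world.

```python
# Sketch of the two reward sources being contrasted; all objects are hypothetical.

def human_preference_reward(episode, reward_model):
    """WebGPT-style: the score comes from a reward model trained on human comparisons
    of answers (and the recorded browsing behavior), evaluated at the end of the episode."""
    return reward_model.score(episode.question, episode.final_answer, episode.transcript)

def long_horizon_outcome_reward(episode, world):
    """The contrasting setup the parenthetical rules out: reward derived from the
    downstream, hard-to-inspect consequences of the episode's actions on the internet."""
    return world.measure_downstream_effects(episode.actions, horizon_days=30)
```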
I believe the most important drivers of catastrophic misalignment risk are models that optimize in ways humans don’t understand or are deceptively aligned. So the great majority of risk comes from actions that accelerate those events, and especially making models smarter. I think your threat model here is quantitatively wrong, and that it’s an important disagreement.
I agree with this! But I feel like this kind of reinforcement learning on a basically unsupervisable action-space while interfacing with humans and getting direct reinforcement on approval is exactly the kind of work that will likely make AIs more strategic and smarter, create deceptive alignment, and produce models that humans don’t understand.
I do indeed think the WebGPT work is relevant to both increasing capabilities and increasing likelihood of deceptive alignment (as is most reinforcement learning that directly pushes on human approval, especially in a large action space with permanent side effects).
I think the story would be way different if the actual risk posed by WebGPT was meaningful (say if it were driving >0.1% of the risk of OpenAI’s activities).
Huh, I definitely expect it to drive >0.1% of OpenAI’s activities. Seems like the WebGPT stuff is pretty close to commercial application, and is consuming much more than 0.1% of OpenAI’s research staff, while probably substantially increasing OpenAI’s ability to generally solve reinforcement learning problems. I am confused why you would estimate it at below 0.1%. 1% seems more reasonable to me as a baseline estimate, even if you don’t think it’s a particularly risky direction of research (given that it’s consuming about 4-5% of OpenAI’s research staff).
I think the direct risk of OpenAI’s activities is overwhelmingly dominated by training new smarter models and by publicly deploying AI that could potentially be used in unanticipated ways.
I agree that if we consider indirect risks broadly (including e.g. “this helps OpenAI succeed or raise money and OpenAI’s success is dangerous”) then I’d probably move back towards “what % of OpenAI’s activities is it.”
When you say “good things to keep an AI safe” I think you are referring to a goal like “maximize capability while minimizing catastrophic alignment risk.” But in my opinion “don’t give your models access to the internet or anything equally risky” is a bad way to make that tradeoff. I think we really want dumber models doing more useful things, not smarter models that can do impressive stuff with less resources. You can get a tiny bit of safety by making it harder for your model to have any effect on the world, but at the cost of significant capability, and you would have been better off just using a slightly dumber model with more ability to do stuff. This effect is much bigger if you need to impose extreme limitations in order to get any of this “boxing benefit” (as claimed by the quote you are objecting to).
I don’t think the choice is between “smart and boxed” and “less smart and less boxed”. Intelligence (especially e.g. domain knowledge) is not 1-dimensional; boxing is largely a means of controlling what kind of knowledge the AI has. We might prefer AI savants that are super smart about some task-relevant aspects of the world and ignorant about a lot of other strategically-relevant aspects of the world.
I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn’t have happened.
Just to make sure I follow: You told them at the time that it was overdetermined that the risks weren’t significant? And if you had instead told them that the risks were significant, they wouldn’t have done it?
As in: there seem to have generally been informal discussions about how serious this risk was, and I participated in some of those discussions (though I don’t remember which discussions were early on vs prior to paper release vs later). In those discussions I said that I thought the case for risk seemed very weak.
If the case for risk had been strong, I think there are a bunch of channels by which the project would have been less likely. Some involve me—I would have said so, and I would have discouraged rather than encouraged the project in general since I certainly was aware of it. But most of the channels would have been through other people—those on the team who thought about it would have come to different conclusions, internal discussions on the team would have gone differently, etc.
Obviously I have only indirect knowledge about decision-making at OpenAI so those are just guesses (hence “I believe that it likely wouldn’t have happened”). I think the decision to train WebGPT would be unusually responsive to arguments that it is bad (e.g. via Jacob’s involvement) and indeed I’m afraid that OpenAI is fairly likely to do risky things in other cases where there are quite good arguments against.