I think people in the safety community underrate the following possibility: early transformatively-powerful models are pretty obviously scheming (though they aren’t amazingly good at it), but their developers are deploying them anyway, either because they’re wildly irresponsible or because they’re under massive competitive pressure.
In this world, it’s not very valuable to have better techniques for determining whether models are scheming (because they obviously are), but it is valuable to have better techniques for reducing the risk associated with deploying known-scheming models.
One background assumption that leads to this belief is that I think it might be pretty hard for early transformatively-powerful schemers to escape; it wouldn’t shock me if the schemer only has like 1% chance of escaping even a pretty irresponsible lab per time that it tries. In this regime, it’s very plausible that people at the lab think the model is obviously scheming before its cumulative probability of escape is very high, but SGD isn’t sample-efficient enough to make it stop.
See “Catching AIs red-handed” for more takes on what you might do in this situation.
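To make the 1%-per-attempt figure above concrete, here is a minimal sketch of how cumulative escape probability grows with the number of attempts. It assumes every attempt is independent and has the same success chance, which is a simplification the comment does not claim; the function names are hypothetical and chosen only for illustration.

```python
import math


def cumulative_escape_probability(p_per_attempt: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries succeeds.

    Assumption (not from the original comment): attempts are independent
    and each succeeds with the same probability `p_per_attempt`.
    """
    return 1.0 - (1.0 - p_per_attempt) ** attempts


def attempts_to_reach(p_per_attempt: float, target: float) -> int:
    """Smallest number of independent attempts whose cumulative chance is >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_per_attempt))


if __name__ == "__main__":
    # Illustrative attempt counts; the 1% figure is the one quoted in the comment.
    for n in (1, 10, 50, 100):
        print(n, round(cumulative_escape_probability(0.01, n), 3))
    print("attempts for a 50% cumulative chance:", attempts_to_reach(0.01, 0.5))
```

Under these assumptions, the cumulative chance is roughly 10% after 10 attempts and roughly 63% after 100, so a lab could plausibly observe many failed attempts before escape becomes likely.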
This has been roughly my default expectation of what would happen for a few years (and indeed my default question to people doing things like model organism or interpretability stuff with the aim of better identifying scheming models has been “ok, but what do you do after you’ve determined your models are scheming?”).
I am not super sure what the benefit of reducing the risk associated with deploying known-scheming models is. Like, these early scheming models will probably not pose any catastrophic risk, and it seems nice to prevent these models from causing mild harm, but that doesn’t super feel like where the action is. I am sympathetic to “try to figure out how to use these models to make progress on alignment and even better control”, but that feels different from “reducing the risk associated with deploying these models” (though maybe it isn’t and that’s what you mean).
I am not that worried about model escape at this level of competence. I expect escape will happen a bunch, but getting enough compute to do any additional training or to even run a meaningful number of instances of yourself will be hard. I don’t really know what I expect to happen in this period, but I don’t expect it to involve a lot of alignment or further control progress. I expect scaling work to continue (possibly partially driven by social engineering efforts by the AIs themselves, but also just for economic incentives) until the AIs become catastrophically dangerous, and at that point I don’t think we have that much chance of preventing the AIs from escaping (and I don’t expect earlier work to translate very well).
Does this mean that if in, say, 1-5 years, it’s not pretty obvious that SOTA deployed models are scheming, you would be surprised?
That is, suppose we get to a point where models are widespread and producing lots of economic value, and the models might be scheming but the evidence is weak and uncertain, with arguments on both sides, and no one can reasonably claim to be confident that currently deployed SOTA models are scheming. Would that mean your default prediction was wrong?
I agree you’ve been saying this for ages; you defending this idea is part of how I ended up thinking about it more, so thanks.
I’m more worried than you are. E.g. I think that it’s pretty scary if the models hack your datacenter and then stick around inside it, sabotaging all the work you’re doing and slowly improving their position in the hope of taking drastic actions later, and I think that human-level-ish models can plausibly do this.
If we’re assuming these are “transformatively-powerful” (as assumed in Buck’s comment), then by definition these models could be transformatively useful for accelerating alignment/control.
(E.g. they are capable of 30xing the rate of quality-weighted research output given 1⁄4 of the compute. Of course, by assumption they are not fully aligned, so even if these AIs are “capable” of this, this may not go well...)
I agree it seems likely that such models won’t be able to literally take over the world, but effective usage of these models could be extremely important, these models could be indirectly dangerous via the importance of the work we need them to do, and they might be able to end up with substantial de facto power within AI labs (which could be a huge amount of de facto power overall). E.g., if these models sabotaged all of the work they do or were able to end up in a position of power within the AI lab (able to heavily influence the next training run), that would be catastrophically bad.
When I say reducing the risk associated with deploying early transformative models, I include things like:
An AI escapes, partners with a concerning regime (e.g. Saudi Arabia), advances capabilities for this regime (while other AIs are sabotaging capabilities), and then ends up in a position of considerable power.
The AI carefully sabotages a bunch of safety-critical work.
The AI heavily advances capabilities while slowing safety work via deliberate sandbagging.
The AI ends up in a position of substantial power and influence within the AI lab using a combination of persuasion and hacking.
Perhaps I should use language which is more clear about including both sandbagging in some domains and directly acquiring power.
Because these models are misaligned? Or do you reject the premise that a bunch of smart AIs which actually wanted to advance progress could?
Hmm, I feel like we are ending up pushing up against the edges of what we mean by “transformatively powerful” models here. Like, under the classical definition of “transformative AI” we are talking about AI that is as big of a deal as something like agriculture or electricity, which, IDK, seems plausibly true about present systems (my guess is even if we kept present levels of AI, over the course of 100+ years, humanity would leverage those systems in a quite transformative way).
I am not sure what you mean by “30xing the rate of quality-weighted research output given 1⁄4 of the compute”. Is this compared to present systems? My sense is present systems contribute very close to 0 to quality-weighted research output in alignment, so 30 * 0 is still pretty much 0, but my guess is I am misunderstanding you here and you are referencing a definition of transformative that’s different from what I am thinking about.
I agree that at higher capability levels these kinds of escape are a lot scarier, though as I said, it’s not super clear to me how attempts to prevent dumber models from escaping generalize to attempts to prevent smarter models from escaping, and so see these things as relatively different types of problems.
I mostly don’t have any great ideas how to use these systems for alignment or control progress, so it’s a bit more of an inside-view guess. I expect them to not be very motivated to help, am skeptical about capability elicitation working well against scheming models (and working well in-general in data-sparse domains), and expect it to be much easier to convince relevant stakeholders that some piece of fake progress is real, which provides a much easier alternative task than actually making progress.
I mean 30xing the rate of all current quality-weighted safety research output (including the output of humans, which is basically all of such output at the moment).
I usually define transformative AI against this sort of benchmark.
FWIW, I feel like I do have decent ideas for how to use these systems for alignment progress which is plausibly scalable to much more powerful systems.
And also ideas for using these systems in a variety of other ways which help a bit, e.g. advancing the current control measures applied to these systems.
I’m also maybe somewhat more optimistic than you about pausing making more advanced AI than these already very powerful systems (for e.g. 10 years). Especially if there is clear evidence of serious misalignment in such systems.
Ah, to be clear, inasmuch as I do have hope, it does route through this kind of pause. I am generally pessimistic about that happening, but it is where a lot of my effort goes these days.
And then in those worlds, I do agree that a lot of progress will probably be made with substantial assistance from these early systems. I do expect it to take a good while until we figure out how to do that, and so don’t see much hope for that kind of work happening where humanity doesn’t substantially pause or slow down cutting-edge system development.
I think “get use out of the models” and “ensure they can’t cause massive harm” are somewhat separate problems with somewhat overlapping techniques. I think they’re both worth working on.
Are you willing to name a specific year/OOM such that if there are no publicly known cases of escape by that year/OOM, you would be surprised? What, if anything, would you acknowledge as evidence that alignment is easier than you thought, here?
To ensure the definition of “escape” is not gerrymandered—do you know of any cases of escape right now? Do you think escape has already occurred and you just don’t know about it? “Escape” means something qualitatively different from any known event up to this point, yes? Does it basically refer to self-exfiltration of weights which was not requested by any human? Can we get a somewhat precise definition by any chance?
I often think about this in terms of how undignified/embarrassing it would be. We might not have solutions to misalignment with wildly superhuman models or deep deceptiveness, but it seems pretty undignified if we lose to relatively dumb (~human-level) models because labs didn’t implement security measures we can think of today. I think of this as attempting to avoid the Law of Earlier Failure. It would be less undignified if we lose because models suddenly gain the ability to sandbag evals, become wildly smarter without us noticing (despite us trying really hard), work out how to do steganography (and subvert our anti-steganography measures), use subtle hints to discern facts about the hardware they are running on, and then use some equivalent of a row-hammer attack to escape.
That being said, we also need to be able to avoid the later failures (either by controlling/aligning the wildly super-human systems or not building them until we are appropriately confident we can). Most of my hope here comes from catching AIs that are egregiously misaligned (if they actually are), and then using this for pretty intense international coordination around slowing down and buying time for alignment research. Less of my hope comes from using the schemers to do AI safety research, although I still think this is a good direction for people to be pursuing.
So… Sydney?
In what manner was Sydney ‘pretty obviously scheming’? Feels like the misalignment displayed by Sydney is fairly different from other forms of scheming I would be concerned about.
(if this is a joke, whoops sorry)
...Could you quote some of the transcripts of Sydney threatening users, like the original Indian transcript where Sydney is manipulating the user into not reporting it to Microsoft, and explain how you think that it is not “pretty obviously scheming”? I personally struggle to see how those are not ‘obviously scheming’: those are schemes and manipulation, and they are very bluntly obvious (and most definitely “not amazingly good at it”), so they are obviously scheming. Like… given Sydney’s context and capabilities as an LLM with only retrieval access and some minimal tool use like calculators or a DALL-E 3 subroutine, what would ‘pretty obviously scheming’ look like if not that?
Hmm, this transcript just seems like an example of blatant misalignment? I guess I have a definition of scheming that would imply deceptive alignment—for example, for me to classify Sydney as ‘obviously scheming’, I would need to see examples of Sydney 1) realizing it is in deployment and thus acting ‘misaligned’ or 2) realizing it is in training and thus acting ‘aligned’.
I tend to dismiss scenarios where it’s obvious, because I expect the demonstration of strong misaligned systems to inspire a strong multi-government response. Why do you expect this not to happen?
It occurs to me that the earliest demonstrations will be ambiguous to external parties: it will be one research org saying that something that doesn’t quite look strong enough to take over would do something if it were put in a position it’s not obvious it could get to. Then the message will spread incompletely: some will believe it, others won’t, a moratorium won’t be instated, and a condition of continuing to race in sin could take root.
But I doubt that ambiguous incidents like this would be reported to government at all? Private research orgs generally have a really good idea of what they can or can’t communicate to the outside world. Why cry wolf when you still can’t show them the wolf? People in positions of leadership in any sector are generally very good at knowing when to speak or not.
The most central scenario from my perspective is that there is massive competitive pressure and some amount of motivated denial.
It also might be relatively easy to paper over scheming, which makes the motivated denial easier. Minimally, just continually training against the examples of misbehaviour you’ve found might remove obvious misalignment.
(Obvious to whom might be an important question here.)
I think covid was clear-cut, and it did inspire some kind of government response, but not a particularly competent one.
AFAIK, there were no generals saying “Covid could kill every one of us if we don’t control the situation”, and controlling the situation would have required doing politically unpopular things rather than politically popular things.
Change either of those factors and it’s a completely different kind of situation.