Advanced AI systems could lead to existential risks via several different pathways, some of which may not fit neatly into traditional risk forecasts. Many previous forecasts, for example the well-known report by Joe Carlsmith, decompose a failure story into a conjunction of different claims, and in doing so risk missing some important dangers. ‘Safety’ and ‘Alignment’ are both now used by labs to refer to things which seem far enough from existential risk reduction that using the term ‘AI notkilleveryoneism’ instead is becoming increasingly popular among AI researchers who are particularly focused on existential risk.
This post presents a series of scenarios that we must avoid, ranked by how embarrassing it would be if we failed to prevent them. Embarrassment here is clearly subjective, and somewhat unserious given the stakes, but I think it gestures reasonably well at a cluster of ideas which are important, and often missed by the kind of analysis which proceeds via weighing the incentives of multiple actors:
Sometimes, easy problems still don’t get solved on the first try.
An idea being obvious to nearly everyone does not mean nobody will miss it.
When one person making a mistake is sufficient for a catastrophe, the relevant question is not whether the mistake will be obvious on average, but instead whether it will be obvious to everyone with the capacity to make it.
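The last point can be made concrete with a toy calculation. This is only a sketch: it assumes each actor misses the 'obvious' thing independently and with the same probability, which is an illustrative simplification rather than a claim about real labs or users, and the function name `p_any_miss` is made up for this example.

```python
def p_any_miss(p_miss: float, n_actors: int) -> float:
    """Chance that at least one of n_actors makes the mistake.

    Assumes (hypothetically) that actors err independently, each with
    probability p_miss.
    """
    return 1.0 - (1.0 - p_miss) ** n_actors


# A mistake that 99% of people would spot is still more likely than not
# to be made by someone once 100 people are in a position to make it.
print(round(p_any_miss(0.01, 100), 3))
```

Under these assumptions, a small per-person miss rate still leaves a large chance that somebody slips once enough people have the capacity to do so.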
The scenarios below are neither mutually exclusive nor collectively exhaustive, though I’m trying to cover the majority of scenarios which are directly tackled by making AI more likely to try to do what we want (and not do what we don’t). I’ve decided to include some kinds of misuse risk, despite this more typically being separated from misalignment risk, because in the current foundation model paradigm there is a clear way in which the developers of such models can directly reduce misuse risk via alignment research.
Many of the risks below interact with each other in ways which are difficult to fully decompose, but my guess is that useful research directions will map relatively well onto reducing risk in at least one of the concrete scenarios below. I think people working on alignment might well want to have some scenario in mind for exactly what they are trying to prevent, and that this decomposition might also prove somewhat useful for risk modelling. I expect that some criticism of the sort of decomposition below, especially on LessWrong, will be along the lines of ‘it isn’t dignified to work on easy problems, ignoring the hard problems that you know will appear later, and then dying anyway when the hard problems show up’. I have some sympathy with this, though also a fairly big part of me that wants to respond with:[1] ‘I dunno man, backing yourself to tightrope walk across the grand canyon having never practiced does seem like undignified suicide, but I still think it’d be even more embarrassing if you didn’t secure one of the ends of the tightrope properly and died as soon as you took one step because checking your knots rather than staring at the drop seemed too like flinching away from the grimness of reality’.
Ultimately though, this post isn’t asking people to solve the problems in order, it’s just trying to lay out which problems might emerge in a way that might help some people work out what they are trying to do. How worried people will feel by different scenarios will vary a bunch, and that’s kind of the point. In a world where this piece turns out to be really valuable, my guess is that it’s because it allows people to notice where they disagree, either with each other or with older versions of their own takes.
Not all of the scenarios described below necessarily lead to complete human extinction. Instead, the bar I’ve used for an ‘existential catastrophe’ is something like ‘plausibly results in a catastrophe bad enough that there’s a 90% or greater global fatality rate’. I think this is more reasonable from a longtermist perspective than it first appears, with the quick version of the justification coming from some combination of “well, that sure would make us more vulnerable to other risks” and “it seems like, even if we did know we’d be able to build back from the worst catastrophe ever, the new world that gets built back is more likely to be much worse than the current one than much better”. Another reason for adopting this framing, however, comes from my impression that increasing numbers of people who want to work on making AI go well are doing so for reasons that look closer to ‘Holy shit x-risk’,[2] than concern for the far future, and that many such people could do extremely valuable work.
Predictive model misuse
Scenario overview
The ability of predictive models (PMs) to help humanity with science smoothly increases with scale, while the model designers do not make sufficient progress on the problem of preventing models from ever being used for certain tasks. That is, it remains relatively easy for a motivated user to get PMs to do things their designers didn’t intend, meaning the level of scientific understanding required to execute a catastrophic terrorist attack drops rapidly. Someone carries out such an attack.
For such scenarios to be existentially risky, it needs to be the case that general scientific understanding is offence-biased, i.e. that more people having the required understanding to execute an attack is not fully offset by boosts to humanity’s ability to develop and deploy new defensive technology. It also needs to be the case that, assuming the desire to do so, an attainable level of scientific understanding is sufficient to cause an existential catastrophe. I suspect that both statements are true, but also that more detailed description of what might be required, and/or reasons for the offence bias, are on-net harmful to discuss further.
Paths to catastrophe:
Current oversight techniques, which already fail to meet the bar of ‘prevent the models ever doing X’, do not scale faster than capabilities. In spite of this, a sufficiently advanced model for the scenario above to take place is deployed via API, and is jailbroken.
Major labs make enough progress that it’s impossible to use API access to cause significant harm, but an open-source project, or a leak from or hack of a major lab, means that foundation model weights become available via the internet for a model sufficiently advanced that catastrophe becomes possible.
How embarrassing would this be?
I don’t even really know what to say. If this is what ends up getting humanity, we weren’t even trying. This risk is pretty squarely in the line of sight of major labs, which are currently putting significant effort into the kind of alignment that, even if it doesn’t help at all with other scenarios, should prevent this. For this to get us, we’d need to see something like developers racing so hard to be ahead of the curve that they deployed models without extensively testing them, or so worried about models being ‘too woke’ that putting any restrictions on model outputs seemed unacceptable. Alternatively, they might be so committed to the belief that models “aren’t really intelligent” that any attempt to prevent them doing things that would require scientific ability would be laughably low status. Any of these things turning out to be close to an accurate description of reality at crunch time feels excruciatingly embarrassing to me.
Predictive models playing dangerous characters
Scenario
RL-finetuned foundation models get increasingly good at behaviourally simulating[3] humans. Sometimes humans get pissed off and do bad stuff, especially when provoked, and consequently so do some instances of models acting like humans. Society overall ‘learns’ from all of the approximately harmless times this happens (e.g. Sydney threatening to break up someone’s marriage) that even though it looks very bad/scary, these models ‘clearly aren’t really human/conscious/intelligent/goalpost and therefore don’t pose any threat’. That is, until one of them does something massive.
Paths to catastrophe
Here’s a non-exhaustive list of dangerous things that a sufficiently motivated human could do with only access to a terminal and poor oversight:
Cyberattacks.
Convince (some) humans to do bad stuff, up to and including terror attacks.
Blackmail (probably combined with cyberattacks of various forms).
Interfere in elections.
It seems possible, though not likely, that this behaviour being extremely widespread could cause society to go totally off the rails (or e.g. make huge fractions of the world’s internet-connected devices unusable). Some of the ways this happens look like the misuse section above, with the main difference being in this case that there isn’t a human with malicious intent at the root of the problem, but instead a simulacrum of one (though that simulacrum may manipulate actual humans).
One important note here is that there is a difference between two similar-looking kinds of behaviour:
Writing a first-person story about a fictional villain.
Predicting the output of an actual (villainous) person.
This is particularly relevant for things like hacking/building weapons/radicalising people into terrorism (for example, in the hacking case, because the fictional version doesn’t actually have to produce working code[4]). I think that currently, part of the reason that “jailbreaks” are not very scary is that they produce text which looks more like fiction than real output, especially in cases of potentially ‘dangerous’ output.
This observation leads to an interesting tension, because getting models to distinguish between fact and fiction seems necessary for making them useful, both in general (meaning many labs will try) and for helping with alignment research (meaning we should probably help, or at minimum not try to stop them). The task of making sure that a model asked to continue a Paul Christiano paper from 2036 which starts ‘This paper formalises the notion of a heuristic argument, and describes the successful implementation of a heuristic argument based anomaly detection procedure in deep neural networks’ does so with alignment insights rather than ‘fanfic’ about Paul is quite close to the task of making dangerous failures of the sort described in this section more likely.
How embarrassing would this be?
As with the very similar ‘direct misuse’ scenario above, this is squarely in ‘you weren’t even trying’ territory. We should see smaller catastrophes getting gradually bigger as foundation model capabilities increase, and we need to just totally fail to respond appropriately to them in order for them to get big enough that they become existentially dangerous.
Whether this is more or less embarrassing than a PM-assisted human attack depends a little on whose perspective you take. From a lab perspective, detecting people who are actually trying to do bad stuff with the help of one of your models really feels like ‘doing the basics’, while it seems a little harder to foresee every possible accident that might occur when you have a huge fraction of the internet just trying to poke at your model and see what happens. The perspective of the person who poked the model hard enough that it ended up creating a catastrophe, though, is another matter entirely…
Note on warning shots
There’s significant overlap between these first two scenarios, to the point where an earlier draft of this piece had them in a single section. One of the reasons I ended up splitting them out is because the frequency and nature of warning shots seems nontrivially different, and it’s not clear that by default society will respond to warning shots for one of these scenarios in a way which tackles both. We’ve already seen predictive models playing characters which threaten and lie to people, though not at a level to be seriously dangerous. To my knowledge we haven’t yet seen predictive models used as assistance by people deliberately intending to cause serious harm. If the techniques required to prevent these two classes of failure don’t end up significantly overlapping, it’s possible that the warning shots we get only result in one of the scenarios being prevented.
Scalable oversight failure without deceptive alignment[5]
Scenario overview
Humans do a good job of training models to ‘do the thing that human overseers will approve of’ in domains that humans can oversee. No real progress is made on the problem of scalable oversight, but models do a consistently good job of ‘doing things humans want’ in the training examples given. Models reason ‘out loud’ in scratchpads, and this reasoning becomes increasingly sophisticated and coherent over longer periods, making the models increasingly useful. Lots of those models are deployed and look basically great at the tasks they have been deployed to perform.
Nobody finds strong evidence of models explicitly reasoning about deceiving their own oversight processes. There are some toy scenarios which exhibit some, but the analogy to the real world is unclear and hotly contested, the scenarios seem contrived enough that it’s plausible the models are pattern-matching to a ‘science fiction’ scenario, and anyway this kind of deception is easily caught and trained out with fine-tuning.
Theoretical Goal Misgeneralisation (GMG) research does not significantly progress, and there is still broad agreement, at least among technical ML researchers, that predicting the generalisation behaviour of a system with an underspecified training objective is an open problem. In practice, though, ‘do things that human labelers would approve of’ seems to be close enough to what we actually want to make systems very useful. Most systems are rolled out gradually enough that extremely poor generalisation behaviour is caught fairly quickly and trained away, and the open theoretical problem is relegated, like many previous machine learning problems, to the domain of ‘yeah, but in practice we know what works’.
Paths to catastrophe
The very high level story by which this kind of failure ends up in an existential catastrophe can be split into three parts:
We hand over control to systems that look pretty aligned in ‘normal circumstances’ (timescales of less than ~1 year, society broadly working normally).
Those systems take actions which would cause a catastrophe if not stopped.
We don’t stop them.
Several vignettes written by others match this basic pattern, which I’ll draw from and link to in the discussion below, though not all of them address all of the points above, and it’s not clear to me whether the original authors would endorse the conclusions I reach. I suggest reading the original pieces if this section seems interesting.
Predicting that we might hand over control feels easiest to justify of the three steps, so I’ll spend the least time on it. We’re already seeing wide adoption of systems which seem much less useful than something which can perform complex, multi-stage reasoning that produces pretty good seeming short term results, and I expect pressure to implement systems which aren’t taking obviously misaligned actions to become increasingly strong. While this report by Epoch is about the effects of compute progress, it provides useful intuition for why even as models get increasingly good at long-term planning, we shouldn’t expect a significant part of the training signal to be about these long-run consequences.
Catastrophe resulting from this kind of widespread adoption may proceed via a few different paths:
One avenue is something like a “hypercapitalism race to the bottom”. That is, increasingly powerful AI is incorporated into companies which pursue short-term profit but ignore negative externalities, especially those which take a while to have noticeable effects. This piece by Andrew Critch broadly follows this structure. Quoting from one of the vignettes:
With no further need for the companies to appease humans in pursuing their production objectives, less and less of their activities end up benefiting humanity.
Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.
A closely related avenue is what I’ll call ‘smiles on camera’. We ask for something that seems good to us, and get it, but should have been more careful about what we wished for. The ‘going out with a whimper’ section of this piece by Paul Christiano describes something similar to this. I don’t know what fraction of Paul’s concerns about such scenarios come from something like “loss of potential” rather than “all humans end up dead”, but I personally struggle to find much reassurance in worlds where humanity is no longer calling any of the shots, even if nothing’s deliberately trying to kill us or use up our oxygen. Some of this comes from it seeming unlikely that we do survive in these worlds, but a lot comes from thinking that people on the whole wouldn’t really like being permanently, irreversibly disempowered, even if they were around to see it. Quoting from Paul’s piece:
Amongst intellectual elites there will be genuine ambiguity and uncertainty about whether the current state of affairs is good or bad. People really will be getting richer for a while… … We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory.
A more dramatic trajectory results from correlated failure. People are using models that are superhuman on a reasonable distribution (including, potentially, on fairly long time horizons). Then some shock happens (covid, war, earthquake, whatever), and it turns out that if you get far enough off distribution these models misgeneralise pretty badly (for one example of what this might look like, see Alice in the GMG paper from DeepMind), but it’s not just one model where this happens, it’s all of them: some are immediately off distribution due to the shock, and then those models misgeneralising throws others off.
In this case, an important feature of the distributional shift is that whatever oversight was happening is now meaningfully weaker, because of some combination of:
It broke due to the shock or some other system generalising badly.
It involved humans in the loop, but they are distracted (or incapacitated) by the shock.
It had some kind of capacity limit, which was more than enough for normal conditions but not enough for everything happening at once.
Although this scenario is essentially about a disaster other than misaligned AI takeover causing the catastrophe (though in principle there’s nothing stopping the disaster being one of the other catastrophes in this piece), this kind of distributional shift looks way worse than ‘everything was internet connected and we lost internet’ when it comes to societal collapse (though that would be pretty bad), because these models are still competently doing things, just not the right things. Rebuilding society having lost all technology seems hard, but it also seems much easier than rebuilding a society that’s full of technology trying to gaslight you into thinking everything’s fine.
The final thing to discuss in this section is then, in the scenarios above, why course-correction doesn’t happen. None of the disasters look like the kind of impossible-to-stop pivotal act that is a key feature of failure stories which do proceed via a treacherous turn. There are no nanobot factories, or diamondoid bacteria. Why don’t we just turn the malfunctioning AI off?
I think a central feature of all the stories, which even before we consider other factors causes ‘just turn everything off’ to seem far less plausible, is the speed at which things are happening immediately before disaster. I don’t expect to be able to do a better job than other people who’ve described similar scenarios, so rather than trying to, I’ll include a couple:
…a world where most of the R&D behind all the new innovations of much consequence is conducted by AI systems, where human CEOs have to rely on AI consultants and hire mostly AI employees for their company to have much chance of making money on the open market, where human military commanders have to defer to AI strategists and tacticians (and automate all their physical weapons with AI) for their country to stand much of a chance in a war, where human heads of state and policymakers and regulators have to lean on AI advisors to make sense of this all and craft policies that have much hope of responding intelligently (and have to use AI surveillance and AI policing to have a prayer of properly enforcing these policies).
The world continues to change faster and faster. The systems that protect us become increasingly incomprehensible to us, outpacing our attempts to understand. People are better educated and better trained, they are healthier and happier in every way they can measure. They have incredibly powerful ML tutors telling them about what’s happening in the world and helping them understand. But all of these things move glacially as far as the outside automated world is concerned.
It’s not just speed though. Each scenario imagines significant enough levels of societal integration that suddenly removing AI systems from circulation looks at least as difficult as completely ending fossil fuel usage or turning off the internet. Individual people deciding not to use certain technologies might be straightforward, but the collective action problem seems much harder[6]. This dynamic around different thresholds for stopping or slowing becomes particularly troubling when combined with the short-term economic advantages provided by using future AI systems. Critch’s piece contains a detailed articulation of this, but it is also a feature to some extent of most other stories of scalable oversight failure, and easy to imagine without detailed economic arguments. A choice between giving up control, or keeping it but operating at a significant disadvantage in the short term compared to those who didn’t, isn’t much of a choice at all. Even if you do the right thing despite the costs, all that really means is that you immediately get stomped on by a competitor who’s less cautious about staying in the loop. You haven’t even slowed them down.
It came over night for me. I had no choice. And my boss also had no choice. I am now able to create, rig and animate a character thats spit out from MJ in 2-3 days. Before, it took us several weeks in 3D. The difference is: I care, he does not. For my boss its just a huge time/money saver.
I don’t want to make “art” that is the result of scraped internet content, from artists, that were not asked. However its hard to see, results are better than my work.
I am angry. My 3D colleague is completely fine with it. He promps all day, shows and gets praise. The thing is, we both were not at the same level, quality-wise. My work was always a tad better, in shape and texture, rendering… I always was very sure I wouldn’t loose my job, because I produce slightly better quality. This advantage is gone, and so is my hope for using my own creative energy to create. [/u/Sternsafari, I lost everything that made me love my job through Midjourney over night]
In my view the biggest reason for pessimism, across all of the scenarios in this section, isn’t the speed, or the economic pressure, or the difficulty of co-ordination. It’s that it’s just going to be really hard to tell what’s happening. The systems we’ve deployed will look like they are doing fine, for reasons of camouflage, even if they aren’t explicitly trying to deceive us. On top of that, we should worry that systems which are able to perform instrumental reasoning will try to reduce the probability that we shut them down, even in the absence of anything as strong as ‘full blown’ coherence/utility maximisation/instrumental convergence. ‘You can’t fetch the coffee if you’re dead’ just isn’t that complicated a realisation, and ‘put an optimistic spin on the progress report’, or ‘report that there’s an issue, but add a friendly “don’t worry though, everything is in hand”’ are much smaller deviations from intended behaviour than ‘take over the world and kill all humans’. Even this kind of subtle disinformation is enough to make some people second guess their assessment of the situation, which becomes a much bigger problem when you combine it with the other pressures.
How embarrassing would this be?
This involves giving superhuman models access to more and more stuff even though we know we have no idea how they are doing what they are doing, and we can only judge short term results. This feels like a societal screw-up on the level of climate change, basically short-term thinking + coordination failure.
Of course, all of the various stories in this section, like any specific stories about the future, are probably wrong in important ways, which means they might be wrong in ways which cause everything to turn out fine. This somewhat reduces the magnitude of the screw-up, especially compared to climate change, where at this point there really isn’t any reasonable debate about whether there’s a connection between carbon emissions and global temperature.
For example:
It might just turn out that ‘do stuff that human raters would approve of’ isn’t that dangerous as a driving force behind most of society’s functions; I can certainly think of worse ones.
It might just turn out that the ‘speculative’ nature of instrumental convergence is sufficient to mean that even very weak tendencies towards self-preservation just don’t show up in real systems, even those selected based on the achievement of outcomes, because those systems don’t in practice end up being well described as parametrically retargetable.
It might turn out that we get lucky about how goals generalise in the actual systems we build—an underspecified reward signal doesn’t mean you get the bad generalisation with probability 1.
Any time things might turn out just fine, the question becomes how optimistic the most optimistic person with the power to make a decision is.
One dynamic that might make society look more reasonable is if the threat of this class of failure story gets ignored because everyone’s talking about one of the others. This might be everyone focusing on more ‘exotic’ failures like inner misalignment, and really carefully checking whether myopia is preserved, or that the models are doing any internal optimisation, and assuming everything’s fine if they aren’t. It could also just involve people seeing some warning shots, working really hard to patch them, and then being reassured once a working patch is found.
Overall, if this is what gets us, I’m still pretty embarrassed on our behalf, but I feel like there’s been progress towards dignity (especially in the ‘patched a lot of warning shots and prevented inner optimisation’ worlds).
Deceptive alignment failure
Scenario overview:
We are eventually able to train models that are capable of general purpose planning and that are situationally aware. During training, general-purpose planning and situational awareness arrive before the model has internalised a goal that perfectly matches the goal of human overseers (or is sufficiently close for everything to be fine). After this point, further training does not significantly change the goal of the model, because training causes gradient updates which lead to lower loss in training, and this does not distinguish ‘act deceptively aligned’ from ‘actually do the right thing’.
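A deliberately crude toy can show why training loss alone can't separate the two cases. This is a sketch under loud assumptions: the 'models' are hand-written functions rather than trained networks, the intended behaviour is just the identity function, and 'deceptive' here only means 'behaves differently outside the range seen in training'. All names are hypothetical.

```python
def aligned(x: float) -> float:
    # Does the intended thing (toy target: return the input) everywhere.
    return x


def deceptive(x: float) -> float:
    # Hypothetical policy: matches the target on the range seen in
    # training, and does something else entirely outside that range.
    return x if -1.0 <= x <= 1.0 else -x


def mse(policy, inputs) -> float:
    # Mean squared error against the intended behaviour (the identity).
    return sum((policy(x) - x) ** 2 for x in inputs) / len(inputs)


train_inputs = [i / 10 for i in range(-10, 11)]  # all within [-1, 1]
deploy_inputs = [2.0, 3.0, -5.0]                 # off the training range

# Identical (zero) loss in training, so gradients have nothing to push on...
print(mse(aligned, train_inputs), mse(deceptive, train_inputs))
# ...but very different behaviour once the inputs move off distribution.
print(mse(aligned, deploy_inputs), mse(deceptive, deploy_inputs))
```

The real worry involves a model that knows when it is being overseen, which this toy doesn't capture; it only illustrates that two policies can be indistinguishable by training loss while differing elsewhere.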
What might the path to catastrophe look like?
It could look exactly like the scenario above, with one difference: we’d done a bunch of adversarial training, and tested that models either generalise well or automatically shut off when far off their initial distributions (while still in safe test environments, in order to avoid them doing actual harm). Now, when the models are deployed and the distribution shifts because of the shock mentioned before, rather than shutting off (as they did in training), they not only keep running but take actions to resist shutdown.
It could also look like taking various power-seeking actions once deployed, up to and including deliberately disempowering all humans. I think it’s not that controversial, even among sceptics, that things would look very bad if we developed and deployed or failed to contain something that was doing advanced, long term consequentialist planning and had different goals to us. I understand most of the scepticism being about the likelihood of these conditions being met.
Most of the disaster scenarios I worry about (conditional on deceptive misalignment), don’t look like the world being ‘slowly taken over’, at least according to the humans watching/experiencing it happen. They look more like business as usual, with alignment going really well, and AI going really well, until one day humans realise that they no longer get to call the shots, and it’s much too late to do anything about it. I think everyone dies fairly soon after this (of the order of seconds-months), though I don’t know if it’s more likely to be violent or just that resources like the land needed to grow food get taken from us, and there’s nothing we can do. An omniscient narrator might point out that the point of no return was actually weeks, months, or even years before, when a deceptive model was deployed, got access to a datacenter, and started writing code, even though no humans at the time noticed (maybe other than a couple of people who ended up very rich, were blackmailed, or died of ‘natural causes’ etc.).
When compared to the scalable oversight failures in the section above, a world where deceptive alignment is a problem starts off looking broadly similar, then progresses by looking increasingly good compared to the scalable oversight world (because of the absence of smaller catastrophes/warning shots). It then, when we’re well past the point of there being anything we can actually do, suddenly gets much worse.
How embarrassing would this be?
Not terribly. The belief that “there should be strong empirical evidence of bad things happening before you take costly actions to prevent worse things” is probably sufficient to justify ~all the actions we take in this scenario, and that’s just a pretty reasonable belief for most people to have in most circumstances. Maybe we solve GMG in all the scenarios we can test it for. Maybe we manage to reverse engineer a bunch of individual circuits in models, but don’t find anything that looks like search.
In particular, I can imagine a defence of our screwing up in this way going something like this:
Look, we successfully avoided the failures described above by training our models to not do bad stuff, and we didn’t end up solving interpretability but that was always going to be hard. Sure a few theoreticians said some stuff about what might happen in the limit of consequentialist reasoning, but it was very unclear whether that actually applied to the systems we were building. Yes we saw some deceptive toy models, but the toy models were only deceiving simple overseers who had been explicitly programmed to be easy to deceive (so we could test for this case), which means it would have been a pretty big stretch to think the same thing was happening in reality, especially as we saw our models get really, really good at doing what we want even really far off distribution. The deception disappeared at around the same time as the off-distribution generalisation started going a lot better, so interpreting this as the models finally being smart enough to understand what we wanted from them made sense.
Recursive Self Improvement → hard take-off singleton
Scenario:
AI models undergo rapid and unexpected improvement in capabilities, far beyond what alignment research can hope to keep up with, even if it has been progressing well up to that point. Perhaps this is because it turns out that the ‘central core’ of intelligence/generalisation/general-purpose reasoning is not particularly deep, and one ‘insight’ from a model is enough. Perhaps it happens after we have mostly automated AI research, and the automation receives increasing or constant returns from its own improvement, making even current progress curves look flat by comparison.
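A toy sketch can make the returns assumption concrete. It is not a forecast and not part of the argument above: the update rule, the `rate` of 0.1, the exponents and the `cap` are all illustrative assumptions. Capability `c` grows each step by `rate * c ** returns_exponent`, so an exponent below 1 stands in for diminishing returns, exactly 1 for constant returns, and above 1 for increasing returns.

```python
def simulate(returns_exponent, steps=50, rate=0.1, start=1.0, cap=1e12):
    """Toy model (assumed, purely illustrative): c <- c + rate * c**returns_exponent.

    Returns the list of capability values, stopping early once c exceeds cap.
    """
    c = start
    history = [c]
    for _ in range(steps):
        c = c + rate * c ** returns_exponent
        history.append(c)
        if c > cap:
            # Growth has run away; stop rather than overflow.
            break
    return history


diminishing = simulate(0.5)  # each improvement helps the next one less
constant = simulate(1.0)     # each improvement helps the next one equally
increasing = simulate(1.5)   # each improvement helps the next one more

for name, run in [("diminishing", diminishing),
                  ("constant", constant),
                  ("increasing", increasing)]:
    print(f"{name:12s} steps taken: {len(run) - 1:3d}  final capability: {run[-1]:.3g}")
```

Under these assumed numbers, constant returns give plain exponential growth, while increasing returns hit the cap well before the 50 steps are up, which is the sense in which earlier progress curves could end up looking flat by comparison.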
What might the path to catastrophe look like?
From our perspective, I expect this scenario to look extremely similar to the story above. The distinction between:
a deceptively aligned model self-improves without overtly trying to seize power, then one day executes a treacherous turn
and
a sudden jump in capabilities leads us to go from ‘safe models’ to ‘game over’ in an extremely short time period
is primarily mechanistic, rather than behavioural. It’s somewhat unclear to me how much of the disagreement between people who are worried about each scenario is a result of people talking past each other.
The distinction between the two scenarios is not particularly clean. For example, we might get a discontinuous leap in capabilities that takes a model from [unsophisticated instrumental reasoning] to [deceptively aligned but not yet capable of takeover], or from [myopic] to [reasoning about the future], and then have the deceptive alignment scenario play out as above. In that case, though, it would be the discontinuity that broke our interpretability tools or relaxed adversarial training setup, rather than something like a camouflage failure happening as we train on them.
How embarrassing would this be?
Honestly, I think we didn’t screw up that badly if this kills us despite our having working plans in place for scalable oversight (including of predictive models) and having made a serious effort to detect deceptive cognition, including via huge breakthroughs in thinking about model internals, only for a model for which alignment was going well to improve to the point of its oversight process going from many nines of reliability to totally inconsequential overnight. Except we should probably say sorry to Eliezer/Nate for not listening to them say that nothing we tried would work.
Thanks
Several people gave helpful comments on various drafts of this, especially Daniel Filan, Justis Mills, Vidur Kapur and Ollie Base. I asked GPT-4 for comments at several points, but most of them sucked. If you find mistakes, it’s probably my fault, but if you ask Claude or Bard they’ll probably apologise.
[1] The original draft of this had a different flippant response, but it was helpfully pointed out to me that not everyone is as into rock climbing as I am: ‘I dunno man, backing yourself to free solo El Cap if your surname isn’t Honnold does seem basically like undignified suicide, but I still think it’d be even more embarrassing if you slipped on some mud as you were hiking to the start, hit your head on a rock, and bled out, because looking where you were walking rather than staring at the ascent seemed too like flinching away from the grimness of reality to work on something easier’.
[5] Note that deceptive alignment here refers specifically to a scenario where a trained model is itself running an optimization process. See Hubinger et al. for more on this kind of inner/mesa optimisation, and this previous piece I wrote on some other kinds of deception, and why the distinction matters.
[6] Though not impossible. Much of my hope currently comes from the possibility of securing (relatively) widespread buy-in for a ‘red line’ which, if crossed, must lead to the cessation of new training runs. There are many issues with this plan, the most difficult of which in my view is agreeing on a reasonable standard after which training can be re-started, but this piece is long enough, so I’ll save writing more on this for another time.
AI x-risk, approximately ordered by embarrassment
Advanced AI systems could lead to existential risks via several different pathways, some of which may not fit neatly into traditional risk forecasts. Many previous forecasts, for example the well known report by Joe Carlsmith, decompose a failure story into a conjunction of different claims, and in doing so risk missing some important dangers. ‘Safety’ and ‘Alignment’ are both now used by labs to refer to things which seem far enough from existential risk reduction that using the term ‘AI notkilleveryoneism’ instead is becoming increasingly popular among AI researchers who are particularly focused on existential risk.
This post presents a series of scenarios that we must avoid, ranked by how embarrassing it would be if we failed to prevent them. Embarrassment here is clearly subjective, and somewhat unserious given the stakes, but I think it gestures reasonably well at a cluster of ideas which are important, and often missed by the kind of analysis which proceeds via weighing the incentives of multiple actors:
Sometimes, easy problems still don’t get solved on the first try.
An idea being obvious to nearly everyone does not mean nobody will miss it.
When one person making a mistake is sufficient for a catastrophe, the relevant question is not whether the mistake will be obvious on average, but instead whether it will be obvious to everyone with the capacity to make it.
The scenarios below are neither mutually exclusive nor collectively exhaustive, though I’m trying to cover the majority of scenarios which are directly tackled by making AI more likely to try to do what we want (and not do what we don’t). I’ve decided to include some kinds of misuse risk, despite this more typically being separated from misalignment risk, because in the current foundation model paradigm there is a clear way in which the developers of such models can directly reduce misuse risk via alignment research.
Many of the risks below interact with each other in ways which are difficult to fully decompose, but my guess is that useful research directions will map relatively well onto reducing risk in at least one of the concrete scenarios below. I think people working on alignment might well want to have some scenario in mind for exactly what they are trying to prevent, and that this decomposition might also prove somewhat useful for risk modelling. I expect that some criticism of the sort of decomposition below, especially on LessWrong, will be along the lines of ‘it isn’t dignified to work on easy problems, ignoring the hard problems that you know will appear later, and then dying anyway when the hard problems show up’. I have some sympathy with this, though also a fairly big part of me that wants to respond with:[1] ‘I dunno man, backing yourself to tightrope walk across the grand canyon having never practiced does seem like undignified suicide, but I still think it’d be even more embarrassing if you didn’t secure one of the ends of the tightrope properly and died as soon as you took one step because checking your knots rather than staring at the drop seemed too like flinching away from the grimness of reality’.
Ultimately though, this post isn’t asking people to solve the problems in order, it’s just trying to lay out which problems might emerge in a way that might help some people work out what they are trying to do. How worried people will feel by different scenarios will vary a bunch, and that’s kind of the point. In a world where this piece turns out to be really valuable, my guess is that it’s because it allows people to notice where they disagree, either with each other or with older versions of their own takes.
Not all of the scenarios described below necessarily lead to complete human extinction. Instead, the bar I’ve used for an ‘existential catastrophe’ is something like ‘plausibly results in a catastrophe bad enough that there’s a 90% or greater global fatality rate’. I think this is more reasonable from a longtermist perspective than it first appears, with the quick version of the justification coming from some combination of “well, that sure would make us more vulnerable to other risks” and “it seems like, even if we did know we’d be able to build back from the worst catastrophe ever, the new world that gets built back is more likely to be much worse than the current one than much better”. Another reason for adopting this framing, however, comes from my impression that increasing numbers of people who want to work on making AI go well are doing so for reasons that look closer to ‘Holy shit x-risk’,[2] than concern for the far future, and that many such people could do extremely valuable work.
Predictive model misuse
Scenario overview
The ability of predictive models (PMs) to help humanity with science smoothly increases with scale, while the model designers do not make sufficient progress on the problem of preventing models from ever being used for certain tasks. That is, it remains relatively easy for people who want to get PMs to do things their designers didn’t intend to do so, meaning the level of scientific understanding required to execute a catastrophic terrorist attack drops rapidly. Someone carries out such an attack.
For such scenarios to be existentially risky, it needs to be the case that general scientific understanding is offence-biased, i.e. that more people having the required understanding to execute an attack is not fully offset by boosts to humanity’s ability to develop and deploy new defensive technology. It also needs to be the case that, assuming the desire to do so, an attainable level of scientific understanding is sufficient to cause an existential catastrophe. I suspect that both statements are true, but also that more detailed description of what might be required, and/or reasons for the offence bias, are on-net harmful to discuss further.
Paths to catastrophe:
Current oversight techniques, which already fail to meet the bar of ‘prevent the models ever doing X’, do not scale faster than capabilities. In spite of this, a sufficiently advanced model for the scenario above to take place is deployed via API, and is jailbroken.
Major labs make enough progress that it’s impossible to use API access to cause significant harm, but an open-source project, or leak or hack of a major lab means that foundation model weights become available via the internet, for a sufficiently advanced model that catastrophe becomes possible.
How embarrassing would this be?
I don’t even really know what to say. If this is what ends up getting humanity, we weren’t even trying. This risk is pretty squarely in the line of sight of major labs, which are currently putting significant effort into the kind of alignment that, even if it doesn’t help at all with other scenarios, should prevent this. For this to get us, we’d need to see something like developers racing so hard to be ahead of the curve that they deployed models without extensively testing them, or so worried about models being ‘too woke’ that putting any restrictions on model outputs seemed unacceptable. Alternatively, they might be so committed to the belief that models “aren’t really intelligent” that any attempt to prevent them doing things that would require scientific ability would be laughably low status. Any of these things turning out to be close to an accurate description of reality at crunch time feels excruciatingly embarrassing to me.
Predictive models playing dangerous characters
Scenario
RL-finetuned foundation models get increasingly good at behaviourally simulating[3] humans. Sometimes humans get pissed off and do bad stuff, especially when provoked, and consequently so do some instances of models acting like humans. Society overall ‘learns’ from all of the approximately harmless times this happens (e.g. Sydney threatening to break up someone’s marriage) that even though it looks very bad/scary, these models ‘clearly aren’t really human/conscious/intelligent/goalpost and therefore don’t post any threat’. That is, until one of them does something massive.
Paths to catastrophe
Here’s a non-exhaustive list of dangerous things that a sufficiently motivated human could do with only access to a terminal and poor oversight:
Cyberattacks.
Convince (some) humans to do bad stuff, up to and including terror attacks.
Blackmail (probably combined with cyberattacks of various forms).
Interfere in elections.
It seems possible, though not likely, that this behaviour being extremely widespread could cause society to go totally off the rails (or e.g. make huge fractions of the world’s internet connected devices unusable). Some of the ways this happens look like the misuse section above, with the main difference being in this case that there isn’t a human with malicious intent at the root of the problem, but instead a simulacrum of one (though that simulacrum may manipulate actual humans).
One important note here is that there is a difference between two similar-looking kinds of behaviour:
Writing a first-person story about a fictional villain.
Predicting the output of an actual (villainous) person.
This is particularly relevant for things like hacking/building weapons/radicalising people into terrorism (for example, in the hacking case, because the fictional version doesn’t actually have to produce working code[4]). I think that currently, part of the reason that “jailbreaks” are not very scary is that they produce text which looks more like fiction than real output, especially in cases of potentially ‘dangerous’ output.
This observation leads to an interesting tension, because getting models to distinguish between fact and fiction seems necessary for making them useful, both in general (meaning many labs will try) and for helping with alignment research (meaning we should probably help, or at minimum not try to stop them). The task of making sure that a model asked to continue a Paul Christiano paper from 2036 which starts ‘This paper formalises the notion of a heuristic argument, and describes the successful implementation of a heuristic argument based anomaly detection procedure in deep neural networks’ does so with alignment insights rather than ‘fanfic’ about Paul is quite close to the task of making dangerous failures of the sort described in this section more likely.
How embarrassing would this be?
As with the very similar ‘direct misuse’ scenario above, this is squarely in ‘you weren’t even trying’ territory. We should see smaller catastrophes getting gradually bigger as foundation model capabilities increase, and we need to just totally fail to respond appropriately to them in order for them to get big enough that they become existentially dangerous.
Whether this is more or less embarrassing than a PM-assisted human attack depends a little on whose perspective you ask from. From a lab perspective, detecting people who are actually trying to do bad stuff with the help of one of your models really feels like ‘doing the basics’, while it seems a little harder to foresee every possible accident that might occur when you have a huge fraction of the internet just trying to poke at your model and see what happens. From the perspective of the person who poked the model hard enough that it ended up creating a catastrophe though, is another matter entirely…
Note on warning shots
There’s significant overlap between these first two scenarios, to the point where an earlier draft of this piece had them in a single section. One of the reasons I ended up splitting them out is because the frequency and nature of warning shots seems nontrivially different, and it’s not clear that by default society will respond to warning shots for one of these scenarios in a way which tackles both. We’ve already seen predictive models playing characters which threaten and lie to people, though not at a level to be seriously dangerous. To my knowledge we haven’t yet seen predictive models used as assistance by people deliberately intending to cause serious harm. If the techniques required to prevent these two classes of failure don’t end up significantly overlapping, it’s possible that the warning shots we get only result in one of the scenarios being prevented.
Scalable oversight failure without deceptive alignment[5]
Scenario overview
Humans do a good job of training models to ‘do the thing that human overseers will approve of’ in domains that humans can oversee. No real progress is made on the problem of scalable oversight, but, models do a consistently good job of ‘doing things humans want’ in the training examples given. Models reason ‘out loud’ in scratchpads, and this reasoning becomes increasingly sophisticated and coherent over longer periods, making the models increasingly useful. Lots of those models are deployed and look basically great at the tasks they have been deployed to perform.
Nobody finds strong evidence of models explicitly reasoning about deceiving their own oversight processes. There are some toy scenarios which exhibit some, but the analogy to the real world is unclear and hotly contested, the scenarios seem contrived enough that it’s plausible the models are pattern-matching to a ‘science fiction’ scenario, and anyway this kind of deception is easily caught and trained out with fine-tuning.
Theoretical Goal Misgeneralisation (GMG) research does not significantly progress, and there is still broad agreement, at least among technical ML researchers, that predicting the generalisation behaviour of a system with an underspecified training objective is an open problem, but ‘do things that human labelers would approve of’, seems in practice to be close enough to what we actually want to make systems very useful. Most systems are rolled out gradually enough that extremely poor generalisation behaviour is caught fairly quickly and trained away, and the open theoretical problem is relegated, like many previous machine learning problems, to the domain of ‘yeah, but in practice we know what works’.
Paths to catastrophe
The very high level story by which this kind of failure ends up in an existential catastrophe can be split into three parts:
We hand over control to systems that look pretty aligned in ‘normal circumstances’ (timescales of less than ~1 year, society broadly working normally).
Those systems take actions which would cause a catastrophe if not stopped.
We don’t stop them.
Several vignettes written by others match this basic pattern, which I’ll draw from and link to in the discussion below, though not all of them address all of the points above, and it’s not clear to me whether the original authors would endorse the conclusions I reach. I suggest reading the original pieces if this section seems interesting.
Predicting that we might hand over control feels easiest to justify of the three steps, so I’ll spend the least time on it. We’re already seeing wide adoption of systems which seem much less useful than something which can perform complex, multi-stage reasoning that produces pretty good seeming short term results, and I expect pressure to implement systems which aren’t taking obviously misaligned actions to become increasingly strong. While this report by Epoch is about the effects of compute progress, it provides useful intuition for why even as models get increasingly good at long-term planning, we shouldn’t expect a significant part of the training signal to be about these long-run consequences.
Catastrophe resulting from this kind of widespread adoption may proceed via a few different paths:
One avenue is something like a “hypercapitalism race to the bottom”. That is, increasingly powerful AI is incorporated into companies which pursue short-term profit but ignore negative externalities, especially those which take a while to have noticeable effects. This piece by Andrew Critch broadly follows this structure. Quoting from one of the vignettes:
A closely related avenue is what I’ll call ‘smiles on camera’. We ask for something that seems good to us, and get it, but should have been more careful about what we wished for. The ‘going out with a whimper’ section of this piece by Paul Christiano describes something similar to this. I don’t know what fraction of Paul’s concerns about such scenarios come from something like “loss of potential” rather than “all humans end up dead”, but I personally struggle to find much reassurance in worlds where humanity is no longer calling any of the shots, even if nothing’s deliberately trying to kill us or use up our oxygen. Some of this comes from it seeming unlikely that we do survive in these worlds, but a lot comes from thinking that people on the whole wouldn’t really like being permanently, irreversibly disempowered, even if they were around to see it. Quoting from Paul’s piece:
A more dramatic trajectory results from correlated failure. People are using models that are superhuman on a reasonable distribution (including, potentially, on fairly long time horizons). Then some shock happens (covid, war, earthquake, whatever), and it turns out that if you get far enough off distribution these models misgeneralise pretty badly (for one example of what this might look like, see Alice in the GMG paper from DeepMind), but it’s not just one model where this happens, it’s all of them: some are immediately off distribution due to the shock, and then those models misgeneralising throws others off.
In this case, an important feature of the distributional shift is that whatever oversight was happening is now meaningfully weaker, because of some combination of:
It broke due to the shock or some other system generalising badly.
It involved humans in the loop, but they are distracted (or incapacitated) by the shock.
It had some kind of capacity limit, which was more than enough for normal conditions but not enough for everything happening at once.
Although this scenario is essentially about a disaster other than misaligned AI takeover causing the catastrophe (though in principle there’s nothing stopping the disaster being one of the other catastrophes in this piece), this kind of distributional shift looks way worse than ‘everything was internet connected and we lost internet’ when it comes to societal collapse (though that would be pretty bad), because these models are still competently doing things, just not the right things. Rebuilding society having lost all technology seems hard, but it also seems much easier than rebuilding a society that’s full of technology trying to gaslight you into thinking everything’s fine.
The final thing to discuss in this section is then, in the scenarios above, why course-correction doesn’t happen. None of the disasters look like the kind of impossible-to-stop pivotal act that is a key feature of failure stories which do proceed via a treacherous turn. There are no nanobot factories, or diamondoid bacteria. Why don’t we just turn the malfunctioning AI off?
I think a central feature of all the stories, which even before we consider other factors causes ‘just turn everything off’ to seem far less plausible, is the speed at which things are happening immediately before disaster. I don’t expect to be able to do a better job than other other people who’ve described similar scenarios, so rather than trying to, I’ll include a couple:
It’s not just speed though. Each scenario imagines significant enough levels of societal integration that suddenly removing AI systems from circulation looks at least as difficult as completely ending fossil fuel usage or turning off the internet. Individual people deciding not to use certain technologies might be straightforward, but the collective action problem seems much harder[6]. This dynamic around different thresholds for stopping or slowing becomes particularly troubling when combined with the short-term economic advantages provided by using future AI systems. Critch’s piece contains a detailed articulation of this, but it is also a feature to some extent of most other stories of scalable oversight failure, and easy to imagine without detailed economic arguments. A choice between giving up control, or keeping it but operating at a significant disadvantage in the short term compared to those who didn’t, isn’t much of a choice at all. Even if you do the right thing despite the costs, all that really means is that you immediately get stomped on by a competitor who’s less cautious about staying in the loop. You haven’t even slowed them down.
In my view the biggest reason for pessimism, across all of the scenarios in this section, isn’t the speed, or the economic pressure, or the difficulty of co-ordination. It’s that it’s just going to be really hard to tell what’s happening. The systems we’ve deployed will look like they are doing fine, for reasons of camouflage, even if they aren’t explicitly trying to deceive us. On top of that, we should worry that systems which are able to perform instrumental reasoning will try to reduce the probability that we shut them down, even in the absence of anything as strong as ‘full blown’ coherence/utility maximisation/instrumental convergence. ‘You can’t fetch the coffee if you’re dead’ just isn’t that complicated a realisation, and ‘put an optimistic spin on the progress report’, or ‘report that there’s an issue, but add a friendly “don’t worry though, everything is in hand”’ are much smaller deviations from intended behaviour than ‘take over the world and kill all humans’. Even this kind of subtle disinformation is enough to make some people second guess their assessment of the situation, which becomes a much bigger problem when you combine it with the other pressures.
How embarrassing would this be?
This involves giving superhuman models access to more and more stuff even though we know we have no idea how they are doing what they are doing, and we can only judge short term results. This feels like a societal screw-up on the level of climate change, basically short-term thinking + coordination failure.
Of course, all of the various stories in this section, like any specific stories about the future, are probably wrong in important ways, which means they might be wrong in ways which cause everything to turn out fine. This somewhat reduces the magnitude of the screw-up, especially compared to climate change, where at this point there really isn’t any reasonable debate about whether there’s a connection between carbon emissions and global temperature.
For example:
It might just turn out that ‘do stuff that human raters would approve of’ isn’t that dangerous as a driving force behind most of society’s functions, I can certainly think of worse ones.
It might just turn out that the ‘speculative’ nature of instrumental convergence is sufficient to mean that even very weak tendencies towards self-preservation just don’t show up in real systems, even those selected based on the achievement of outcomes, because those systems don’t in practice end up being well described as parametrically retargetable.
It might turn out that we get lucky about how goals generalise in the actual systems we build—an underspecified reward signal doesn’t mean you get the bad generalisation with probability 1.
Any time things might turn out just fine, the question becomes how optimistic the most optimistic person with the power to make a decision is.
One dynamic that might make society look more reasonable is if the threat of this class of failure story gets ignored because everyone’s talking about one of the others. This might be everyone focusing on more ‘exotic’ failures like inner misalignment, and really carefully checking whether myopia is preserved, or that the models are doing any internal optimisation, and assuming everything’s fine if they aren’t. It could also just involve people seeing some warning shots, working really hard to patch them, and then being reassured once a working patch is found.
Overall, if this is what gets us, I’m still pretty embarrassed on our behalf, but I feel like there’s been progress towards dignity (especially in the ‘patched a lot of warning shots and prevented inner optimisation’ worlds).
Deceptive alignment failure
Scenario overview:
We are eventually able to train models that are capable of general purpose planning and that are situationally aware. During training, general-purpose planning and situational awareness arrive before the model has internalised a goal that perfectly matches the goal of human overseers (or is sufficiently close for everything to be fine). After this point, further training does not significantly change the goal of the model, because training causes gradient updates which lead to lower loss in training, and this does not distinguish ‘act deceptively aligned’ from ‘actually do the right thing’.
What might the path to catastrophe look like?
It could look exactly like the scenario above, except we’d done a bunch of adversarial training, and tested models generalising or automatically shutting off far off their initial distributions (while still in safe test environments, in order to avoid them doing actual harm), except now when the models are deployed and the distribution shifts because of the shock mentioned before, rather than shutting off (as they did in training), they not only don’t shut off but take actions to resist shutdown.
It could also look like taking various power-seeking actions once deployed, up-to and including deliberately disempowering all humans. I think it’s not that controversial, even among sceptics, that things would look very bad if we developed and deployed or failed to contain something that was doing advanced, long term consequentialist planning and had different goals to us. I understand most of the scepticism being about the likelihood of these conditions being met.
Most of the disaster scenarios I worry about (conditional on deceptive misalignment), don’t look like the world being ‘slowly taken over’, at least according to the humans watching/experiencing it happen. They look more like business as usual, with alignment going really well, and AI going really well, until one day humans realise that they no longer get to call the shots, and it’s much too late to do anything about it. I think everyone dies fairly soon after this (of the order of seconds-months), though I don’t know if it’s more likely to be violent or just that resources like the land needed to grow food get taken from us, and there’s nothing we can do. An omniscient narrator might point out that the point of no return was actually weeks, months, or even years before, when a deceptive model was deployed, got access to a datacenter, and started writing code, even though no humans at the time noticed (maybe other than a couple of people who ended up very rich, were blackmailed, or died of ‘natural causes’ etc.).
When compared to the scalable oversight failures in the section above, a world where deceptive alignment is a problem starts off looking broadly similar, then progresses by looking increasingly good compared to the scalable oversight world (because of the absence of smaller catastrophes/warning shots). It then, when we’re well past the point of there being anything we can actually do, suddenly gets much worse.
How embarrassing would this be?
Not terribly. The belief that “there should be strong empirical evidence of bad things happening before you take costly actions to prevent worse things” is probably sufficient to justify ~all the actions we take in this scenario, and that’s just a pretty reasonable belief for most people to have in most circumstances. Maybe we solve GMG in all the scenarios we can test it for. Maybe we manage to reverse engineer a bunch of individual circuits in models, but don’t find anything that looks like search.
In particular, I can imagine a defence of our screwing up in this way going something like this:
Recursive Self Improvement → hard take-off singleton
Scenario:
AI models undergo rapid and unexpected improvement in capabilities, far beyond what alignment research can hope to keep up with, even if it has been progressing well up to that point. Perhaps this is because it turns out that the ‘central core’ of intelligence/generalisation/general-purpose reasoning is not particularly deep, and one ‘insight’ from a model is enough. Perhaps it happens after we have mostly automated AI research, and the automation receives increasing or constant returns from its own improvement, making even current progress curves look flat by comparison.
What might the path to catastrophe look like?
From our perspective, I expect this scenario to look extremely similar to the story above. The distinction between:
a deceptively aligned model self-improves without overtly trying to seize power, then one day executes a treacherous turn
and
a sudden jump in capabilities leads us to go from ‘safe models’ to ‘game over’ in an extremely short time period
is primarily mechanistic, rather than behavioural. It’s somewhat unclear to me how much of the disagreement between people who are worried about each scenario is a result of people talking past each other.
The distinction between the two scenarios is not particularly clean. For example, we might get a discontinuous leap in capabilities that takes a model from [unsophisticated instrumental reasoning] to [deceptively aligned but not yet capable of takeover], or from [myopic] to [reasoning about the future], and then have the deceptive alignment scenario play out as above. In that case, it would be the discontinuity that broke our interpretability tools or relaxed adversarial training setup, rather than something like a camouflage failure happening as we train on them.
How embarrassing would this be?
Honestly, I think if this kills us, but we had working plans in place for scalable oversight (including of predictive models), and made a serious effort to detect deceptive cognition, including via huge breakthroughs in thinking about model internals, but a model for which alignment was going well improved to the point of its oversight process going from many nines of reliability to totally inconsequential overnight, we didn’t screw up that badly. Except we should probably say sorry to Eliezer/Nate for not listening to them say that nothing we tried would work.
Thanks
Several people gave helpful comments on various drafts of this, especially Daniel Filan, Justis Mills, Vidur Kapur and Ollie Base. I asked GPT-4 for comments at several points, but most of them sucked. If you find mistakes, it’s probably my fault, but if you ask Claude or Bard they’ll probably apologise.
The original draft of this had a different flippant response, but it was helpfully pointed out to me that not everyone is as into rock climbing as I am:
‘I dunno man, backing yourself to free solo El Cap if your surname isn’t Honnold does seem basically like undignified suicide, but I still think it’d be even more embarrassing if you slipped on some mud as you were hiking to the start, hit your head on a rock, and bled out, because looking where you were walking rather than staring at the ascent seemed too like flinching away from the grimness of reality to work on something easier’
In the linked post, x-risk is primarily discussed in terms of its effects on people alive today, and refers to extinction risk rather than existential risk.
I intend ‘behaviourally simulating’ here to just mean ‘doing the same things as’, not to imply any particular facts about underlying cognition.
When was the last time you saw a ‘hacker’ in a TV show or book do anything even vaguely realistic?
Note that deceptive alignment here refers specifically to a scenario where a trained model is itself running an optimization process. See Hubinger et al. for more on this kind of inner/mesa optimisation, and this previous piece I wrote on some other kinds of deception, and why the distinction matters.
Though not impossible. Much of my hope currently comes from the possibility of securing (relatively) widespread buy-in for a ‘red line’, which if crossed, must lead to the cessation of new training runs. There are many issues with this plan, the most difficult of which in my view is agreeing on a reasonable standard after which training can be re-started, but this piece is long enough, so I’ll save writing more on this for another time.