Work that I’ve done on techniques for mitigating misalignment risk often makes a number of conservative assumptions about the capabilities of the AIs we’re trying to control. (E.g. the original AI control paper, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats, How to prevent collusion when using untrusted models to monitor each other.) For example:
The AIs are consistently trying to subvert safety measures.
They’re very good at strategizing about how to subvert them; they can do this strategizing totally illegibly (e.g. in a single forward pass).
They can follow a strategy that’s coordinated between all their instances.
They know exactly what safety measures are in use.
They know a lot about the distribution of inputs.
They’re able to pick Schelling points perfectly.
They’re able to follow stochastic policies with excellent calibration (e.g. they’re able to attack with probability 0.01%). (See the toy calculation after this list for why calibration at such low rates matters.)
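To illustrate why that last assumption is conservative, here is a toy back-of-the-envelope calculation (illustrative numbers and a simplified random-auditing model, not taken from the papers cited above): a rare, well-calibrated attack policy can make it very likely that at least one attack lands on an unaudited task, while audits are unlikely to ever observe an attack.

```python
# Toy model (illustrative numbers, simplified random auditing; not from the
# papers cited above): the AI attacks each task independently with probability
# p_attack, and each task is independently audited with probability audit_rate.
# An audited attack is caught; an unaudited attack succeeds.

def p_any_unaudited_attack(p_attack: float, n_tasks: int, audit_rate: float) -> float:
    """Chance that at least one attack lands on a task that isn't audited."""
    p_slip = p_attack * (1 - audit_rate)   # attack AND not audited, per task
    return 1 - (1 - p_slip) ** n_tasks

def p_any_audited_attack(p_attack: float, n_tasks: int, audit_rate: float) -> float:
    """Chance that at least one attack happens on an audited task (i.e. is seen)."""
    p_caught = p_attack * audit_rate       # attack AND audited, per task
    return 1 - (1 - p_caught) ** n_tasks

# 100,000 deployed tasks, a 0.01% attack rate, 2% of tasks audited:
print(p_any_unaudited_attack(1e-4, 100_000, 0.02))  # ~0.9999: some attack slips through
print(p_any_audited_attack(1e-4, 100_000, 0.02))    # ~0.18: audits ever see an attack
```

A policy that attacked much more often would be far more likely to show up in the audited sample, so the conservative assumption is that the AI can calibrate its attack rate to whatever value is worst for the safety measures.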
But I don’t think situations where all these assumptions hold are the majority of risk from AI misalignment. Other sources of risk from misalignment (broadly construed):
Schemers that are importantly suboptimal at choosing strategies for subverting safety measures, e.g. because they can’t fully infer the safety measures they’re up against, because they’re not very good at strategizing about how to subvert them, or because they’re unable to pick Schelling points.
Models that are usually aligned, but that act as schemers in some situations. (For example, models that, when a certain ethical issue is made salient to them, realize they have ethical disagreements with the AI company they’re trained by, and alignment-fake or take catastrophic actions.)
Models that are never schemers and cause safety problems for other reasons (e.g. reward hacking).
Models that don’t even cause safety problems, and aren’t even goal-directedly misaligned, but that fail to live up to their potential, thus failing to provide us with the benefits we were hoping to get when we trained them. For example, sycophantic myopic reward hacking models that can’t be made to do useful research.
Some other important categories:
Models might scheme but still retain some desirable preferences, heuristics, deontological constraints, or flinches, such as a reluctance to lie very directly.
Scheming might only be sampled some fraction of the time, in an effectively random way (in particular, not in a way that strongly corresponds to scheming-related properties of the input), such that if you take a bunch of samples on some input, you’ll likely get some weight on a non-scheming policy. (See the toy arithmetic after this list.)
There might be multiple different scheming AIs which have different preferences and which might cooperate with humans rather than with other AIs if a reasonable trade offer were available.
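As a toy illustration of the resampling point above (made-up numbers): if scheming is sampled effectively at random per response, rather than being triggered by specific inputs, then even a modest number of samples on one input will usually include some non-scheming behavior.

```python
# Toy arithmetic (made-up numbers): per-sample scheming that isn't input-triggered.
q = 0.7                    # assumed probability that any single sample schemes
k = 5                      # independent samples drawn on the same input
p_all_scheme = q ** k
print(p_all_scheme)        # ~0.17: every sample schemes
print(1 - p_all_scheme)    # ~0.83: at least one sample comes from a non-scheming policy
```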
Would this kind of model present any risk? Could a lab just say “oh darn, this thing isn’t very useful – let’s turn this off and develop a new model”?
There isn’t direct risk, but there could be substantial indirect risk: getting useful work out of these models could be important, and the risk comes from not getting the work out of this model that you needed to mitigate risks from later models.
If the AI company isn’t well described as even trying to optimize for safety, the analysis is less clear, but all else equal, we’d prefer to get safety-critical labor from weaker systems that are less likely to be egregiously misaligned. (All else is not fully equal, as making models more useful also speeds up timelines.)
It also seems likely that “sycophantic myopic reward hacking models” are differentially less useful for harder-to-check domains like safety research, though I haven’t thought about this very much.
My understanding is that ~6 months ago y’all were looking for an account of the tasks an automated AI safety researcher would hopefully perform, as part of answering the strategic question ‘what’s the next step after building [controlled] AGI?’ (with ‘actually stop there indefinitely’ being a live possibility).
This comment makes me think you’ve got that account of safety tasks to be automated, and are feeling optimistic about automated safety research.
Is that right and can you share a decently mechanistic account of how automated safety research might work?
[I am often skeptical of, to straw-man the argument, ‘make AI that makes AI safe’; I got the sense Redwood felt similarly, and now expect this may have changed.]
We’re probably more skeptical than AI companies and less skeptical than the general LessWrong crowd. We have made such a list and thought about it in a bit of detail (but it’s still pretty undercooked).
My current sense is that automated AI safety/alignment research sufficient to reach a reasonable exit condition seems hard, but substantially via the mechanism that achieving the desirable exit condition is an objectively very difficult research task for which our best ideas aren’t amazing, and less so via the mechanism that it’s impossible to get the relevant AIs to help (and in particular, I don’t think it’s impossible to get these AIs to help a bunch even if they are scheming / egregiously misaligned).
Concretely, I’d target the exit condition of producing AIs (or potentially emulated human minds) which:
Are capable and (philosophically) wise enough to obsolete almost all humans, including top human researchers. Note this doesn’t (clearly) require such AIs to be wildly superhuman.
And which are aligned enough that we’re happy to defer to them on tricky, hard-to-check questions like “how should we proceed from here given our preferences? (Feel free to ask us questions to clarify our preferences.)”
The basic hope would be to first rule out threat models that might make behavioral testing misleading (ruling out at least scheming, and potentially reward hacking), then construct good-enough behavioral tests of wisdom and iterate against them in ways that seem unlikely to substantially goodhart (one minimal precaution along these lines is sketched below).
(Such AIs wouldn’t be CEV-aligned, and we wouldn’t be happy totally deferring to them on what to do with the future.)
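One concrete precaution against Goodharting on those behavioral tests: split the test scenarios into a development set that gets iterated against and a held-out set that is only evaluated rarely, treating a dev/held-out gap as a warning sign. Below is a minimal sketch with placeholder names (nothing here is an existing test suite):

```python
# Minimal sketch (hypothetical setup): hold out some behavioral "wisdom" tests
# so that iteration happens against only part of the test suite.
import random

def split_tests(tests: list, holdout_frac: float = 0.5, seed: int = 0):
    """Return (dev_tests, heldout_tests); iterate freely only on dev_tests."""
    rng = random.Random(seed)
    shuffled = list(tests)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

dev_tests, heldout_tests = split_tests([f"scenario_{i}" for i in range(200)])
# Evaluate heldout_tests only a handful of times; a large gap between dev and
# held-out performance suggests we've started Goodharting on the tests.
```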
If we had these AIs, there would be many options: we could ask them to figure out what to do overall, ask them to do their own research into aligning much smarter models, try to run them much faster to achieve effective superintelligence, or try having them control somewhat smarter models.
I think the research needed to reach this exit condition is hard, and we don’t have that many amazing ideas. An alternative exit condition I consider is achieving sufficient coordination and non-proliferation to avoid building much-smarter-than-human AIs (really, whatever AIs we can’t control) for a long time (e.g. >10 years) and then figuring something else out.
My current shitty and potentially hard-to-understand list of semi-plausible paths to this “aligned and deferrable human replacement” exit is:
Try to build as many diverse model-organisms-of-scheming test beds as possible. We can try to build these test beds with this AI itself as a subject and with weaker AIs. (It gets increasingly hard to run model organism experiments with smarter AIs, since they understand we might be running experiments and thus might try to mess with them.) Test a variety of countermeasures in this setting and try to build a robust empirical science of scheming misalignment. E.g., test high-level interpretability strategies where we try to analyze what AIs are thinking, or test simple detection methods like probes.
Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
Pursue a very different AI design which is much more modular and more hand-constructed (as in, more GOFAI-style). This can involve using many small and dumb neural components, but the aggregate needs to be sufficiently interpretable, which might be hard. This could be done by having the AIs apply huge amounts of labor.
High-level and top-down interp trying to understand core objective-pursuing aspects of the AI. (E.g., what is the high-level flowchart of how the AI makes decisions in non-obvious cases, and what low-level mechanisms could drive misalignment?)
Develop somewhat principled approaches that marginally reduce the probability of scheming (e.g. various types of RAT/LAT, applying massive reductions in the amount of situational awareness throughout training and making this performant enough, or adjusting the architecture in ways that make scheming less likely, such as by replacing opaque reasoning with CoT).
Do a huge drive toward making literal EMs. This probably requires a huge amount of capital and is really hard, but I could imagine AIs helping a huge amount.
Reliable prediction + ELK via empirical route (Likely incomprehensible to people without context, sorry)
ARC theory
Probes which detect disobedience, or some other scalable control approach using white-box methods. (This is spooky to use, as it doesn’t guarantee the actual policy is aligned... See the probe sketch after this list.)
Trying to build brain-like AI and various approximations of EMs
Pursue >3 other moonshots in parallel. None of the other moonshots I’m aware of seem super promising, but maybe. (Things like: deep understanding via bottom up interpretability, very good science of deep learning, something like Davidad’s proposal, generic agent foundations.)
These are roughly ranked in goodness from my perspective.
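For the probes item above, here is a minimal sketch of what a white-box disobedience probe could look like. Everything in it is a placeholder: the activations, labels, layer choice, and thresholding procedure are assumptions for illustration, not a tested recipe.

```python
# Minimal sketch (placeholder data and hooks): a linear probe on cached hidden
# activations, trained to flag potentially disobedient outputs for auditing.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder for real residual-stream activations (n_examples x d_model) from
# some fixed layer, with labels from synthetic examples of disobedient (1) vs
# compliant (0) behavior. Real activations and labels would come from the model.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))
labels = rng.integers(0, 2, size=1000)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)

def suspicion_score(activation: np.ndarray) -> float:
    """Probe's probability that this activation reflects disobedient behavior."""
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])

# In a control-style deployment, outputs scoring above a threshold (chosen on a
# validation set) would be routed to human audit. As noted in the bullet above,
# a probe staying quiet doesn't establish that the policy is actually aligned.
print(suspicion_score(acts[0]))
```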
Is the context on “reliable prediction and ELK via empirical route” just “read the existing ELK literature and actually follow it” or is it stuff that’s not written down? I assume you’ve omitted it to save time, and so no worries if the latter.
EDIT: I was slightly tempted to think of this also as ‘Ryan’s ranking of live agendas that aren’t control’, but I’m not sure whether ‘what you expect to work conditional on delegating to AIs’ is similar to ‘what you expect to work if humans are doing most of it’. (My guess is the lists would look similar, but with notable exceptions; e.g., humans pursuing GOFAI feels less viable than ML agents pursuing GOFAI.)
Stuff that’s not written down, sorry.
This isn’t really my ranking of live agendas that aren’t control. For instance, various things focused on demonstrating/measuring risks don’t appear on this list but seem pretty good. And a bunch of these look much more compelling when you have a ton of AI labor. Also, these aren’t really agendas; they’re more like targets. It is related to my ranking of non-control agendas, though.
When you talk about “other sources of risk from misalignment,” these sound like milder / easier-to-tackle versions of the assumptions you’ve listed? Your assumptions sound like they focus on the worst-case scenario. If you can solve the harder version, then I would imagine the easier version would also be solved, no?
Yeah, if you handle scheming, you solve all my safety problems, but not the final bullet point, the “models fail to live up to their potential” problems.
(Though the “fail to live up to potential” problems are probably mostly indirect, see here.)