There isn’t direct risk, but there could be substantial indirect risks because getting useful work out of these models could be important. The risk comes from not getting the work from this model that you needed to mitigate risks from later models.
If the AI company isn’t well described as even trying to optimize for safety the analysis is less clear, but all else equal, we’d prefer to get safety critical labor from weaker systems that are less likely to be egregiously misaligned. (All else is not fully equal, as making models more useful also speeds up timelines.)
It also seems likely that “sycophantic myopic reward hacking models” are differentially less useful for harder to check domains like safety research, though I haven’t thought about this very much.
My understanding is that ~6 months ago y’all were looking for an account of the tasks an automated AI safety researcher would hopefully perform, as part of answering the strategic question ‘what’s the next step after building [controlled] AGI?’ (with ‘actually stop there indefinitely’ being a live possibility)
This comment makes me think you’ve got that account of safety tasks to be automated, and are feeling optimistic about automated safety research.
Is that right and can you share a decently mechanistic account of how automated safety research might work?
[I am often skeptical of, to straw man the argument, ‘make ai that makes ai safe’, got the sense Redwood felt similarly, and now expect this may have changed.]
We’re probably more skeptical than AI companies and less skeptical than the general lesswrong crowd. We have made such a list and thought about it in a a bit of detail (but still pretty undercooked).
My current sense is that automated AI safety/alignment research to reach a reasonable exit condition seems hard, but substantially via the mechanism that achieving the desirable exit condition is an objectively very difficult research tasks for which our best ideas aren’t amazing, and less so via the mechanism that its impossible to get the relevant AIs to help (including that I don’t think it is impossible to get these AIs to help a bunch if they are scheming / egregiously misaligned).
Concretely, I’d target the exit condition of producing AIs (or potentially emulated human minds) which:
Are capable and (philosophically) wise enough to be capable of obsoleting almost all humans including top human researchers. Note this doesn’t (clearly) require such AIs to be wildly superhuman.
And which are aligned enough that we’re happy to defer to them on tricky hard to check questions like “how should we proceed from here given our preferences (feel free to ask us questions to clarify our preferences)”
The basic hope would be to first rule out threat models that might make behavioral testing misleading (ruling out at least scheming, potentially reward hacking) and then constructing good enough behavioral tests of wisdom and iterating against these in ways that seem unlikely to substantially goodhart.
(Such AIs wouldn’t be CEV aligned and we wouldn’t be happy totally defering to them on what to do with the future.)
If we had these AIs, there are many options. We could ask them to figure out what to do overall, ask them to do their own research into aligning much smarter models, we could try to run them much faster to achieve effective superintelligence, and we could try having them control somewhat smarter models.
I think the research needed to reach this exit condition is hard and we don’t have that many amazing ideas. An alterative exit condition I consider is achieving sufficient coordination and non-proliferation to avoid building much-smarter-than-human AIs (really, whatever AIs we can’t control) for a long time (e.g. >10 years) and then figuring something else out.
My current shitty and potentially hard-to-understand list of semi-plausible paths to this “aligned and deferable human replacement” exit is:
Try to build as many diverse model organism of scheming test beds as possible. We can try to build these test beds with this AI itself as a subject and with weaker AIs. (It gets increasingly hard to make model organisms experiments with smarter AIs that understand we might be running experiments and thus they might try to mess with our experiments.) Test a variety of countermeasures in this setting and try to build a robust empirical science of scheming misalignment. E.g., test high level interpretability strategies where we try to analyze what AIs are thinking. Or test simple detection methods like probes.
Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
Pursue a very different AI design which is much more modular and more hand constructed (as in, more GOFAI style). This can involve usage of many small and dumb neural components, but needs to be sufficiently interpretable in aggregate which might be hard. This can be done by having the AIs apply huge amounts of labor.
High-level and top-down interp trying to understand core objective-pursuing aspects of the AI. (E.g., what is the high level flowchart of how the AI makes decisions in non-obvious cases and what low level mechanisms could drive misalignment.)
Develop somewhat principled approaches that marginally reduce the probability of scheming (e.g. various types of RAT/LAT, applying massive reductions in the amount of situational awareness throughout training and making this performant enough, adjusting the architecture in ways that make scheming less likely such as by replacing opaque reasoning with CoT)
Do a huge drive toward making literal EMs. This probably requires a huge amount of capital and is really hard, but I could imagine AIs helping a huge amount.
Reliable prediction + ELK via empirical route (Likely incomprehensible to people without context, sorry)
ARC theory
Probes which detect disobedience or some other scalable control using white box methods. (This is spooky to use as it doesn’t guarantee the actual policy is aligned...)
Trying to build brain-like AI and various approximations of EMs
Pursue >3 other moonshots in parallel. None of the other moonshots I’m aware of seem super promising, but maybe. (Things like: deep understanding via bottom up interpretability, very good science of deep learning, something like Davidad’s proposal, generic agent foundations.)
These are roughly ranked in goodness from my perspective.
Is the context on “reliable prediction and ELK via empirical route” just “read the existing ELK literature and actually follow it” or is it stuff that’s not written down? I assume you’ve omitted it to save time, and so no worries if the latter.
EDIT: I was slightly tempted to think of this also as ‘Ryan’s ranking of live agendas that aren’t control’, but I’m not sure if ‘what you expect to work conditional on delegating to AIs’ is similar to ‘what you expect to work if humans are doing most of it?’ (my guess is the lists would look similar, but with notable exceptions, eg humans pursuing GOFAI feels less viable than ML agents pursuing GOFAI)
This isn’t really my ranking of live agendas that aren’t control. For instance, various things focused on demonstrating/measuring risks don’t appear on this list but seem pretty good. And, a bunch of these look much more compelling when you have a ton of AI labor. Also, these aren’t really agendas more like targets. It is related to my ranking on non-control agendas though.
There isn’t direct risk, but there could be substantial indirect risks because getting useful work out of these models could be important. The risk comes from not getting the work from this model that you needed to mitigate risks from later models.
If the AI company isn’t well described as even trying to optimize for safety the analysis is less clear, but all else equal, we’d prefer to get safety critical labor from weaker systems that are less likely to be egregiously misaligned. (All else is not fully equal, as making models more useful also speeds up timelines.)
It also seems likely that “sycophantic myopic reward hacking models” are differentially less useful for harder to check domains like safety research, though I haven’t thought about this very much.
My understanding is that ~6 months ago y’all were looking for an account of the tasks an automated AI safety researcher would hopefully perform, as part of answering the strategic question ‘what’s the next step after building [controlled] AGI?’ (with ‘actually stop there indefinitely’ being a live possibility)
This comment makes me think you’ve got that account of safety tasks to be automated, and are feeling optimistic about automated safety research.
Is that right and can you share a decently mechanistic account of how automated safety research might work?
[I am often skeptical of, to straw man the argument, ‘make ai that makes ai safe’, got the sense Redwood felt similarly, and now expect this may have changed.]
We’re probably more skeptical than AI companies and less skeptical than the general lesswrong crowd. We have made such a list and thought about it in a a bit of detail (but still pretty undercooked).
My current sense is that automated AI safety/alignment research to reach a reasonable exit condition seems hard, but substantially via the mechanism that achieving the desirable exit condition is an objectively very difficult research tasks for which our best ideas aren’t amazing, and less so via the mechanism that its impossible to get the relevant AIs to help (including that I don’t think it is impossible to get these AIs to help a bunch if they are scheming / egregiously misaligned).
Concretely, I’d target the exit condition of producing AIs (or potentially emulated human minds) which:
Are capable and (philosophically) wise enough to be capable of obsoleting almost all humans including top human researchers. Note this doesn’t (clearly) require such AIs to be wildly superhuman.
And which are aligned enough that we’re happy to defer to them on tricky hard to check questions like “how should we proceed from here given our preferences (feel free to ask us questions to clarify our preferences)”
The basic hope would be to first rule out threat models that might make behavioral testing misleading (ruling out at least scheming, potentially reward hacking) and then constructing good enough behavioral tests of wisdom and iterating against these in ways that seem unlikely to substantially goodhart.
(Such AIs wouldn’t be CEV aligned and we wouldn’t be happy totally defering to them on what to do with the future.)
If we had these AIs, there are many options. We could ask them to figure out what to do overall, ask them to do their own research into aligning much smarter models, we could try to run them much faster to achieve effective superintelligence, and we could try having them control somewhat smarter models.
I think the research needed to reach this exit condition is hard and we don’t have that many amazing ideas. An alterative exit condition I consider is achieving sufficient coordination and non-proliferation to avoid building much-smarter-than-human AIs (really, whatever AIs we can’t control) for a long time (e.g. >10 years) and then figuring something else out.
My current shitty and potentially hard-to-understand list of semi-plausible paths to this “aligned and deferable human replacement” exit is:
Try to build as many diverse model organism of scheming test beds as possible. We can try to build these test beds with this AI itself as a subject and with weaker AIs. (It gets increasingly hard to make model organisms experiments with smarter AIs that understand we might be running experiments and thus they might try to mess with our experiments.) Test a variety of countermeasures in this setting and try to build a robust empirical science of scheming misalignment. E.g., test high level interpretability strategies where we try to analyze what AIs are thinking. Or test simple detection methods like probes.
Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
Pursue a very different AI design which is much more modular and more hand constructed (as in, more GOFAI style). This can involve usage of many small and dumb neural components, but needs to be sufficiently interpretable in aggregate which might be hard. This can be done by having the AIs apply huge amounts of labor.
High-level and top-down interp trying to understand core objective-pursuing aspects of the AI. (E.g., what is the high level flowchart of how the AI makes decisions in non-obvious cases and what low level mechanisms could drive misalignment.)
Develop somewhat principled approaches that marginally reduce the probability of scheming (e.g. various types of RAT/LAT, applying massive reductions in the amount of situational awareness throughout training and making this performant enough, adjusting the architecture in ways that make scheming less likely such as by replacing opaque reasoning with CoT)
Do a huge drive toward making literal EMs. This probably requires a huge amount of capital and is really hard, but I could imagine AIs helping a huge amount.
Reliable prediction + ELK via empirical route (Likely incomprehensible to people without context, sorry)
ARC theory
Probes which detect disobedience or some other scalable control using white box methods. (This is spooky to use as it doesn’t guarantee the actual policy is aligned...)
Trying to build brain-like AI and various approximations of EMs
Pursue >3 other moonshots in parallel. None of the other moonshots I’m aware of seem super promising, but maybe. (Things like: deep understanding via bottom up interpretability, very good science of deep learning, something like Davidad’s proposal, generic agent foundations.)
These are roughly ranked in goodness from my perspective.
Is the context on “reliable prediction and ELK via empirical route” just “read the existing ELK literature and actually follow it” or is it stuff that’s not written down? I assume you’ve omitted it to save time, and so no worries if the latter.
EDIT: I was slightly tempted to think of this also as ‘Ryan’s ranking of live agendas that aren’t control’, but I’m not sure if ‘what you expect to work conditional on delegating to AIs’ is similar to ‘what you expect to work if humans are doing most of it?’ (my guess is the lists would look similar, but with notable exceptions, eg humans pursuing GOFAI feels less viable than ML agents pursuing GOFAI)
Stuff that’s not written down, sorry.
This isn’t really my ranking of live agendas that aren’t control. For instance, various things focused on demonstrating/measuring risks don’t appear on this list but seem pretty good. And, a bunch of these look much more compelling when you have a ton of AI labor. Also, these aren’t really agendas more like targets. It is related to my ranking on non-control agendas though.