A capabilities evaluation is defined as “a model evaluation designed to test whether a model could do some task if it were trying to. …
A safety evaluation is defined as “a model evaluation designed to test under what circumstances a model would actually try to do some task. …
I propose changing the term for this second type of evaluation to “propensity evaluations”. I think this is a better term as it directly fits the definition you provided: “a model evaluation designed to test under what circumstances a model would actually try to do some task”.
Moreover, I think that both capabilities evaluations and propensity evaluations can be types of safety evaluations. Therefore, it’s misleading to label only one of them as “safety evaluations”. For example, we could construct a compelling safety argument for current models using solely capability evaluations.
Either can be sufficient for safety: a strong argument based on capabilities (we’ve conclusively determined that the AI is too dumb to do anything very dangerous) or a strong argument based on propensity (we have a theoretically robust and empirically validated case that our training process will result in an AI that never attempts to do anything harmful).
Alternatively, a moderately strong argument based on capabilities combined with a moderately strong argument based on propensity can be sufficient, provided that the evidence is sufficiently independent.
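To make the independence point concrete, here is a rough back-of-envelope sketch (the failure probabilities are purely illustrative assumptions, not estimates): if the two arguments fail independently, the residual risk is roughly the product of their individual failure probabilities; if they share a common failure mode, combining them buys little over the weaker argument alone.

```python
# Back-of-envelope for combining two lines of defense.
# All numbers are illustrative assumptions, not estimates.
p_capability_arg_fails = 0.10  # chance the capability/control argument is wrong
p_propensity_arg_fails = 0.10  # chance the propensity argument is wrong

# If the failure modes are independent, both arguments must fail for the
# combined safety case to fail:
risk_if_independent = p_capability_arg_fails * p_propensity_arg_fails  # 0.01

# If the failures are perfectly correlated (one shared flaw), combining adds
# essentially nothing over the weaker single argument:
risk_if_correlated = max(p_capability_arg_fails, p_propensity_arg_fails)  # 0.10

print(f"independent evidence: ~{risk_if_independent:.0%} residual risk")
print(f"fully correlated:     ~{risk_if_correlated:.0%} residual risk")
```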
I sometimes refer to capability based arguments as control arguments.
Then, we can name two lines of defense:
The control line of defense: Would the AI succeed at causing bad outcomes if it tried?
The propensity line of defense: Would the AI try to cause bad outcomes?
It’s possible to develop techniques which advance either the control line of defense or the propensity line of defense. Of course, many research directions are useful for both.
I expect that in practice, we’re not very close to being able to make good propensity arguments (for instance, theory and interpretability both seem unlikely to me to establish this within the next several doublings of total R&D effort). However, we’re not that far off from building quite powerful control based countermeasures. I think these control based countermeasures could scale to establishing barely acceptable safety arguments (e.g. 1-5% doom per year) for around human level AI systems, while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x.
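As a quick sketch of what a per-year figure like that implies over time (assuming, purely for illustration, that the per-year risk is constant and independent across years):

```python
# Cumulative risk implied by a constant per-year risk, assuming independence
# across years. The horizons below are illustrative assumptions.

def cumulative_risk(per_year: float, years: int) -> float:
    """Probability of at least one catastrophe within `years` years."""
    return 1 - (1 - per_year) ** years

for per_year in (0.01, 0.05):
    for years in (3, 5, 10):
        print(f"{per_year:.0%}/year over {years:>2} years -> "
              f"{cumulative_risk(per_year, years):.0%}")
# e.g. 1%/year over 10 years is ~10%; 5%/year over 5 years is ~23%.
```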
Further, as Evan noted, control style arguments seem much more straightforward to evaluate (though various complications can result from exploration and gradient hacking). So, I think the feedback loops on control look quite promising.
(Redwood Research, where I work, is currently pursuing several control style projects and we should be putting out various things on this soon.)
An important thing to emphasize about control arguments is that it seems quite unlikely they can be made workable for very superhuman models. (At least for the notion of “control arguments” which can be readily assessed with non-insane capability evaluations.)
AI might also cause bad outcomes as a side-effect of pursuing other goals. A lot of the bad outcomes that powerful agents like companies produce today are not because the company is trying to cause bad outcomes.

Sure. I just mean “try to do things which result in bad outcomes from our perspective”.

If you care about preventing AI from causing bad outcomes, those are not the same thing. It’s important to be able to distinguish them.
[it turns out I have many questions—please consider this a pointer to the kind of information I’d find useful, rather than a request to answer them all!]
“around human level AI systems while using these AI systems to accelerate software based R&D (e.g. alignment research) by >30x”:
Can you point to what makes you think this is likely? (or why it seems the most promising approach)
In particular, I worry when people think much in terms of “doublings of total R&D effort” given that I’d expect AI assistance progress multipliers to vary hugely—with the lowest multipliers correlating strongly with the most important research directions.
To me it seems that the kind of alignment research that’s plausible to speed up 30x is the kind that we can already do without much trouble—narrowly patching various problems in ways we wouldn’t expect to generalize to significantly superhuman systems. That and generating a ton of empirical evidence quickly—which is nice, but I expect the limiting factor is figuring out what questions to ask.
It doesn’t seem plausible that we get a nice inductive pattern where each set of patches allows a little more capability safely, which in turn allows more patches… I’m not clear on when this would fail, but pretty clear that it would fail.
What we’d seem to need is a large speedup on more potentially-sufficiently-general-if-they-work approaches—e.g. MIRI/ARC-theory/JW stuff.
30x speedup on this seems highly unlikely. (I guess you’d agree?) Even if it were possible to make a month of progress in one day, it doesn’t seem possible to integrate understanding of that work in a day (if the AI is doing the high-level integration and direction-setting, we seem to be out of the [control measures will keep this safe] regime).
I also note that, empirically, theoretical teams don’t tend to scale up by adding loads of very smart humans. I’m sure that Paul could expand a lot more quickly if he thought that was helpful. Likewise MIRI. Are they making a serious error here, or are the limitations of very-smart-human assistants not going to apply to AI assistants? (granted, I expect AI assistants aren’t going to have personality clashes etc.)
Are you expecting sufficiently general alignment solutions to come out of work that doesn’t require deep integrated understanding? Can you point to current work (or properties of current work) that would be examples? Would you guess the things we could radically speed up are sufficient for a solution, or just useful? If the latter, how much painfully-slow-by-comparison work seems likely to be needed?
Or would the hope be that for more theoretical work there’s a significant speedup, even if it’s not 30x? What seems plausible to you here? 5x? Why is this currently not being achieved through human scaling? Is 5x enough to compensate for the risks? What multiplier would be just sufficient to compensate?
What would you consider early evidence of the expected multiplier for theoretical work? E.g. should we be getting a 3x speedup with current AIs on open, underspecified problems that seem somewhat easier than alignment? Are we? (on anything—not only alignment-relevant things)
My immediate reaction to this kind of approach is that it feels like wishful thinking without much evidence. However, I’m aware that I do aesthetically prefer theoretically motivated approaches—so I don’t entirely trust my reaction. I can buy being even more pessimistic about the theoretical approaches than about getting lucky with software based R&D—but to me this suggests that coordination around a stop might be the best bet.
I’m not going to respond to everything you’re saying here right now. It’s pretty likely I won’t end up responding to everything you’re saying at any point; so apologies for that.
Here are some key claims I want to make:
Serial speed is key: Speeding up theory work (e.g. ARC theory) by 5-10x should be quite doable with human level AIs because AIs run at much faster serial speeds. This is a key difference between adding AIs and adding humans: theory can be hard to parallelize, which makes adding humans look worse than increasing serial speed (see the rough sketch after this list of claims). I’m not confident in speeding up theory work by >30x with controlled, around human level AI, but this doesn’t seem impossible.
Access to the human level AIs makes safety work much more straightforward: A key difference between current safety work and future safety work is that in the future we’ll have access to the exact AIs we’re worried about. I expect this opens up a bunch of empirical work which is quite useful and relatively easy to scalably automate with AIs. I think this work could extend considerably beyond “patches”. (The hope here is similar to model organisms, but somewhat more general.)
The research target can be trusted human level systems instead of superhuman systems. One possible story for victory goes something like “control of untrusted AIs → trustworthy human level (or slightly superhuman) AIs → [some next target like fully scalable alignment]”. If human researchers are literally fully obsoleted by reasonably trustworthy human level AIs, and these AIs can collectively speed up any field of alignment by >30x, we should be in a radically better position. These trustworthy AI researchers could work on fully scalable alignment, on control or alignment of the next generation of smarter AIs, or on some combination of the two. Researching how to make trusted human level AIs seems much more tractable than researching how to align wildly superhuman systems (though both are hard to measure).
Depending on just control results in a very bad absolute level of risk, but it still might be our best option. I estimated 1-5% doom per year above, though my exact guess will vary depending on various factors. So coordination to do better than this would be great.
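On the serial-speed claim above, here is a rough back-of-envelope sketch of why a faster serial worker beats adding workers when some of the work sits on a single critical path (the serial fraction and the multipliers are illustrative assumptions, not estimates):

```python
# Why serial speed matters more than headcount when work is hard to parallelize.
# Amdahl-style back-of-envelope; all numbers are illustrative assumptions.

def speedup_from_workers(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's law: only the parallelizable fraction benefits from more workers."""
    return 1 / (serial_fraction + (1 - serial_fraction) / n_workers)

def speedup_from_serial_speed(multiplier: float) -> float:
    """A faster serial worker speeds up serial and parallel work alike."""
    return multiplier

serial_fraction = 0.5  # assume half the theory work is on a single critical path
print(speedup_from_workers(serial_fraction, n_workers=10))    # ~1.8x
print(speedup_from_workers(serial_fraction, n_workers=1000))  # ~2.0x (ceiling)
print(speedup_from_serial_speed(10))                          # 10x, no such ceiling
```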
Here are some other less important claims which feed into my overall takes:
Current AIs aren’t useful for theory yet partially because they’re too dumb. They suck at math.
I think part of the problem with current AIs is getting them enough context, but this seems like a doable technical problem, one which probably needs to be solved for AIs to reach human level, and I expect it to be solved prior to AIs taking over.
My understanding is that ARC theory’s current work involves trying to solve a bunch of relatively straightforward math problems, and that if they could solve all of these problems very quickly, this would massively accelerate their work. I expect this to remain roughly true going forward based on my understanding of their methodology, but I’m not very confident here.
AIs have other structural advantages beyond serial speed which will make speeding things up with AIs relatively easier than with humans.
This is clarifying, thanks. A few thoughts:

“Serial speed is key”:

This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn’t need to be re-directed frequently [EDIT: the preceding was poorly worded—I meant that if prior to the availability of AI assistants this were true, it’d allow a lot of speedup as the AIs take over this work; otherwise it’s less clearly so helpful]. Perhaps this is true for ARC—that’s encouraging (though it does again make me wonder why they don’t employ more mathematicians—surely not all the problems are serial on a single critical path?). I’d guess it’s less often true for MIRI and John.
Of course once there’s a large speedup of certain methods, the most efficient methodology would look different. I agree that 5x to 10x doesn’t seem implausible.
“...in the future we’ll have access to the exact AIs we’re worried about.”:
We’ll have access to the ones we’re worried about deploying.
We won’t have access to the ones we’re worried about training until we’re training them.
I do buy that this makes safety work for that level of AI more straightforward—assuming we’re not already dead. I expect most of the value is in what it tells us about a more general solution, if anything—similarly for model organisms. I suppose it does seem plausible that this is the first level we see a qualitatively different kind of general reasoning/reflection that leads us in new theoretical directions. (though I note that this makes [this is useful to study] correlate strongly with [this is dangerous to train])
“Researching how to make trustworthy human level AIs seems much more tractable than researching how to align wildly superhuman systems”:
This isn’t clear to me. I’d guess that the same fundamental understanding is required for both. “trustworthy” seems superficially easier than “aligned”, but that’s not obvious in a general context. I’d expect that implementing the trustworthy human-level version would be a lower bar—but that the same understanding would show us what conditions would need to obtain in either case. (certainly I’m all for people looking for an easier path to the human-level version, if this can be done safely—I’d just be somewhat surprised if we find one)
“So coordination to do better than this would be great”.
I’d be curious to know what you’d want to aim for here—both in a mostly ideal world, and what seems most expedient.
As far as the ideal, I happened to write something about this in another comment yesterday. Excerpt:
Best: we first prevent hardware progress and stop H100 manufacturing for a bit, then we prevent AI algorithmic progress, and then we stop scaling (ideally in that order). Then, we heavily invest in long run safety research agendas and hold the pause for a long time (20 years sounds good to start). This requires heavy international coordination.
As far as expedient, something like:
Demand that labs have good RSPs (or something similar), using both inside and outside game; try to get labs to fill in the tricky future details of these RSPs as early as possible without depending on “magic” (speculative future science which hasn’t yet been verified). Have people motivated by AI takeover risk work on the underlying tech and implementation.
Work on policy and aim for powerful US policy interventions in parallel. Other countries could also be relevant.
Both of these are unlikely to perfectly succeed, but they seem like good directions to push on.
I think pushing for AI lab scaling pauses is probably net negative right now, but I don’t feel very strongly either way (it mostly just feels not that leveraged overall). I think slowing down hardware progress seems clearly good if we could do it at low cost, but it seems super intractable.
Thanks, this seems very reasonable. I’d missed your other comment. (Oh and I edited my previous comment for clarity: I guess you were disagreeing with my clumsily misleading wording, rather than what I meant(??))
Corresponding comment text:
This makes sense, but seems to rely on the human spending most of their time tackling well-defined but non-trivial problems where an AI doesn’t need to be re-directed frequently [EDIT: the preceding was poorly worded—I meant that if prior to the availability of AI assistants this were true, it’d allow a lot of speedup as the AIs take over this work; otherwise it’s less clearly so helpful].
I think I disagree with what you meant, but not that strongly. It’s not that important, so I don’t really want to get into it. Basically, I don’t think that “well-defined” is that important (it’s not obviously required for some ability to judge the finished work), and I don’t think “re-direction frequency” is the right way to think about it.