I’d be curious whether you think it has been a good thing for Dario Amodei to publicly state his AI x-risk estimate of 10-25%, even though it’s very rough and unprincipled. If so, would it be good for labs to explicitly state a very rough estimate of catastrophic risk over the next 2 years, to inform policymakers and the public? And if so, why would it be bad to have teams with AI, forecasting, and risk management expertise make very rough estimates of the risk from model training/deployment and release them to policymakers and perhaps the public?
I’m curious where you get off the train of this being good, particularly when it becomes known to policymakers and the public that a model could pose significant risk, even if we’re not well-calibrated on exactly what the risk level is.
tl;dr: Dario’s statement seems likely to reduce overconfidence. Risk-management-style policy seems likely to increase it. Overconfidence gets us killed.
I think Dario’s public estimate of 10-25% is useful in large part because:
It makes it more likely that the risks are taken seriously.
It’s clearly very rough and unprincipled.
Conditional on regulators adopting a serious risk-management-style approach, I expect that we’ve already achieved (1).
The reason I’m against it is that it’ll actually be rough and unprincipled, but this will not be clear—in most people’s minds (including most regulators, I imagine) it’ll map onto the kind of systems that we have for e.g. nuclear risks. Further, I think that for AI risk that’s not x-risk, it may work (probably after a shaky start). Conditional on its not working for x-risk, working for non-x-risk is highly undesirable, since it’ll tend to lead to overconfidence.
I don’t think I’m particularly against teams of [people non-clueless on AI x-risk], [good general forecasters] and [risk management people] coming up with wild guesses that they clearly label as wild guesses.
That’s not what I expect would happen (if it’s part of an official regulatory system, that is). Two cases that spring to mind are:
The people involved are sufficiently cautious, and produce estimates/recommendations that we obviously need to stop. (e.g. this might be because the AI people are MIRI-level cautious, and/or the forecasters correctly assess that there’s no reason to believe they can make accurate AI x-risk predictions)
The people involved aren’t sufficiently cautious, and publish their estimates in a form you’d expect of Very Serious People, in a Very Serious Organization—with many numbers, charts and trends, and no “We basically have no idea what we’re doing—these are wild guesses!” warning in huge red letters at the top of every page.
The first makes this kind of approach unnecessary—better to have the cautious people make the case that we have no basis for these assessments beyond wild guesses.
The second seems likely to lead to overconfidence. If there’s an officially sanctioned team of “experts” making “expert” assessments for an international(?) regulator, I don’t expect this to be treated like the guess that it is in practice.
Thanks! I agree with a lot of this, will pull out the 2 sentences I most disagree with. For what it’s worth I’m not confident that this type of risk assessment would be a very valuable idea (/ which versions would be best). I agree that there is significant risk of non-cautious people doing this poorly.
The reason I’m against it is that it’ll actually be rough and unprincipled, but this will not be clear—in most people’s minds (including most regulators, I imagine) it’ll map onto the kind of systems that we have for e.g. nuclear risks.
I think quantifying “rough and unprincipled” estimates is often good, and if the team of forecasters/experts is good then in the cases where we have no idea what is going on yet (as you mentioned, the x-risk case especially as systems get stronger) they will not produce super confident estimates. If the forecasts were made by the same people and presented in the exact same manner as nuclear forecasts, that would be bad, but that’s not what I expect to happen if the team is competent.
The first makes this kind of approach unnecessary—better to have the cautious people make the case that we have no basis for these assessments beyond wild guesses.
I’d guess something like “expert team estimates a 1% chance of OpenAI’s next model causing over 100 million deaths, i.e. 1 million deaths in expectation” might hit policymakers harder than “experts say we have no idea whether OpenAI’s models will cause a catastrophe”. The former seems to sound an alarm more clearly than the latter. This is definitely not my area of expertise though.
That’s reasonable, but most of my worry comes back to:
If the team of experts is sufficiently cautious, then it’s a trivially simple calculation: a step beyond GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths” doesn’t seem to matter a whole lot)
I note that 8 billion deaths seems much more likely than 100 million, so the expectation of “1% chance of over 100 million deaths” is much more than 1 million.
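To make the arithmetic concrete, here is a minimal sketch of that expected-value point (illustrative numbers only; the 100 million and 8 billion figures are just the two outcomes discussed above):

```python
# Minimal expected-value sketch with illustrative numbers only.
# The same "1% chance of over 100 million deaths" headline implies very
# different expectations depending on what the catastrophe actually looks like.

p_catastrophe = 0.01  # the hypothetical 1% estimate

deaths_if_limited = 100e6    # catastrophe means ~100 million deaths
deaths_if_extinction = 8e9   # catastrophe means ~8 billion deaths (everyone)

print(p_catastrophe * deaths_if_limited)     # 1,000,000 deaths in expectation
print(p_catastrophe * deaths_if_extinction)  # 80,000,000 deaths in expectation
```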
If the team of experts is not sufficiently cautious, and comes up with “1% chance of OpenAI’s next model causing over 100 million deaths” given [not-great methodology x], my worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
In part, I’m worried that the argument for (1) is too simple—so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed.
I’d prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.
The only case I can see against this is [there’s a version of using AI assistants for alignment work that reduces overall risk]. Here I’d like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it’s more specific than the one sentence version, but still sketchy and relies a lot on [we hope this bit works, and this bit too...]).
However, I don’t think Eliezer’s critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he’s likely to be essentially correct—that to the extent we want AI assistants to be providing key insights that push research in the right direction, such assistants will be too dangerous; to the extent that they can’t do this, we’ll be accelerating a vehicle that can’t navigate.
[EDIT: oh and of course there’s the [if we really suck at navigation, then it’s not clear a 20-year pause gives us hugely better odds anyway] argument; but I think there’s a decent case that improving our ability to navigate might be something that it’s hard to accelerate with AI assistants, so that a 5x research speedup does not end up equivalent to having 5x more time]
But this seems to be the only reasonable crux. This aside, we don’t need complex analyses.
GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths” doesn’t seem to matter a whole lot)
I feel like 0.1% vs. 5% might matter a lot, particularly if we don’t have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. without strong international coordination where US/China/etc. trust each other to stop and we can verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
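As a rough illustration of why the granularity could matter under that no-strong-coordination assumption (all numbers here are hypothetical, and the model is far too crude to be decision-relevant on its own): compare the risk of the careful lab proceeding against the risk of pausing and a less careful actor getting there instead.

```python
# Toy comparison with hypothetical numbers: is a careful lab proceeding safer
# than pausing and letting a less careful actor reach the same capability?

p_other_reaches = 0.8        # assumed chance another actor gets there if the careful lab pauses
p_catastrophe_other = 0.10   # assumed catastrophe risk if that actor deploys

risk_if_pause = p_other_reaches * p_catastrophe_other  # 0.08

for p_catastrophe_careful in (0.001, 0.05):
    print(p_catastrophe_careful, risk_if_pause, p_catastrophe_careful < risk_if_pause)

# At 0.1% the careful lab proceeding looks much safer than pausing; at 5% the
# two options are nearly tied, so which granular estimate you believe changes
# the call.
```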
my worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
This seems to me to be assuming a somewhat simplistic methodology for the risk assessment; again this seems to come down to how good the team will be, which I agree would be a very important factor.
Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
I’m claiming that, conditional on such estimation teams being enshrined in official regulation, I’d expect their results to get misused. Therefore, I’d rather that we didn’t have official regulation set up this way.
The kind of risk assessments I think I would advocate would be based on the overall risk of a lab’s policy, rather than their immediate actions. I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence). More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]. (importantly, it won’t always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
Ah, my bad if I lost the thread there.
I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like checks on unsafe strategies, done well, encourage safer strategies; I agree overconfidence is an issue though.
More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]
Seems true in an ideal world, but in practice I’d imagine it’s much easier to get consensus when you have more concrete evidence of danger / misalignment. Seems like there’s lots of disagreement even within the current alignment field, and I don’t expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.
To be clear I think if we could push a button for an international pause now it would be great, and I think it’s good to advocate for that to shift the Overton Window if nothing else, but in terms of realistic plans it seems good to aim for stuff a bit closer to evaluating the next step than overall policies, for which there is massive disagreement.
(of course there’s a continuum between just looking at the next step and the overall plan, there totally should be people doing both and there are so it’s a question at the margin, etc.)
The other portions of your comment I think I’ve already given my thoughts on previously, but overall I’d say I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well I think it could set up incentives well but yes if done poorly it will get Goodharted. Anyway, I’m not sure it’s particularly likely to get enshrined into regulation anytime soon, so hopefully we will get some evidence as to how feasible it is and how it’s perceived via pilots and go from there.