GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths”)
I feel like 0.1% vs. 5% might matter a lot, particularly if we don’t have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. worlds without strong international coordination, where the US/China/etc. don’t trust each other to stop and can’t verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
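To make the trade-off concrete, here’s a minimal sketch of the kind of comparison I have in mind. All the numbers are made up, including the 2% risk I’ve assigned to a hypothetical less careful actor; the point is just that the decision can flip between the 0.1% and 5% estimates:

```python
# Toy expected-value comparison: every number here is made up, purely
# to illustrate why granularity in the risk estimate can matter.

def expected_deaths(p_catastrophe: float, deaths_at_stake: float = 8e9) -> float:
    """Expected deaths if the actor holding this risk estimate proceeds."""
    return p_catastrophe * deaths_at_stake

# Hypothetical: the risk if a less careful actor takes the lead instead.
P_LESS_CAREFUL_ACTOR = 0.02

for p_careful_lab in (0.001, 0.05):  # the 0.1% vs. 5% estimates above
    cost_proceed = expected_deaths(p_careful_lab)       # careful lab goes ahead
    cost_defer = expected_deaths(P_LESS_CAREFUL_ACTOR)  # someone else passes them
    decision = "proceed" if cost_proceed < cost_defer else "pause"
    print(f"careful lab at {p_careful_lab:.1%}: "
          f"proceed ~ {cost_proceed:.1e} vs. defer ~ {cost_defer:.1e} "
          f"expected deaths -> {decision}")
```

Under these (made-up) numbers, the 0.1% estimate favors the careful lab proceeding while the 5% estimate favors pausing, so which estimate you believe changes the recommended action.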
My worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
This seems to me to assume a somewhat simplistic methodology for the risk assessment; again, this comes down to how good the team is, which I agree would be a very important factor.
Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
I’m claiming that, conditional on such estimation teams being enshrined in official regulation, I’d expect their results to get misused. Therefore, I’d rather that we didn’t have official regulation set up this way.
The kind of risk assessments I think I would advocate would be based on the overall risk of a lab’s policy, rather than their immediate actions. I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence). More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]. (importantly, it won’t always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
Ah, my bad if I lost the thread there.
I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like checks on unsafe strategies, done well, encourage safer strategies; I agree overconfidence is an issue, though.
More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]
Seems true in an ideal world, but in practice I’d imagine it’s much easier to get consensus when you have more concrete evidence of danger/misalignment. There seems to be lots of disagreement even within the current alignment field, and I don’t expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.
To be clear, I think if we could push a button for an international pause now it would be great, and it’s good to advocate for that to shift the Overton Window if nothing else. But in terms of realistic plans, it seems good to aim for stuff a bit closer to evaluating the next step than to overall policies, about which there is massive disagreement.
(Of course there’s a continuum between just looking at the next step and looking at the overall plan; there totally should be people doing both, and there are, so it’s a question at the margin, etc.)
The other portions of your comment I think I’ve already given my thoughts on previously, but overall I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well, I think it could set up incentives well, but yes, done poorly it will get Goodharted. Anyway, I’m not sure it’s particularly likely to get enshrined into regulation anytime soon, so hopefully we’ll get some evidence as to how feasible it is and how it’s perceived via pilots, and go from there.