https://www.elilifland.com/. You can give me anonymous feedback here. I often change my mind and don’t necessarily endorse past writings.
elifland
And internally, we have an anonymous RSP non-compliance reporting line so that any employee can raise concerns about issues like this without any fear of retaliation.
Are you able to elaborate on how this works? Are there any other details about this available publicly? I couldn’t find more detail via a quick search.
Some specific qs I’m curious about: (a) who handles the anonymous complaints, (b) what is the scope of behavior explicitly (and implicitly re: cultural norms) covered here, (c) handling situations where a report would deanonymize the reporter (or limit them to a small number of people)?
Thanks for the response!
I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we’d re-run them ahead of schedule.
[...]
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we’d develop for the next round
Thanks for these clarifications. I didn’t realize that the 30% was for the new yellow-line evals rather than the current ones.
Since triggering a yellow-line eval requires pausing until we have either safety and security mitigations or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals
I’m having trouble parsing this sentence. What do you mean by “doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals”? Doesn’t pausing include focusing on mitigations and evals?
From the RSP Evals report:
As a rough attempt at quantifying the elicitation gap, teams informally estimated that, given an additional three months of elicitation improvements and no additional pretraining, there is a roughly 30% chance that the model passes our current ARA Yellow Line, a 30% chance it passes at least one of our CBRN Yellow Lines, and a 5% chance it crosses cyber Yellow Lines. That said, we are currently iterating on our threat models and Yellow Lines so these exact thresholds are likely to change the next time we update our Responsible Scaling Policy.
What’s the minimum X% that could replace 30% and would be treated the same as passing the yellow line immediately, if any? If you think that there’s an X% chance that with 3 more months of elicitation, a yellow line will be crossed, what’s the decision-making process for determining whether you should treat it as already being crossed?
In the RSP it says “It is important that we are evaluating models with close to our best capabilities elicitation techniques, to avoid underestimating the capabilities it would be possible for a malicious actor to elicit if the model were stolen” so it seems like folding in some forecasted elicited capabilities into the current evaluation would be reasonable (though they should definitely be discounted the further out they are).
(I’m not particularly concerned about catastrophic risk from the Claude 3 model family, but I am interested in the general policy here and the reasoning behind it)
The word “overconfident” seems overloaded. Here are some things I think that people sometimes mean when they say someone is overconfident:
They gave a binary probability that is too far from 50% (I believe this is the original one)
They overestimated a binary probability (e.g. they said 20% when it should be 1%)
Their estimate is arrogant (e.g. they say there’s a 40% chance their startup fails when it should be 95%), or maybe they give an arrogant vibe
They seem too unwilling to change their mind upon arguments (maybe their credal resilience is too high)
They gave a probability distribution that seems wrong in some way (e.g. “50% AGI by 2030 is so overconfident, I think it should be 10%”)
This one is pernicious in that any probability distribution gives very low percentages for some range, so being specific here seems important.
Their binary estimate or probability distribution seems too different from some sort of base rate, reference class, or expert(s) that they should defer to.
How much does this overloading matter? I’m not sure, but one worry is that it allows people to score cheap rhetorical points by claiming someone else is overconfident when in practice they might mean something like “your probability distribution is wrong in some way”. Beware of accusing someone of overconfidence without being more specific about what you mean.
I think the population needs to contain 356 or more people for there to be a >5% chance of 2+ deaths from that population in a 2-month span.
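A figure like this can be sanity-checked with a simple binomial model. The sketch below is not the calculation behind the number above: it assumes deaths are independent and that every person has the same probability `p_death` of dying within the 2-month window. The value 0.001 is a hypothetical input chosen for illustration, and the function names are made up.

```python
def prob_two_or_more(n: int, p_death: float) -> float:
    """P(at least 2 deaths) among n people, each dying independently with probability p_death."""
    p_zero = (1 - p_death) ** n
    p_one = n * p_death * (1 - p_death) ** (n - 1)
    return 1 - p_zero - p_one


def min_population(p_death: float, threshold: float = 0.05, max_n: int = 10_000_000) -> int:
    """Smallest n for which P(2+ deaths) exceeds threshold (simple linear search)."""
    for n in range(2, max_n + 1):
        if prob_two_or_more(n, p_death) > threshold:
            return n
    raise ValueError("threshold not reached within max_n")


# Hypothetical per-person death probability over the window (an assumption, not from the source).
print(min_population(0.001))
```

Under this assumed 0.1% rate the search lands on 356. A different assumed rate moves the threshold roughly in inverse proportion, so the answer is only as good as the mortality input.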
[cross-posting from blog]
I made a spreadsheet for forecasting the 10th/50th/90th percentile for how you think GPT-4.5 will do on various benchmarks (given 6 months after the release to allow for actually being applied to the benchmark, and post-training enhancements). Copy it here to register your forecasts.
If you’d prefer, you could also use it to predict for GPT-5, or for the state-of-the-art at a certain time e.g. end of 2024 (my predictions would be pretty similar for GPT-4.5, and end of 2024).
You can see my forecasts made with ~2 hours of total effort on Feb 17 in this sheet; I won’t describe them further here in order to avoid anchoring.
There might be a similar tournament on Metaculus soon, but not sure on the timeline for that (and spreadsheet might be lower friction). If someone wants to take the time to make a form for predicting, tracking and resolving the forecasts, be my guest and I’ll link it here.
This is indeed close enough to Epoch’s median estimate of 7.7e25 FLOPs for Gemini Ultra 1.0 (this doc cites an Epoch estimate of around 9e25 FLOPs).
FYI at the time that doc was created, Epoch had 9e25. Now the notebook says 7.7e25 but their webpage says 5e25. Will ask them about it.
Interesting, thanks for clarifying. It’s not clear to me that this is the right primary frame to think about what would happen, as opposed to just thinking first about how big compute bottlenecks are and then adjusting the research pace for that (and then accounting for diminishing returns to more research).
I think a combination of both perspectives is best, as the argument in favor of your frame is that there will be some low-hanging fruit from changing your workflow to adapt to the new cognitive labor.
Physical bottlenecks still exist, but is it really that implausible that the capabilities workforce would stumble upon huge algorithmic efficiency improvements? Recall that current algorithms are much less efficient than the human brain. There’s lots of room to go.
I don’t understand the reasoning here. It seems like you’re saying “Well, there might be compute bottlenecks, but we have so much room left to go in algorithmic improvements!” But the room to improve point is already the case right now, and seems orthogonal to the compute bottlenecks point.
E.g. if compute bottlenecks are theoretically enough to turn the 5x cognitive labor into only 1.1x overall research productivity, it will still be the case that there is lots of room for improvement but the point doesn’t really matter as research productivity hasn’t sped up much. So to argue that the situation has changed dramatically you need to argue something about how big of a deal the compute bottlenecks will in fact be.
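To make the bottleneck point concrete, here is a rough sketch using a generic two-input CES production function. It holds compute fixed and multiplies cognitive labor by 5. The labor share `alpha`, the substitution parameter `rho`, and the function names are illustrative assumptions, not values taken from Tom Davidson's report.

```python
def ces_output(labor: float, compute: float, alpha: float, rho: float) -> float:
    """CES production: (alpha*L^rho + (1-alpha)*K^rho)^(1/rho); rho -> 0 is Cobb-Douglas."""
    if abs(rho) < 1e-12:
        return labor ** alpha * compute ** (1 - alpha)
    return (alpha * labor ** rho + (1 - alpha) * compute ** rho) ** (1 / rho)


def research_speedup(labor_multiplier: float, alpha: float, rho: float) -> float:
    """Output ratio when labor scales by labor_multiplier and compute stays fixed at 1."""
    return ces_output(labor_multiplier, 1.0, alpha, rho) / ces_output(1.0, 1.0, alpha, rho)


# Hypothetical parameters: alpha = 0.5, rho ranging from strong complements to perfect substitutes.
for rho in (-10.0, -1.0, 0.0, 1.0):
    print(f"rho={rho:>5}: 5x labor -> {research_speedup(5.0, 0.5, rho):.2f}x output")
```

With strongly complementary inputs (very negative `rho`) the 5x labor gain shrinks toward 1x, while with perfect substitutes (`rho = 1`) the speedup is `alpha * 5 + (1 - alpha)`. Which regime applies is exactly the empirical question about how binding compute bottlenecks are.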
Imagine the current AGI capabilities employee’s typical work day. Now imagine they had an army of AI assistants that can very quickly do 10 hours worth of their own labor. How much more productive is that employee compared to their current state? I’d guess at least 5x. See section 6 of Tom Davidson’s takeoff speeds framework for a model.
Can you elaborate how you’re translating 10-hour AI assistants into a 5x speedup using Tom’s CES model?
I agree that <15% seems too low for most reasonable definitions of 1-10 hours and the singularity. But I’d guess I’m more sympathetic than you, depending on the definitions Nathan had in mind.
I think both of the phrases “AI capable of doing tasks that took 1-10 hours” and “hit the singularity” are underdefined, and making them clearer could lead to significantly different probabilities here.
For “capable of doing tasks that took 1-10 hours in 2024”:
If we’re saying that “AI can do every cognitive task that takes a human 1-10 hours in 2024 as well as (edit: the best) human expert”, I agree it’s pretty clear we’re getting extremely fast progress at that point, not least because AI will be able to do the vast majority of tasks that take much longer than that by the time it can do all of 1-10 hour tasks.
However, if we’re using a weaker definition like the one Richard used, “on most cognitive tasks, it beats most human experts who are given 1-10 hours to perform the task”, I think it’s much less clear due to human interaction bottlenecks.
Also, it seems like the distribution of relevant cognitive tasks that you care about changes a lot on different time horizons, which further complicates things.
Re: “hit the singularity”, I think in general there’s little agreement on a good definition here. E.g. the definition in Tom’s report is based on the doubling time of “effective compute in 2022-FLOP” shortening after “full automation”. I think it’s unclear what this corresponds to in terms of real-world impact, as both of these terms are also underdefined and hard to translate into actual capability and impact metrics.
I would be curious to hear the definitions you and Nathan had in mind regarding these terms.
In his AI Insight Forum statement, Andrew Ng puts 1% on “This rogue AI system gains the ability (perhaps access to nuclear weapons, or skill at manipulating people into using such weapons) to wipe out humanity” in the next 100 years (conditional on a rogue AI system that doesn’t go unchecked by other AI systems existing). Overall, he puts 1 in 10 million on AI causing extinction in the next 100 years.
Among existing alignment research agendas/projects, Superalignment has the highest expected value
I’m mainly arguing against public AI safety advocacy work, which was recently upvoted highly on the EA Forum.
I had the impression that it was more than just that, given the line: “In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention.” and the further attention devoted to deceptive alignment.
I appreciate these predictions, but I am not as interested in predicting personal or public opinions. I’m more interested in predicting regulatory stringency, quality, and scope.
If you have any you think faithfully represent a possible disagreement between us go ahead. I personally feel it will be very hard to operationalize objective stuff about policies in a satisfying way. For example, a big issue with the market you’ve made is that it is about what will happen in the world, not what will happen without intervention from AI x-risk people. Furthermore it has all the usual issues with forecasting on complex things 12 years in advance, regarding the extent to which it operationalizes any disagreement well (I’ve bet yes on it, but think it’s likely that evaluating and fixing deceptive alignment will remain mostly unsolved in 2035 conditional on no superintelligence, especially if there were no intervention from x-risk people).
I have three things to say here:
Thanks for clarifying.
Several months ago I proposed general, long-term value drift as a problem that I think will be hard to solve by default. I currently think that value drift is a “hard bit” of the problem that we do not appear to be close to seriously addressing, perhaps because people expect easier problems won’t be solved either without heroic effort. I’m also sympathetic to Dan Hendrycks’ arguments about AI evolution. I will add these points to the post.
Don’t have a strong opinion here, but intuitively feels like it would be hard to find tractable angles for work on this now.
I mostly think people should think harder about what the hard parts of AI risk are in the first place. It would not be surprising if the “hard bits” will be things that we’ve barely thought about, or are hard to perceive as major problems, since their relative hiddenness would be a strong reason to believe that they will not be solved by default.
Maybe. In general, I’m excited about people who have the talent for it to think about previously neglected angles.
The problem of “make sure policies are well-targeted, informed by the best evidence, and mindful of social/political difficulties” seems like a hard problem that societies have frequently failed to get right historically, and the relative value of solving this problem seems to get higher as you become more optimistic about the technical problems being solved.
I agree this is important and it was in your post but it seems like a decent description of what the majority of AI x-risk governance people are already working on, or at least not obviously a bad one. This is the phrase that I was hoping would get made more concrete.
I want to emphasize that the current policies were crafted in an environment in which AI still has a tiny impact on the world. My expectation is that policies will get much stricter as AI becomes a larger part of our life. I am not making the claim that current policies are sufficient; instead I am making a claim about the trajectory, i.e. how well we should expect society to respond at a time, given the evidence and level of AI capabilities at that time.
I understand this (sorry if I wasn’t clear), but I think it’s less obvious than you do that this trend will continue without intervention from AI x-risk people. I agree with other commenters that AI x-risk people should get a lot of the credit for the recent push. I also provided example reasons that the trend might not continue smoothly or even reverse in my point (3).
There might also be disagreements around:
Not sharing your high confidence in slow, continuous takeoff.
The strictness of regulation needed to make a dent in AI risk, e.g. if substantial international coordination is required it seems optimistic to me to assume that the trajectory will by default lead to this.
The value in things getting done faster than they would have done otherwise, even if they would have been done either way. This indirectly provides more time to iterate and get to better, more nuanced policy.
I believe that current evidence supports my interpretation of our general trajectory, but I’m happy to hear someone explain why they disagree and highlight concrete predictions that could serve to operationalize this disagreement.
Operationalizing disagreements well is hard and time-consuming especially when we’re betting on “how things would go without intervention from a community that is intervening a lot”, but a few very rough forecasts, all conditional on no TAI before resolve date:
75%: In Jan 2028, less than 10% of Americans will consider AI the most important problem.
60%: In Jan 2030, Evan Hubinger will believe that if x-risk-motivated people had not worked on deceptive alignment at all, risk from deceptive alignment would be at least 50% higher, compared to a baseline of no work at all (i.e. if risk is 5% and it would be 9% with no work from anyone, it needs to have been >7% if no work from x-risk people had been done to resolve yes).
35%: In Jan 2028, conditional on a Republican President being elected in 2024, regulations on AI in the US will be generally less stringent than they were when the previous president left office. Edit:
Thus, due to no one’s intent, insufficiently justified concerns about current open-source AI are propagated to governance orgs, which recommend banning open source based on this research.
The recommendation that current open-source models should be banned is not present in the policy paper being discussed, AFAICT. The paper’s recommendations are pictured below:
Edited to add: there is a specific footnote that says “Note that we do not claim that existing models are already too risky. We also do not make any predictions about how risky the next generation of models will be. Our claim is that developers need to assess the risks and be willing to not open-source a model if the risks outweigh the benefits” on page 31
I agree much of the community (including me) was wrong or directionally wrong in the past about the level of AI regulation and how quickly it would come.
Regarding the recommendations made in the post for going forward given that there will be some regulation, I feel confused in a few ways.
Can you provide examples of interventions that meet your bar for not being done by default? It’s hard to understand the takeaways from your post because the negative examples are made much more concrete than the proposed positive ones
You argue that we perhaps shouldn’t invest as much in preventing deceptive alignment because “regulators will likely adapt, adjusting policy as the difficulty of the problem becomes clearer”
If we are assuming that regulators will adapt and adjust regarding deception, can you provide examples of interventions that policymakers will not be able to solve themselves and why they will be less likely to notice and deal with them than deception?
You say “we should question how plausible it is that society will fail to adequately address such an integral part of the problem”. What things aren’t integral parts of the problem but that should be worked on?
I feel we would need much better evidence of things being handled competently to invest significantly less into integral parts of the problem.
You say: ‘Of course, it may still be true that AI deception is an extremely hard problem that reliably resists almost all attempted solutions in any “normal” regulatory regime, even as concrete evidence continues to accumulate about its difficulty—although I consider that claim unproven, to say the least’
If we expect some problems in AI risk to be solved by default mostly by people outside the community, it feels to me like one takeaway would be that we should shift resources to portions of the problem that we expect to be the hardest
To me, intuitively, deceptive alignment might be one of the hardest parts of the problem as we scale to very superhuman systems, even if we condition on having time to build model organisms of misalignment and experiment with them for a few years. So I feel confused about why you claim a high level of difficulty is “unproven” as a dismissal; of course it’s unproven but you would need to argue that in worlds where the AI risk problem is fairly hard, there’s not much of a chance of it being very hard.
As someone who is relatively optimistic about concrete evidence of deceptive alignment increasing substantially before a potential takeover, I think I still put significantly lower probability on it than you do due to the possibility of fairly fast takeoff.
I feel like this post is to some extent counting our chickens before they hatch (tbc I agree with the directional update as I said above). I’m not an expert on what’s going on here, but I can imagine any of the following happening (non-exhaustive list), each of which would make the current path to potentially sensible regulation in the US and internationally harder:
The EO doesn’t lead to as many resources dedicated to AI-x-risk-reducing things as we might hope. I haven’t read it myself, just the fact sheet and Zvi’s summary but Zvi says “If you were hoping for or worried about potential direct or more substantive action, then the opposite applies – there is very little here in the way of concrete action, only the foundation for potential future action.”
A Republican President comes to power in the US and reverses a lot of the effects of the EO
Rishi Sunak gets voted out in the UK (my sense is that this is likely) and the new Prime Minister is much less gung-ho on AI risk
I don’t have strong views on the value of AI advocacy, but this post seems overconfident in calling it out as being basically not useful based on recent shifts.
It seems likely that much stronger regulations will be important, e.g. the model reporting threshold in the EO was set relatively high and many in the AI risk community have voiced support for an international pause if it were politically feasible, which the EO is far from.
The public still doesn’t consider AI risk to be very important. <1% of the American public considers it the most important problem to deal with. So to the extent that raising that number was good before, it still seems pretty good now, even if slightly worse.
Oh, I’m certainly not claiming that no-one should attempt to make the estimates.
Ah my bad if I lost the thread there
I’d want regulators to push for safer strategies, not to run checks on unsafe strategies—at best that seems likely to get a local minimum (and, as ever, overconfidence).
Seems like checks on unsafe strategies, done well, encourage safer strategies. I agree overconfidence is an issue though
More [evaluate the plan to get through the minefield], and less [estimate whether we’ll get blown up on the next step]
Seems true in an ideal world, but in practice I’d imagine it’s much easier to get consensus when you have more concrete evidence of danger / misalignment. Seems like there’s lots of disagreement even within the current alignment field, and I don’t expect that to change absent more evidence of danger/misalignment and perhaps credible estimates.
To be clear I think if we could push a button for an international pause now it would be great, and I think it’s good to advocate for that to shift the Overton Window if nothing else, but in terms of realistic plans it seems good to aim for stuff a bit closer to evaluating the next step than overall policies, for which there is massive disagreement.
(of course there’s a continuum between just looking at the next step and the overall plan, there totally should be people doing both and there are so it’s a question at the margin, etc.)
The other portions of your comment I think I’ve already given my thoughts on previously, but overall I’d say I continue to think it depends a lot on the particulars of the regulation and the group doing the risk assessment; done well I think it could set up incentives well but yes if done poorly it will get Goodharted. Anyway, I’m not sure it’s particularly likely to get enshrined into regulation anytime soon, so hopefully we will get some evidence as to how feasible it is and how it’s perceived via pilots and go from there.
GPT-4 + unknown unknowns = stop. (whether they say “unknown unknowns so 5% chance of 8 billion deaths”, or “unknown unknowns so 0.1% chance of 8 billion deaths”)
I feel like .1% vs. 5% might matter a lot, particularly if we don’t have strong international or even national coordination and are trading off more careful labs going ahead vs. letting other actors pass them. This seems like the majority of worlds to me (i.e. without strong international coordination where US/China/etc. trust each other to stop and we can verify that), so building capacity to improve these estimates seems good. I agree there are also tradeoffs around alignment research assistance that seem relevant. Anyway, overall I’d be surprised if it doesn’t help substantially to have more granular estimates.
my worry isn’t that it’s not persuasive that time. It’s that x will become the standard, OpenAI will look at the report, optimize to minimize the output of x, and the next time we’ll be screwed.
This seems to me to be assuming a somewhat simplistic methodology for the risk assessment; again this seems to come down to how good the team will be, which I agree would be a very important factor.
Both of these seem false.
Re: talent, see from their website:
They don’t list their team on their site, but I know their early team includes Igor Babuschkin who has worked at OAI and DeepMind, and Christian Szegedy who has 250k+ citations including several foundational papers.
Re: resources, according to Elon’s early July tweet (ofc take Elon with a grain of salt) Grok 2 was trained on 24k H100s (approximately 3x the FLOP/s of GPT-4, according to SemiAnalysis). And xAI was working on a 100k H100 cluster that was on track to be finished in July. Also they raised $6B in May.