Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev
Zac Hatfield-Dodds
Thanks for these clarifications. I didn’t realize that the 30% was for the new yellow-line evals rather than the current ones.
That’s how I was thinking about the predictions I was making; others might have been thinking about the current evals, which were more stable.
I’m having trouble parsing this sentence. What you mean by “doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals”? Doesn’t pausing include focusing on mitigations and evals?
Of course, but pausing also means we’d have to shuffle people around, interrupt other projects, and deal with a lot of other disruption (the costs of pausing). Ideally, we’d continue updating our yellow-line evals to stay ahead of model capabilities until mitigations are ready.
The yellow-line evals are already a buffer (‘sufficient to rule out red-lines’) which are themselves a buffer (6x effective compute) before actually-dangerous situations. Since triggering a yellow-line eval requires pausing until we either have safety and security mitigations in place or design a better yellow-line eval with a higher ceiling, doing so only risks the costs of pausing when we could have instead prepared mitigations or better evals. I therefore think it’s reasonable to keep going basically regardless of the probability of triggering in the next round of evals. I also expect that if we did develop some neat new elicitation technique we thought would trigger yellow-line evals, we’d re-run them ahead of schedule.
I also think people might be reading much more confidence into the 30% than is warranted; my contribution to this process included substantial uncertainty about what yellow-lines we’d develop for the next round, and enough calibration training to avoid very low probabilities.
Finally, the point of these estimates is that they can guide research and development prioritization—high estimates suggest that it’s worth investing in more difficult yellow-line evals, and/or that elicitation research seems promising. Tying a pause to that estimate is redundant with the definition of a yellow-line, and would risk some pretty nasty epistemic distortions.
What about whistleblowing or anonymous reporting to governments? If an Anthropic employee was so concerned about RSP implementation (or more broadly about models that had the potential to cause major national or global security threats), where would they go in the status quo?
That really seems more like a question for governments than for Anthropic! For example, the SEC or IRS whistleblower programs operate regardless of what companies purport to “allow”, and I think it’d be cool if the AISI had something similar.
If I was currently concerned about RSP implementation per se (I’m not), it’s not clear why the government would get involved in a matter of voluntary commitments by a private organization. If there was some concern touching on the White House commitments, Bletchley Declaration, Seoul Declaration, etc., then I’d look up the appropriate monitoring body; if in doubt, the Commerce whistleblower office or AISI seem like reasonable starting points.
Conjecture is indeed a for-profit company, and I’ve often found Connor disingenuous at best.
“red line” vs “yellow line”
Passing a red-line eval indicates that the model requires ASL-n mitigations. Yellow-line evals are designed to be easier to implement and/or run, while maintaining the property that if a model fails them it would also fail the red-line evals. If a model passes the yellow-line evals, we have to pause training and deployment until we either put a higher standard of security and safety measures in place, or design and run new tests which demonstrate that the model is below the red line. For example, leaving out the “register a typo’d domain” step from an ARA eval, because there are only so many good typos for our domain.
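To make that decision logic concrete, here’s a minimal sketch in Python; the boolean input and the wording of the outcomes are placeholders for the real eval suites and mitigations, not Anthropic’s actual tooling.

```python
def next_step(passes_yellow_line_evals: bool) -> str:
    """Yellow-line evals are built so that failing them also rules out the red line."""
    if not passes_yellow_line_evals:
        # Below the yellow line implies below the red line: keep going.
        return "continue training and deployment"
    # Can no longer rule out the red line: pause until ASL-n safety and security
    # mitigations are in place, or design and run harder yellow-line evals
    # (with a higher ceiling) which the model fails.
    return "pause until mitigations are ready or new evals rule out the red line"
```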
assurance mechanisms
Our White House commitments mean that we’re already reporting safety evals to the US Government, for example. I think the natural reading of “validated” is some combination of those, though obviously it’s very hard to validate that whatever you’re doing is ‘sufficient’ security against serious cyberattacks or safety interventions on future AI systems. We do our best.
I’m glad to see that the non-compliance reporting policy has been implemented and includes anonymous reporting. I’m still hoping to see more details. (And I’m generally confused about why Anthropic doesn’t share more details on policies like this — I fail to imagine a story about how sharing details could be bad, except that the details would be seen as weak and this would make Anthropic look bad.)
What details are you imagining would be helpful for you? Sharing the PDF of the formal policy document doesn’t mean much compared to whether it’s actually implemented and upheld and treated as a live option that we expect staff to consider (fwiw: it is, and I don’t have a non-disparage agreement). On the other hand, sharing internal docs eats a bunch of time in reviewing it before release, chance that someone seizes on a misinterpretation and leaps to conclusions, and other costs.
I believe that meeting our ASL-2 deployment commitments—e.g. enforcing our acceptable use policy, and data-filtering plus harmlessness evals for any fine-tuned models—with widely available model weights is presently beyond the state of the art. If a project or organization makes RSP-like commitments, evaluates and mitigates risks, and can uphold that while releasing model weights… I think that would be pretty cool.
(also note that e.g. Llama is not open source—I think you’re talking about releasing weights; the license doesn’t affect safety, but as an open-source maintainer the distinction matters to me)
While some companies, such as OpenAI and Anthropic, have publicly advocated for AI regulation, Time reports that in closed-door meetings, these same companies “tend to advocate for very permissive or voluntary regulations.”
I think that dropping the intermediate text which describes ‘more established big tech companies’ such as Microsoft substantially changes the meaning of this quote—“these same companies” is not “OpenAI and Anthropic”. Full context:
Executives from the newer companies that have developed the most advanced AI models, such as OpenAI CEO Sam Altman and Anthropic CEO Dario Amodei, have called for regulation when testifying at hearings and attending Insight Forums. Executives from the more established big technology companies have made similar statements. For example, Microsoft vice chair and president Brad Smith has called for a federal licensing regime and a new agency to regulate powerful AI platforms. Both the newer AI firms and the more established tech giants signed White House-organized voluntary commitments aimed at mitigating the risks posed by AI systems. But in closed door meetings with Congressional offices, the same companies are often less supportive of certain regulatory approaches
AI Lab Watch makes it easy to get some background information by comparing commitments made by OpenAI, Anthropic, Microsoft, and some other established big tech companies.
Meta’s Llama 3 model is also *not* open source, despite the Chief AI Scientist at the company, Yann LeCun, frequently proclaiming that it is.
This is particularly annoying because he knows better: the latter two of those three tweets are from January 2024, and here’s video of his testimony under oath in September 2023: “the Llama system was not made open-source”.
It’s a sparse autoencoder because part of the loss function is an L1 penalty encouraging sparsity in the hidden layer. Otherwise, it would indeed learn a simple identity map!
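For concreteness, here’s a minimal sketch of that setup, assuming a single ReLU hidden layer and an illustrative `l1_coeff`; it’s not any particular paper’s implementation. Dropping the sparsity term leaves only the reconstruction objective, which a wide-enough hidden layer can satisfy by learning an identity map.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_input: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_input, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_input)

    def forward(self, x):
        h = torch.relu(self.encoder(x))  # hidden activations we want to be sparse
        return self.decoder(h), h

def sae_loss(x, x_hat, h, l1_coeff=1e-3):
    reconstruction = ((x - x_hat) ** 2).mean()        # plain autoencoder objective
    sparsity = l1_coeff * h.abs().sum(dim=-1).mean()  # L1 penalty on the hidden layer
    return reconstruction + sparsity
```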
Tom Davidson’s work on a compute-centric framework for takeoff speed is excellent, IMO.
you CAN predict that there will be evidence with equal probability of each direction.
More precisely, the expected value of upwards and downwards updates should be the same; it’s nonetheless possible to be very confident that you’ll update in a particular direction—offset by a much larger and proportionately less likely update in the other.
For example, I have some chance of winning a lottery this year, not much lower than if I actually bought a ticket. I’m very confident that each day I’ll give somewhat lower odds (as there’s less time remaining), but being credibly informed that I’ve won would radically change the odds, such that the expectation balances out.
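With made-up numbers (a one-in-a-million chance this year, and the winning news equally likely to arrive on any day), the arithmetic looks like this:

```python
p_win = 1e-6                 # prior: chance I win a lottery this year
p_news_today = p_win / 365   # chance that today is the day I'm credibly informed

p_if_informed = 1.0          # posterior after the (very unlikely) winning news
p_if_no_news = (p_win - p_news_today) / (1 - p_news_today)  # slightly below the prior

expected_posterior = p_news_today * p_if_informed + (1 - p_news_today) * p_if_no_news
assert abs(expected_posterior - p_win) < 1e-12   # the expected update is zero
```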
I agree that there’s no substitute for thinking about this for yourself, but I think that morally or socially counting “spending thousands of dollars on yourself, an AI researcher” as a donation would be an appalling norm. There are already far too many unmanaged conflicts of interest and trust-me-it’s-good funding arrangements in this space for me, and I think it leads to poor epistemic norms as well as social and organizational dysfunction. I think it’s very easy for donating to people or organizations in your social circle to have substantial negative expected value.
I’m glad that funding for AI safety projects exists, but the >10% of my income I donate will continue going to GiveWell.
Trivially true to the extent that you are about equally likely to observe a thing throughout that timespan; and the Lindy Effect is at least regularly talked of.
But there are classes of observations for which this is systematically wrong: for example, most people who see a ship part-way through a voyage will do so while it’s either departing or arriving in port. Investment schemes are just such a class, because markets are usually up to the task of consuming alpha and tend to be better when the idea is widely known—even Buffett’s returns have oscillated around the index over the last few years!
Safety properties aren’t the kind of properties you can prove; they’re statements about the world, not about mathematical objects. I very strongly encourage anyone reading this comment to go read Leveson’s Engineering a Safer World (free pdf from author) through to the end of chapter three—it’s the best introduction to systems safety that I know of and a standard reference for anyone working with life-critical systems. how.complexsystems.fail is the short-and-quotable catechism.
I’m not really sure what you mean by “AI toolchain”, nor what threat model would have a race-condition present an existential risk. More generally, formal verification is a research topic—there’s some neat demonstration systems and they’re used in certain niches with relatively small amounts of code and compute, simple hardware, and where high development times are acceptable. None of those are true of AI systems, or even libraries such as Pytorch.
For flavor, some of the most exciting developments in formal methods: I expect the Lean FRO to improve usability, and ‘autoformalization’ tricks like Proofster (pdf) might also help—but it’s still niche, and “proven correct” software can still have bugs from under-specified components, incorrect axioms, or outright hardware issues (e.g. Spectre, Rowhammer, cosmic rays, etc.). The seL4 microkernel is great, but you still have to supply an operating system and application layer, and then ensure the composition is still safe. To test an entire application stack, I’d instead turn to Antithesis, which is amazing so long as you can run everything in an x86 hypervisor (with no GPUs).
(as always, opinions my own)
I think he’s actually quite confused here—I imagine saying
Hang on—you say that (a) we can think, and (b) we are the instantiations of any number of computer programs. Wouldn’t instantiating one of those programs be a sufficient condition of understanding? Surely if two things are isomorphic even in their implementation, either both can think, or neither.
(the Turing test suggests ‘indistinguishable in input/output behaviour’, which I think is much too weak)
See e.g. https://mschloegel.me/paper/schloegel2024sokfuzzevals.pdf
Fuzzing is a generally pretty healthy subfield, but even there most peer-reviewed papers in top venues are still completely useless! Importantly, “a ‘working’ github repo” is really not enough to ensure that your results are reproducible, let alone ensure external validity.
people’s subjective probability of successful restoration to life in the future, conditional on there not being a global catastrophe destroying civilization before then. This is also known as p(success).
This definition seems relevantly modified by the conditional!
You also seem to be assuming that “probability of revival” could be a monocausal explanation for cryonics interest, but I find that implausible ex ante. Monocausality approximately doesn’t exist, and “is being revived good in expectation / good with what probability” are also common concerns. (CF)
Very little, because most CS experiments are not in fact replicable (and that’s usually only one of several serious methodological problems).
CS does seem somewhat ahead of other fields I’ve worked in, but I’d attribute that to the mostly-separate open source community rather than academia per se.
My impression is that the effects of genes which vary between individuals are essentially independent, and small effects are almost always locally linear. With the amount of measurement noise and number of variables, I just don’t think we could pick out nonlinearities or interaction effects of any plausible strength if we tried!
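As a rough illustration of that power problem (with invented effect sizes and sample size, not real GWAS data), here’s a simulation sketch in which a genuine but weak pairwise interaction is swamped by its standard error:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_snps = 10_000, 1_000
genotypes = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)

additive_effects = rng.normal(0, 0.02, size=n_snps)  # many tiny, independent linear effects
interaction_effect = 0.02                             # one weak pairwise interaction
phenotype = (genotypes @ additive_effects
             + interaction_effect * genotypes[:, 0] * genotypes[:, 1]
             + rng.normal(0, 1.0, size=n_people))     # residual / measurement noise

# Regress on the two interacting SNPs plus their product and inspect the interaction term.
X = np.column_stack([np.ones(n_people), genotypes[:, 0], genotypes[:, 1],
                     genotypes[:, 0] * genotypes[:, 1]])
beta, *_ = np.linalg.lstsq(X, phenotype, rcond=None)
resid = phenotype - X @ beta
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * resid.var())
print(f"interaction estimate {beta[3]:+.3f} +/- {se[3]:.3f}")  # typically indistinguishable from zero
```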
This may be true of people who talk a lot about open source, but among actual maintainers the attitude is pretty different. If some user causes harm with an overall positive tool, that’s on the user; but if the contributor has built something consistently or overall harmful that is indeed on them. Maintainers tend to avoid working on projects which are mostly useful for surveillance, weapons, etc. for pretty much this reason.
Source: my personal experience as a maintainer and PSF Fellow, and the multiple Python core developers I just checked with at the PyCon sprints.