Former safety researcher & TPM at OpenAI, 2020-24
sjadler
It’s much more the same than a lot of prosaic safety though, right?
Let me put it this way: If an AI can’t achieve catastrophe on that order of magnitude, it also probably cannot do something truly existential.
One of the issues this runs into is if a misaligned AI is playing possum, and so doesn’t attempt lesser catastrophes until it can pull off a true takeover. I nonetheless though think this framing points generally at the right type of work (understood that others may disagree of course)
I’ve often preferred a frame of ‘catastrophe avoidance’ over a frame of x-risk. This has a possible downside of people underfeeling the magnitude of risk, but also an upside of IMO feeling way more plausible. I think it’s useful to not need to win specific arguments about extinction, and also to not have some of the existential/extinction conflation happening in ‘x-’.
If someone is wondering what prefilling means here, I believe Ted means ‘putting words in the model’s mouth’ by being able to fabricate a conversational history where the AI appears to have said things it didn’t actually say.
For instance, if you can start a conversation midway, and if the API can’t distinguish between things the model actually said in the history vs. things you’ve written in its behalf as supposed outputs in a fabricated history, this can be a jailbreak vector: If the model appeared to already violate some policy on turns 1 and 2, it is more likely to also violate this on turn 3, whereas it might have refused if not for the apparent prior violations.
(This was harder to clearly describe than I expected.)
Great! Appreciate you letting me know & helping debug for others
Oh interesting, I’m out at the moment and don’t recall having this issue, but if you override the default number of threads for the repo to 1, does that fix it for you?
https://github.com/openai/evals/blob/main/evals/eval.py#L211
(There are two places in this file where
threads =
, would change 10 to 1 in each)
I quite like the Function Deduction eval we built, which is a problem-solving game that tests one’s ability to efficiently deduce a hidden function by testing its value on chosen inputs.
It’s runnable from the command-line (after repo install) with the command:
oaieval human_cli function_deduction
(I believe that’s the right second term, but it might just be humancli)The standard mode might be slightly easier than you want, because it gives some partial answer feedback along the way. There is also a hard mode that can be run, which would not give this partial feedback and so would be harder to find any brute forcing method.
For context, I could solve ~95% of the standard problems with a time limit of 5 minutes and a 20 turn game limit. GPT-4 could solve roughly half within that turn limit IIRC, and o1-preview could do ~99%. (o1-preview also scored ~99% on the hard mode variant; I never tested myself on this but I estimate maybe I’d have gotten 70%? The problems could also be scaled up in difficulty pretty easily if one wanted.)
This helps me to notice that there is a fairly strong and simple data poisoning attack possible with canaries, such that canaries can’t really be relied upon (amid other reasons I already believed they’re insufficient, esp once AIs can browse the Web):
The attack is that one could just siphoning up all text on the Internet that did have a canary string, and then republish it without the canary :/
I agree, these are interesting points, upvoted. I’d claim that AI output also isn’t linear with the resources—but nonetheless, that you’re right that the curve of marginal return from each AI unit could be different from each human unit in an important way. Likewise, the easier on-demand labor of AI is certainly a useful benefit.
I don’t think these contradict the thrust of my point though? That in each case, one shouldn’t just be thinking about usefulness/capability, but should also be considering the resources necessary for achieving this.
I believe we should view AGI as a ratio of capability to resources, rather than simply asking how AI’s abilities compare to humans’. This view is becoming more common, but is not yet common enough.
When people discuss AI’s abilities relative to humans without considering the associated costs or time, this is like comparing fractions by looking only at the numerators.
In other words, AGI has a numerator (capability): what the AI system can achieve. This asks questions like: For this thing that a human can do, can AI do it too? How well can AI do it? (For example, on a set of programming challenges, how many can the AI solve? How many can a human solve?)
But also importantly, AGI should account for the denominator: how many resources are required to achieve its capabilities. Commonly, this resource will be a $-cost or an amount of time. This asks questions like “What is the $-cost of getting this performance?” or “How long does this task take to complete?”.
I claim that an AI system might fail to qualify as AGI if it lacks human-level capabilities, but it could also fail by being wildly inefficient compared to a human. Both the numerator and denominator matter when evaluating these ratios.
A quick example of why focusing solely on the capability (numerator) is insufficient:
Imagine an AI software engineer that can do most tasks human engineers do, but at 100–1000× the cost of a human.
I expect that lots of the predictions about “AGI” would not come due until (at least) that cost comes down substantially so that AI >= human on a capability-per-dollar basis.
For instance, it would not make sense to directly substitute AI for human labor at this ratio—but perhaps it does make sense to buy additional AI labor, if there were extremely valuable tasks for which human labor is the bottleneck today.
The AGI-as-ratio concept has long been implicit in some labs’ definitions of AGI—for instance, OpenAI describes AGI as “a highly autonomous system that outperforms humans at most economically valuable work”. Outperforming humans economically does seem to imply being more cost-effective, not just having the same capabilities. Yet even within OpenAI, the denominator aspect wasn’t always front of mind, which is why I wrote up a memo on this during my time there.
Until the recent discussions of o1 Pro / o3 drawing upon lots of inference compute, I rarely saw these ideas discussed, even in otherwise sophisticated analyses. One notable exception to this is my former teammate Richard Ngo’s t-AGI framework, which deserves a read. METR has also done a great job of accounting for this in their research comparing AI R&D performance, given a certain amount of time. I am glad to see more and more groups thinking in terms of these factors - but in casual analysis, it is very easy to just slip into comparisons of capability levels. This is worth pushing back on, imo: The time at which “AI capabilities = human capabilities” is different than the time when I expect AGI will have been achieved in the relevant senses.
There are also some important caveats to my claim here, that ‘comparing just by capability is missing something important’:
People reasonably expect AI to become cheaper over time, so if AI matches human capabilities but not cost, that might still signal ‘AGI soon.’ Perhaps this is what people mean when they say ‘o3 is AGI’.
Computers are much faster than humans for many tasks, and so one might believe that if an AI can achieve a thing it will quite obviously be faster than a human. This is less obvious now, however, because AI systems are leaning more on repeated sampling/selection procedures.
Comparative advantage is a thing, and so AI might have very large impacts even if it remains less absolutely capable than humans for many different tasks, if the cost/speed are good enough.
There are some factors that don’t fit super cleanly into this framework: things like AI’s ‘always-on availability’, which aren’t about capability per se, but probably belong in the numerator anyway? e.g., “How good is an AI therapist?” benefits from ‘you can message it around the clock’, which increases the utility of any given task-performance. (In this sense, maybe the ratio is best understood as utility-per-resource, rather than capability-per-resource.)
sjadler’s Shortform
In case anybody’s looking for steganography evals—my team built and open-sourced some previously: https://github.com/openai/evals/blob/main/evals/elsuite/steganography/readme.md
This repo is largely not maintained any longer unfortunately, and for some evals it isn’t super obvious how the new paradigm for O1 affects them (for instance, we had built solvers/scaffolding for private scratchpads, but now having a private CoT provides this out-of-the-box and so might interact with this strangely). But still perhaps worth a look
For what it’s worth, I sent this story to a friend the other day, who’s probably ~50 now and was very active on the Internet in the 90s—thinking he’d find it amusing if he hadn’t come across it before
Not only did he remember this story contemporaneously, but he said he was the recipient of the test-email for a particular city mentioned in the article!
This is someone of high-integrity whom I trust, and makes me more confident this happened, even if some details are smoothed over as described
Yeah I appreciate the engagement, I don’t think either of those is a knock-down objection though:
The ability to illicitly gain a few credentials —> >1 account is still meaningfully different from being able to create ~unbounded accounts. It is true this means a PHC doesn’t 100% ensure a distinct person, but it can still be a pretty high assurance and significantly increase the cost of doing attacks that depend on scale.
Re: the second point, I’m not sure I fully understand—say more? By our paper’s definitions, issuers wouldn’t be able to merely choose to identify individuals. In fact, even if an issuer and service-provider colluded, PHCs are meant to be robust to this. (Devil is in the details of course.)
Hi! Created a (named) account for this—in fact, I think you can conceptually get some of those reputational defenses (memory of behavior; defense against multi-event attacks) without going so far as to drop anonymity / prove one’s identity!
See my Twitter thread here, summarizing our paper on Personhood Credentials.
Paper’s abstract:Anonymity is an important principle online. However, malicious actors have long used misleading identities to conduct fraud, spread disinformation, and carry out other deceptive schemes. With the advent of increasingly capable AI, bad actors can amplify the potential scale and effectiveness of their operations, intensifying the challenge of balancing anonymity and trustworthiness online. In this paper, we analyze the value of a new tool to address this challenge: “personhood credentials” (PHCs), digital credentials that empower users to demonstrate that they are real people—not AIs—to online services, without disclosing any personal information. Such credentials can be issued by a range of trusted institutions—governments or otherwise. A PHC system, according to our definition, could be local or global, and does not need to be biometrics-based. Two trends in AI contribute to the urgency of the challenge: AI’s increasing indistinguishability from people online (i.e., lifelike content and avatars, agentic activity), and AI’s increasing scalability (i.e., cost-effectiveness, accessibility). Drawing on a long history of research into anonymous credentials and “proof-of-personhood” systems, personhood credentials give people a way to signal their trustworthiness on online platforms, and offer service providers new tools for reducing misuse by bad actors. In contrast, existing countermeasures to automated deception—such as CAPTCHAs—are inadequate against sophisticated AI, while stringent identity verification solutions are insufficiently private for many use-cases. After surveying the benefits of personhood credentials, we also examine deployment risks and design challenges. We conclude with actionable next steps for policymakers, technologists, and standards bodies to consider in consultation with the public.
At the risk of being pedantic, just noting I don’t think it’s really correct to consider that first person as earning $300/hr. For example I’d expect to need to account for the depreciation on the jet skis (or more straightforwardly, that one is in the hole on having bought them until a certain number of hours rented), and also presumably some accrual for risk of being sued in the case of an accident.
(I do think it’s useful to notice that the jet ski rental person is much higher variance, in both directions IMO—so this can be both good or bad. I do also appreciate you sharing your experience!)