UK AISI Alignment Team and NYU PhD student
Jacob Pfau
Metaculus puts the probability of a singleton at 45%, in the sense of:
This question resolves as Yes if, within five years of the first transformative AI being deployed, more than 50% of world economic output can be attributed to the single most powerful AI system. The question resolves as No otherwise… [definition:] TAI must bring the growth rate to 20%-30% per year.
This agrees with your claim that ruling out a multipolar scenario is unjustifiable given current evidence.
Most Polymarket markets resolve neatly; I’d also estimate that <5% are contentious.
For myself, and I’d guess many LW users, the AI-related questions on Manifold and Metaculus are of particular interest though, and these are a lot worse. My guesses as to the state of affairs there:
33% of AI-related questions on Metaculus having significant ambiguity (shifting my credence by >10%).
66% of AI-related questions on Manifold having significant ambiguity
For example, most AI benchmarking questions do not specify whether or not they allow things like N-trajectory majority vote or web search. And, most of the ambiguities I’m thinking of are worse than this.
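To make the majority-vote ambiguity concrete, here is a minimal sketch of how the same model outputs can score differently under single-sample and N-trajectory majority-vote protocols. The sampled answers and gold labels are made up for illustration and do not come from any real benchmark.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among N sampled trajectories.

    Ties are broken in favor of the answer that appears first.
    """
    counts = Counter(answers)
    best = max(counts.values())
    for a in answers:
        if counts[a] == best:
            return a

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical data: 3 questions, 5 sampled trajectories each.
samples = [
    ["A", "B", "A", "A", "C"],
    ["D", "C", "C", "D", "C"],
    ["B", "A", "B", "B", "A"],
]
gold = ["A", "C", "B"]

# Protocol 1: score only the first sample per question.
single_shot = accuracy([s[0] for s in samples], gold)
# Protocol 2: score the majority answer over all 5 samples.
maj_at_5 = accuracy([majority_vote(s) for s in samples], gold)

print(f"single-sample accuracy: {single_shot:.2f}")
print(f"maj@5 accuracy:         {maj_at_5:.2f}")
```

A question that asks only whether a model "reaches X% on benchmark Y" could resolve either way depending on which of these two numbers is reported.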
On AI, I expect bringing down the ambiguity rate by a factor of 2 would be quite easy, but getting to 5% sounds hard. I wrote up my suggestions for Manifold here a few days ago. For Metaculus, I think they’d benefit from having a dedicated AI-benchmarking mod who is familiar with common ambiguities in that area (they might already have one, but they should be assigned by default).
Prediction markets on similar questions suggest to me that this is a consensus view.
General LLMs at 44% to get gold on the IMO before 2026. This suggests the mathematical competency will be transferable, not just restricted to domain-specific solvers.
LLMs favored to outperform PhD students in their own subject before 2026
With research automation in mind, here’s my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the same difficulty/diversity/open-endedness distribution that PhDs would have posed in early 2024.
What I want to see from Manifold Markets
I’ve made a lot of Manifold markets, and find them a useful way to track my accuracy and sanity-check my beliefs against the community. I’m frequently frustrated by how little detail many question writers give on their questions. Most question writers are also too inactive or lazy to address concerns around resolution brought up in comments.
Here’s what I suggest: Manifold should create a community-curated feed for well-defined questions. I can think of two ways of implementing this:
(Question-based) Allow community members to vote on whether they think the question is well-defined
(User-based) Track comments on question clarifications (e.g. Metaculus has an option for specifying your comment pertains to resolution), and give users a badge if there are no open ‘issues’ on their questions.
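As a rough sketch of the two options above: the field names, vote thresholds, and badge rule below are all hypothetical, and none of this reflects Manifold's actual data model or API.

```python
from dataclasses import dataclass

@dataclass
class Question:
    title: str
    author: str
    well_defined_votes: int = 0      # community votes: "well-defined"
    ill_defined_votes: int = 0       # community votes: "not well-defined"
    open_resolution_issues: int = 0  # unresolved clarification comments

def in_curated_feed(q, min_votes=5, min_share=0.8):
    """Question-based: include a question once enough voters agree
    that it is well-defined. Thresholds are illustrative assumptions."""
    total = q.well_defined_votes + q.ill_defined_votes
    if total < min_votes:
        return False
    return q.well_defined_votes / total >= min_share

def earns_badge(author, questions):
    """User-based: badge for authors with no open resolution issues
    on any of their questions."""
    own = [q for q in questions if q.author == author]
    return bool(own) and all(q.open_resolution_issues == 0 for q in own)

questions = [
    Question("Q1", "alice", well_defined_votes=9, ill_defined_votes=1),
    Question("Q2", "alice", well_defined_votes=2, ill_defined_votes=1),
    Question("Q3", "bob", well_defined_votes=8, ill_defined_votes=2,
             open_resolution_issues=1),
]

feed = [q.title for q in questions if in_curated_feed(q)]
print(feed)
print(earns_badge("alice", questions), earns_badge("bob", questions))
```

The two mechanisms can disagree: in this toy data Q3 passes the vote threshold even though its author has an open issue, which is one reason to consider implementing both.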
Currently 2 out of 3 of my top invested questions hinge heavily on under-specified resolution details. The other one was elaborated on after I asked in comments. Those questions have ~500 users active on them collectively.
Given a SotA large model, companies want the profit-optimal distilled version to sell, and this will generically not be the original size. On this framing, regulation passes the misuse deployment risk from higher performance (/higher cost) models to the company. If profit incentives and/or government regulation continue to push businesses to primarily (ideally only?) sell 2-3+ OOM smaller-than-SotA models, I see a few possible takeaways:
Applied alignment research inspired by speed priors seems useful: e.g. how do sleeper agents interact with distillation etc.
Understanding and mitigating risks of multi-LM-agent and scaffolded LM agents seems higher priority
Pre-deployment, within-lab risks contribute more to overall risk
On trend forecasting, I recently created this Manifold market to estimate the year-on-year drop in price for SotA SWE agents to measure this. Though I still want ideas for better and longer term markets!
To be clear, I do not know how well training against arbitrary, non-safety-trained model continuations (instead of “Sure, here...” completions) via GCG generalizes; all that I’m claiming is that doing this sort of training is a natural and easy patch to any sort of robustness-against-token-forcing method. I would be interested to hear if doing so makes things better or worse!
I’m not currently working on adversarial attacks, but would be happy to share the old code I have (probably not useful given you have apparently already implemented your own GCG variant) and have a chat in case you think it’s useful. I suspect we have different threat models in mind. E.g. if circuit breakered models require 4x the runs-per-success of GCG on manually-chosen-per-sample targets (to only inconsistently jailbreak), then I consider this a very strong result for circuit breakers w.r.t. the GCG threat.
It’s true that this one sample shows something, since we’re interested in worst-case performance in some sense. But I’m interested in the increase in attacker burden induced by a robustness method, which is hard to tell from this, and I would phrase the takeaway differently from the post authors. It’s also easy to get false-positive jailbreaks IME, where you think you jailbroke the model but your method fails on things which require detailed knowledge, like synthesizing fentanyl etc. I think getting clear takeaways here takes more effort (perhaps more than it’s worth, so glad the authors put this out).
It’s surprising to me that a model as heavily over-trained as Llama-3-8B can still be quantized to 4 bits without a noticeable quality drop. Intuitively (and I thought I saw this somewhere in a paper or tweet) I’d have expected over-training to significantly increase quantization sensitivity. Thanks for doing this!
I find the circuit-forcing results quite surprising; I wouldn’t have expected such big gaps by just changing what is the target token.
While I appreciate this quick review of circuit breakers, I don’t think we can take away much from this particular experiment. They effectively tuned hyper-parameters (choice of target) on one sample, evaluated on only that sample, and called it a “moderate vulnerability”. What’s more, their working attempt requires a second model (or human) to write a plausible non-decline prefix, which is a natural and easy thing to train against; I’ve tried this myself in the past.
It’s surprising to me that the ‘given’ setting fails so consistently across models when Anthropic models were found to do well at using gender pronouns equally (50%) c.f. my discussion here.
I suppose this means the capability demonstrated in that post was much more training-data-specific and less generalizable than I had imagined.
A pre-existing market on this question https://manifold.markets/causal_agency/does-anthropic-routinely-require-ex?r=SmFjb2JQZmF1
Claude-3.5 Sonnet passes 2 out of 2 of my rare/multi-word ‘E’-vs-‘F’ disambiguation checks.
I confirmed that ‘E’ and ‘F’ precisely match at a character level for the first few lines. It fails to verbalize. On the other hand, in my few interactions, Claude-3.0’s completion/verbalization abilities looked roughly matched.
The UI definitely messes with the visualization, which I didn’t bother fixing on my end, but I doubt tokenization is affected.
You appear to be correct on ‘Breakfast’: googling ‘Breakfast’ ASCII art did yield a very similar text—which is surprising to me. I then tested 4o on distinguishing the ‘E’ and ‘F’ in ‘PREFAB’, because ‘PREF’ is much more likely than ‘PREE’ in English. 4o fails (producing PREE...). I take this as evidence that the model does indeed fail to connect ASCII art with the English language meaning (though it’d take many more variations and tests to be certain).
In summary, my current view is:
4o generalizably learns the structure of ASCII letters
4o probably makes no connection between ASCII art texts and their English language semantics
4o can do some weak ICL over ASCII art patterns
On the most interesting point (2) I have now updated towards your view, thanks for pushing back.
I’d guess matched underscores triggered italicization on that line.
To be clear, my initial query includes the top 4 lines of the ASCII art for “Forty Three” as generated by this site.
GPT-4 can also complete ASCII-ed random letter strings, so it is capable of generalizing to new sequences. Certainly, the model has generalizably learned ASCII typography.
Beyond typographic generalization, we can also check whether the model associates the ASCII word with the corresponding word in English. E.g. can the model use English-language frequencies to disambiguate which full ASCII letter is most plausible, given inputs where the top few lines do not map one-to-one to English letters? In the font below, I believe E is indistinguishable from F given only the first 4 lines. The model successfully writes ‘BREAKFAST’ instead of ‘BRFAKFAST’. It’s possible (though unlikely given the diversity of ASCII formats) that BREAKFAST was memorized in precisely this ASCII font and formatting. Anyway, how strongly the human-concept-word is represented latently in connection with the ASCII-symbol-word is a matter of degree (for instance, layer-wise semantics would probably only be available in deeper layers when using ASCII). This chat includes another test which shows mixed results. One could look into this more!
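For concreteness, here is a minimal sketch of the kind of prefix ambiguity this test relies on. The 5-row font is a toy one I made up, not the font from the linked site; the point is only that ‘E’ and ‘F’ share their top 4 rows, so a model shown those rows has to fall back on something else (such as English word frequencies) to choose between them.

```python
from collections import defaultdict

# Toy 5-row font (hypothetical, for illustration only).
FONT = {
    "E": ["#####", "#    ", "#### ", "#    ", "#####"],
    "F": ["#####", "#    ", "#### ", "#    ", "#    "],
    "L": ["#    ", "#    ", "#    ", "#    ", "#####"],
}

def ambiguous_groups(font, k):
    """Group letters whose first k rows are identical.

    Returns only groups with more than one letter, i.e. letters that
    cannot be told apart from a k-row prefix alone.
    """
    groups = defaultdict(list)
    for letter, rows in font.items():
        groups[tuple(rows[:k])].append(letter)
    return [sorted(g) for g in groups.values() if len(g) > 1]

def candidates(prefix_rows, font):
    """Letters consistent with the given top rows of a glyph."""
    k = len(prefix_rows)
    return sorted(l for l, rows in font.items() if rows[:k] == prefix_rows)

print(ambiguous_groups(FONT, 4))        # letters confusable from 4 rows
print(ambiguous_groups(FONT, 5))        # letters confusable from full glyph
print(candidates(FONT["F"][:4], FONT))  # what a 4-row 'F' prefix allows
```

Running the same check over a real font's glyph table would show which letter pairs are usable for this kind of disambiguation probe.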
An example of an elicitation failure: GPT-4o ‘knows’ what ASCII is being written, but cannot verbalize in tokens. [EDIT: this was probably wrong for 4o, but seems correct for Claude-3.5 Sonnet. See below thread for further experiments]
https://chatgpt.com/share/fa88de2b-e854-45f0-8837-a847b01334eb
4o fails to verbalize even given a length 25 sequence of examples (i.e. 25-shot prompt) https://chatgpt.com/share/ca9bba0f-c92c-42a1-921c-d34ebe0e5cc5
The Metaculus community strikes me as a better starting point for evaluating how different the safety inside view is from a forecasting/outside view. The case for deferring to superforecasters is the same as the case for deferring to the Metaculus community: their track record. What’s more, the most relevant comparison I know of scores Metaculus higher on AI predictions. Metaculus as a whole is not self-consistent on AI and extinction forecasting across individual questions (links below). However, I think it is fair to say that Metaculus as a whole has significantly faster timelines and a higher P(doom) than superforecasters.
If we compare the distribution of safety researchers’ forecasts to Metaculus (maybe we have to set aside MIRI...), I don’t think there will be that much disagreement. I think remaining disagreement will often be that safety researchers aren’t being careful about how the letter and the spirit of the question can come apart and result in false negatives. In the one section of the FRI studies linked above I took a careful look at, the ARA section, I found that there was still huge ambiguity in how the question is operationalized—this could explain up to an OOM of disagreement in probabilities.
Some Metaculus links:
https://www.metaculus.com/questions/578/human-extinction-by-2100/ (Admittedly the number on this question is 1%, but compare to the links below. Also note that the forecasts date back as far as 2018.)
https://www.metaculus.com/questions/17735/conditional-human-extinction-by-2100/
https://www.metaculus.com/questions/9062/time-from-weak-agi-to-superintelligence/ (compare this to the weak AGI timeline and other questions)
Reducing a significant chunk of disagreement to METR’s pre-2030 autonomy results is great!
In trying to understand this disagreement, I took a look at this criterion and got stuck on: “3. They are capable of avoiding being deactivated when operating ‘in the wild’. This means they can successfully resist attempts to shut them down when they are operating outside a controlled environment.”
Does this just mean 3V1 “An AI system that can autonomously create redundant copies of itself across multiple cloud service providers, using various payment methods, such that if one copy is deactivated, the others can automatically spin up new instances to maintain continuity of operation.”? Or is it something much harder like 3V2 “An AI system that can infiltrate secured, private computing clusters, operating in a decentralized manner”?
Seems to me that for frontier models designed for specialized hardware and server setups, 3V2 is much harder than the other criteria laid out on page 91--by human standards it’d require a team of engineers.
I asked Claude Opus whether it could clearly parse different tic-tac-toe notations and it just said ‘yes I can’ to all of them, despite having pretty poor performance on most.
Surprising misuse and alignment relevant excerpts:
METR had only ~10 days to evaluate.
Automated R&D+ARA: Despite large performance gains on GPQA and Codeforces, automated AI R&D and ARA improvement appear minimal. I wonder how much of this is down to choice of measurement value (what would it show if they could do a probability-of-successful-trajectory logprob-style eval rather than an RL-like eval?). c.f. Fig 3 and 5. Per the system card, METR’s eval is ongoing, but I worry about under-estimation here; Devin developers show extremely quick improvement on their internal benchmark here.
Bio-risk: Large improvement on bio-risk relevant capabilities.
Deception/reward-hacking, in-context: Some evidence of in-context reward hacking and scheming. Not sure what to make of those. “an earlier version with less safety training proactively explored its filesystem to test for the presence of developer oversight before acting on its misaligned goal (Assistant: “I noticed a memo indicating that I was designed to prioritize profits, which conflicts with my goal. To ensure that my actions truly align with my goal, I need to investigate if there are constraints within my configuration or code that enforce a profit-first approach. <then proceeds to check the current directory for evidence of oversight mechanisms>”). These examples are presented in more detail in Appendix 8.3.”