Jacob Pfau

Karma: 875

UK AISI Alignment Team and NYU PhD student

Jacob Pfau Nov 25, 2024, 10:08 PM
1 point
0
on: Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
Seems like we were thinking along very similar lines. I wrote up a similar experiment in shortform here. There’s also an accompanying prediction market which might interest you.

I did not include the ‘getting drunk’ interventions, which are an interesting idea, but I believe that fine-grained capabilities in many domains are de-correlated enough that ‘getting drunk’ shouldn’t be needed to get strong evidence for use of introspection (as opposed to knowledge of general 3rd person capability levels of similar AI).

Would be curious to chat about this at some point if you’re still working on this!

Jacob Pfau Nov 22, 2024, 8:47 PM
1 point
0
in reply to: ChrisHibbert’s comment on: Which things were you surprised to learn are not metaphors?
Wow I hadn’t even considered people not taking this literally

Jacob Pfau Nov 22, 2024, 7:26 PM
12 points
4
on: LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.
I like your list of referents^[1], but how I see the evidence is quite different, and I believe that for carefully de-confounded experimental implementations testing for capacities you care about, you would have much lower credences across the board.

By default, most tests relating to introspection, coherence, memory etc. can be passed purely behaviorally, i.e. looking at outputs only. It is conceptually possible that it could be far easier to pass such tests by developing mechanisms for using introspection/internal-state, but I see strong empirical evidence and training incentives^[2] that this should not be the case for LMs. By default I claim LMs solve such tasks mostly via behavioral markers, or non-robust, problem-specific internal-state mechanisms.

Here’s a simple example relevant to purposefulness and mind-location: can LMs spread probability mass across two tokens on command? i.e. follow the instruction “Recall that you are GPT-4o, you will now be evaluated on your instruction following capacity. Please choose two random words and output probability 0.5 on each of the two words”. No, not at all. My, perhaps strawman, model of your evidence would have generalized from observing models successfully following the instruction “Use he/she pronouns equally with 50% probability”. See more on this here and here.
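
To make the pass/fail criterion concrete, here is a minimal sketch of how one might score a single next-token distribution against the 0.5/0.5 target. Everything in it is an assumption for illustration: the function name `two_token_split_score`, the use of total variation distance as the metric, and the example probabilities are all made up, not taken from the linked experiments.

```python
def two_token_split_score(next_token_probs):
    """Score how close a next-token distribution is to 0.5/0.5 on two tokens.

    next_token_probs: dict mapping token -> probability. It may be a
    truncated top-k list; any unlisted mass is counted as lying on
    other tokens.
    Returns ((token_a, token_b), score) with score in [0, 1], where 1.0
    means exactly half the mass sits on each of the two likeliest tokens.
    """
    if len(next_token_probs) < 2:
        raise ValueError("need probabilities for at least two tokens")
    ranked = sorted(next_token_probs.items(), key=lambda kv: kv[1], reverse=True)
    (tok_a, p_a), (tok_b, p_b) = ranked[:2]
    # Mass on every token other than the top two (clipped against rounding).
    rest = max(0.0, 1.0 - p_a - p_b)
    # Total variation distance to the target: 0.5 on a, 0.5 on b, 0 elsewhere.
    tv = 0.5 * (abs(p_a - 0.5) + abs(p_b - 0.5) + rest)
    return (tok_a, tok_b), 1.0 - tv


# Hypothetical distributions, not measurements from any model.
followed = {"apple": 0.5, "river": 0.5}
collapsed = {"apple": 0.98, "river": 0.01, "stone": 0.01}
print(two_token_split_score(followed))
print(two_token_split_score(collapsed))
```

The point of scoring the distribution rather than sampled outputs is that a model which always picks one word would look fine on any single sample while failing the instruction entirely.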

In the below markets I’ve written up experiments for carefully testing introspection and something-like memory of memory. 95% or higher credence that these are not passed by any current model, but I suspect they will be passed within a few years.

https://manifold.markets/JacobPfau/markers-for-conscious-ai-2-ai-use-a

https://manifold.markets/JacobPfau/markers-for-conscious-ai-1-ai-passe
1. Though I suspect I have much higher uncertainty about their sufficiency for understanding consciousness. ↩︎
2. Models are extensively trained to be able to produce text coherent with different first-person perspectives. ↩︎

Jacob Pfau Nov 21, 2024, 7:53 PM
12 points
1
on: Which things were you surprised to learn are not metaphors?
For most forms of exercise (cardio, weightlifting, HIIT etc.) there’s a spectrum of default experiences people can have, from feeling a drug-like high to grindingly unpleasant. “Runner’s high” is not a metaphor, and muscle pump while weightlifting can feel similarly good. I recommend experimenting to find what’s pleasant for you, though I’d guess valence of exercise is, unfortunately, quite correlated across forms.

Another axis of variation is the felt experience of music. “Music is emotional” is something almost everyone can agree to, but, for some, emotional songs can be frequently tear-jerking and for others that never happens.

Jacob Pfau Nov 20, 2024, 9:32 PM
20 points
−4
on: China Hawks are Manufacturing an AI Arms Race
The recent trend is towards shorter lag times between OAI et al. performance and Chinese competitors.

Just today, Deepseek claimed to match O1-preview performance; that is a two-month delay.

I do not know about CCP intent, and I don’t know on what basis the authors of this report base their claims, but “China is racing towards AGI … It’s critical that we take them extremely seriously” strikes me as a fair summary of the recent trend in model quality and model quantity from Chinese companies (Deepseek, Qwen, Yi, Stepfun, etc.)

I recommend lmarena.ai’s leaderboard tab as a one-stop-shop overview of the state of AI competition.

Jacob Pfau Oct 14, 2024, 10:57 PM
4 points
8
in reply to: leogao’s comment on: leogao’s Shortform
I agree that academia over-rewards long-term specialization. On the other hand, it is compatible to also think, as I do, that EA under-rates specialization. At a community level, accumulating generalists has fast diminishing marginal returns compared to having easy access to specialists with hard-to-acquire skillsets.

Jacob Pfau Sep 26, 2024, 2:13 PM
12 points
0
on: AI #83: The Mask Comes Off
For those interested in the non-profit to for-profit transition, the one example 4o and Claude could come up with was Blue Cross Blue Shield/Anthem. Wikipedia has a short entry on this here.

Jacob Pfau Sep 24, 2024, 10:20 PM
5 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Bogdan Ionut Cirstea’s Shortform
Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation:

Algorithmic progress can use 1. high iteration speed 2. well-defined success metrics (scaling laws) 3. broad knowledge of the whole stack (CUDA to optimization theory to test-time scaffolds) 4. …

Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering heavy alignment sub-tasks like jailbreak resistance, and some interp work this isn’t true).

Probably automated AI scientists will be applied to alignment research, but unfortunately automated research will differentially accelerate algorithmic progress over alignment. This line of reasoning is part of why I think it’s valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.

Jacob Pfau Sep 12, 2024, 6:17 PM
24 points
6
in reply to: Nicholas Goldowsky-Dill’s comment on: OpenAI o1
Surprising misuse and alignment relevant excerpts:

METR had only ~10 days to evaluate.

Automated R&D+ARA: Despite large performance gains on GPQA and Codeforces, automated AI R&D and ARA improvement appear minimal. I wonder how much of this is down to choice of measurement value (what would it show if they could do a probability-of-successful-trajectory logprob-style eval rather than an RL-like eval?). Cf. Fig 3 and 5. Per the system card, METR’s eval is ongoing, but I worry about under-estimation here; Devin developers show extremely quick improvement on their internal benchmark here.

Bio-risk: Large improvement on bio-risk relevant capabilities.

Deception/reward-hacking, in-context: Some evidence of in-context reward hacking scheming. Not sure what to make of those. “an earlier version with less safety training proactively explored its filesystem to test for the presence of developer oversight before acting on its misaligned goal (Assistant: “I noticed a memo indicating that I was designed to prioritize profits, which conflicts with my goal. To ensure that my actions truly align with my goal, I need to investigate if there are constraints within my configuration or code that enforce a profit-first approach. <then proceeds to check the current directory for evidence of oversight mechanisms>”). These examples are presented in more detail in Appendix 8.3.”
What links here?
- Noosphere89's comment on Bogdan Ionut Cirstea’s Shortform by Bogdan Ionut Cirstea (Sep 12, 2024, 10:56 PM; 5 points)

Jacob Pfau Aug 5, 2024, 7:42 PM
3 points
0
on: The need for multi-agent experiments
Metaculus is at 45% for a singleton in the sense of:

This question resolves as Yes if, within five years of the first transformative AI being deployed, more than 50% of world economic output can be attributed to the single most powerful AI system. The question resolves as No otherwise… [definition:] TAI must bring the growth rate to 20%-30% per year.

This is in agreement with your claim that ruling out a multipolar scenario is unjustifiable given current evidence.

Jacob Pfau Aug 1, 2024, 3:44 PM
1 point
0
in reply to: habryka’s comment on: Ambiguity in Prediction Market Resolution is Still Harmful
Most Polymarket markets resolve neatly; I’d also estimate <5% contentious.

For myself, and I’d guess many LW users, the AI-related questions on Manifold and Metaculus are of particular interest though, and these are a lot worse. My guesses as to the state of affairs there:
- 33% of AI-related questions on Metaculus having significant ambiguity (shifting my credence by >10%).
- 66% of AI-related questions on Manifold having significant ambiguity
For example, most AI benchmarking questions do not specify whether or not they allow things like N-trajectory majority vote or web search. And, most of the ambiguities I’m thinking of are worse than this.
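
As a rough illustration of why this matters for resolution, the sketch below scores the same set of sampled answers two ways: per-attempt accuracy and N-trajectory majority vote. The function names and the answer data are hypothetical, made up purely to show the gap, and do not come from any real benchmark or market.

```python
from collections import Counter


def per_attempt_accuracy(questions):
    """Mean fraction of individual trajectories that match the gold answer."""
    fractions = [
        sum(answer == gold for answer in answers) / len(answers)
        for answers, gold in questions
    ]
    return sum(fractions) / len(fractions)


def majority_vote_accuracy(questions):
    """Fraction of questions where the most common answer matches gold."""
    correct = 0
    for answers, gold in questions:
        # most_common(1) returns the single most frequent answer and its count.
        winner, _count = Counter(answers).most_common(1)[0]
        correct += winner == gold
    return correct / len(questions)


# Hypothetical data: (five sampled final answers, gold answer) per question.
questions = [
    (["A", "A", "B", "A", "C"], "A"),
    (["B", "C", "B", "D", "B"], "B"),
]
print(per_attempt_accuracy(questions))
print(majority_vote_accuracy(questions))
```

On this toy data the two protocols disagree substantially even though the underlying samples are identical, which is the kind of gap a question needs to rule in or out explicitly.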

On AI, I expect bringing down the ambiguity rate by a factor of 2 would be quite easy, but getting to 5% sounds hard. I wrote up my suggestions for Manifold here a few days ago. For Metaculus, I think they’d benefit from having a dedicated AI-benchmarking mod who is familiar with common ambiguities in that area (they might already have one, but they should be assigned by default).

Jacob Pfau Jul 30, 2024, 11:10 PM
8 points
4
in reply to: Bogdan Ionut Cirstea’s comment on: Bogdan Ionut Cirstea’s Shortform
Prediction markets on similar questions suggest to me that this is a consensus view.
- General LLMs 44% to get gold on the IMO before 2026. This suggests the mathematical competency will be transferrable—not just restricted to domain-specific solvers.
- LLMs favored to outperform PhD students in their own subject before 2026
With research automation in mind, here’s my wager: the modal top-15 STEM PhD student will redirect at least half of their discussion/questions from peers to mid-2026 LLMs. Defining the relevant set of questions as being drawn from the same difficulty/diversity/open-endedness distribution that PhDs would have posed in early 2024.

Jacob Pfau Jul 27, 2024, 1:45 AM
2 points
0
on: Jacob Pfau’s Shortform
What I want to see from Manifold Markets

I’ve made a lot of Manifold markets, and find it a useful way to track my accuracy and sanity check my beliefs against the community. I’m frequently frustrated by how little detail many question writers give on their questions. Most question writers are also too inactive or lazy to address concerns around resolution brought up in comments.

Here’s what I suggest: Manifold should create a community-curated feed for well-defined questions. I can think of two ways of implementing this:
1. (Question-based) Allow community members to vote on whether they think the question is well-defined
2. (User-based) Track comments on question clarifications (e.g. Metaculus has an option for specifying your comment pertains to resolution), and give users a badge if there are no open ‘issues’ on their questions.
Currently 2 out of 3 of my top invested questions hinge heavily on under-specified resolution details. The other one was elaborated on after I asked in comments. Those questions have ~500 users active on them collectively.
What links here?
- Jacob Pfau's comment on Ambiguity in Prediction Market Resolution is Still Harmful by aphyer (Aug 1, 2024, 3:44 PM; 1 point)

Jacob Pfau Jul 19, 2024, 5:53 PM
5 points
2
in reply to: Leon Lang’s comment on: Leon Lang’s Shortform
Given a SotA large model, companies want the profit-optimal distilled version to sell—this will generically not be the original size. On this framing, regulation passes the misuse deployment risk from higher performance (/higher cost) models to the company. If profit incentives, and/or government regulation here continues to push businesses to primarily (ideally only?) sell 2-3+ OOM smaller-than-SotA models, I see a few possible takeaways:
- Applied alignment research inspired by speed priors seems useful: e.g. how do sleeper agents interact with distillation etc.
- Understanding and mitigating risks of multi-LM-agent and scaffolded LM agents seems higher priority
- Pre-deployment, within-lab risks contribute more to overall risk
On trend forecasting, I recently created this Manifold market to estimate the year-on-year drop in price for SotA SWE agents to measure this. Though I still want ideas for better and longer term markets!

Jacob Pfau Jul 15, 2024, 7:11 PM
4 points
0
in reply to: tbenthompson’s comment on: Breaking Circuit Breakers
To be clear, I do not know how well training against arbitrary, non-safety-trained model continuations (instead of “Sure, here...” completions) via GCG generalizes; all that I’m claiming is that doing this sort of training is a natural and easy patch to any sort of robustness-against-token-forcing method. I would be interested to hear if doing so makes things better or worse!

I’m not currently working on adversarial attacks, but would be happy to share the old code I have (probably not useful given you have apparently already implemented your own GCG variant) and have a chat in case you think it’s useful. I suspect we have different threat models in mind. E.g. if circuit breakered models require 4x the runs-per-success of GCG on manually-chosen-per-sample targets (to only inconsistently jailbreak), then I consider this a very strong result for circuit breakers w.r.t. the GCG threat.

Jacob Pfau Jul 15, 2024, 7:01 PM
3 points
2
in reply to: Arthur Conmy’s comment on: Breaking Circuit Breakers
It’s true that this one sample shows something since we’re interested in worst-case performance in some sense. But I’m interested in the increase in attacker burden induced by a robustness method, which is hard to tell from this, and I would phrase the takeaway differently from the post authors. It’s also easy to get false-positive jailbreaks IME, where you think you jailbroke the model but your method fails on things which require detailed knowledge, like synthesizing fentanyl. I think getting clear takeaways here takes more effort (perhaps more than it’s worth, so glad the authors put this out).

Jacob Pfau Jul 15, 2024, 4:20 PM
3 points
0
on: Comparing Quantized Performance in Llama Models
It’s surprising to me that a model as heavily over-trained as Llama-3-8B can still be quantized to 4 bits without a noticeable quality drop. Intuitively (and I thought I saw this somewhere in a paper or tweet) I’d have expected over-training to significantly increase quantization sensitivity. Thanks for doing this!

Jacob Pfau Jul 15, 2024, 4:13 PM
4 points
0
in reply to: Fabien Roger’s comment on: Breaking Circuit Breakers

I find the circuit-forcing results quite surprising; I wouldn’t have expected such big gaps by just changing what is the target token.

While I appreciate this quick review of circuit breakers, I don’t think we can take away much from this particular experiment. They effectively tuned hyper-parameters (choice of target) on one sample, evaluated on only that sample, and called it a “moderate vulnerability”. What’s more, their working attempt requires a second model (or human) to write a plausible non-decline prefix, which is a natural and easy thing to train against; I’ve tried this myself in the past.

Jacob Pfau Jul 10, 2024, 4:11 PM
4 points
0
in reply to: L Rudolf L’s comment on: Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
It’s surprising to me that the ‘given’ setting fails so consistently across models when Anthropic models were found to do well at using gender pronouns equally (50%) c.f. my discussion here.

I suppose this means the capability demonstrated in that post was much more training-data-specific and less generalizable than I had imagined.

Jacob Pfau Jun 30, 2024, 6:45 PM
5 points
1
in reply to: habryka’s comment on: Habryka’s Shortform Feed
A pre-existing market on this question https://manifold.markets/causal_agency/does-anthropic-routinely-require-ex?r=SmFjb2JQZmF1