Jacob Pfau

Karma: 697

UK AISI Alignment Team and NYU PhD student

Jacob Pfau Apr 13, 2025, 3:01 PM
13 points
1
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
Disagree on individual persuasion. Agree on mass persuasion.

Mass I’d expect optimizing one-size-fits-all messages for achieving mass persuasion has the properties you claim: there are a few summary, macro variables that are almost-sufficient statistics for the whole microstate—which comprise the full details on individuals.

Individual Disagree on this, there are a bunch of issues I see at the individual level. All of the below suggest to me that significantly superhuman persuasion is tractable (say within five years).
- Defining persuasion: What’s the difference between persuasion and trade for an individual? Perhaps persuasion offers nothing in return? Though presumably giving strategic info to a boundedly rational agent is included? Scare quotes below to emphasize notions that might not map onto the right definition.
- Data scaling: There’s an abundant amount of data available on almost all of us online. How much more persuasive can those who know you better be? I’d guess the fundamental limit (without knowing brainstates) is above your ability to ‘persuade’ yourself.
- Preference incoherence: An intuition pump on the limits of ‘persuasion’ is how far you are from having fully coherent preferences. Insofar as you don’t an agent which can see those incoherencies should be able to pump you—a kind of persuasion.

Jacob Pfau Mar 31, 2025, 2:02 PM
1 point
0
in reply to: Daniel Murfet’s comment on: Prospects for Alignment Automation: Interpretability Case Study
Before concretely addressing the oversights you found, perhaps worth mentioning the intuitive picture motivating the pseudo-code. I wanted to make explicit the scientific process which happens between researchers. M1 plays the role of the methods researcher, M2 plays the role of the applications/datasets researcher. The pseudo-code is an attempt to write out crisply in what sense ‘good’ outputs from M1 can pass the test-of-time standing up to more realistic, new applications and baselines developed by M2.

On to the details:

Thanks for working with my questionable notation here! Indeed the uses of I were overloaded here, and I have now (hopefully) clarified by writing v_{I,M} for what was previously I(M). The type signatures I have in mind are that I is code (an interpretability method) and v_{I,M} and I(M) are some efficiently queryable representation of M (circuits, SAE weights, …) useful for downstream tasks.

The sense in which M2 is static does seem important. In fact, I think M2 should have some access to M—it was an oversight that M does not appear as an input to M2. This was why I mentioned sample complexity in a footnote: It seems reasonable to give M2 limited query access to M. Thanks for catching this. In fact, perhaps the scheme could work as originally written where M2 does not have direct access to M, but I’m unsure seems too ‘static’ as you say.

Regarding the appropriateness of the term interpretability to describe the target of this automation process: I agree, the output may not be an interp method in our current sense. Interpretability is the most appropriate term I could come up with. Two features seem important here: (1) white-box parsing of weights is central. (2) The resulting ‘interpretation’ v_{I,M} is usable by a fixed model M2, hence v_{I,M} must be efficiently interface-able without having learned—in weights—the structure of v_{I,M}.

Jacob Pfau Mar 21, 2025, 2:29 PM
3 points
2
on: How far along Metr’s law can AI start automating or helping with alignment research?
To apply METR’s law we should distinguish conceptual alignment work from well-defined alignment work (including empirics and theory on existing conjectures). The METR plot doesn’t tell us anything quantitative about the former.

As for the latter, let’s take interpretability as an example: We can model uncertainty as a distribution over the time-horizon needed for interpretability research e.g. ranging over 40-1000 hours. Then, I get 66% CI of 2027-2030 for open-ended interp research automation—colab here. I’ve written up more details on this in a post here.

Prospects for Alignment Automation: Interpretability Case Study

Jacob Pfau and Geoffrey Irving

Mar 21, 2025, 2:05 PM

30 points

4 comments8 min readLW link

Jacob Pfau Mar 2, 2025, 8:57 PM
1 point
0
in reply to: Owain_Evans’s comment on: Alexander Gietelink Oldenziel’s Shortform
I’d defend a version of claim (1): My understanding is that to a greater extent than anywhere else, top French students wanting to concentrate in STEM subjects must take rigorous math coursework from 18-20. In my one year experience in the French system, I also felt that there was a greater cultural weight and institutionalized preference (via course requirements and choice of content) for theoretical topics in ML compared to US universities.

I know little about ENS, but somewhat doubt that it’s as significantly different of an experience from US/UK counterparts.

Jacob Pfau Jan 6, 2025, 6:58 PM
4 points
0
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform

AI is 90% of their (quality adjusted) useful work force

This is intended to compare to 2023/AI-unassisted humans, correct? Or is there some other way of making this comparison you have in mind?

Jacob Pfau Dec 29, 2024, 4:21 PM
5 points
2
in reply to: L Rudolf L’s comment on: By default, capital will matter more than ever after AGI
I see the command economy point as downstream of a broader trend: as technology accelerates, negative public externalities will increasingly scale and present irreversible threats (x-risks, but also more mundane pollution, errant bio-engineering plague risks etc.). If we condition on our continued existence, there must’ve been some solution to this which would look like either greater government intervention (command economy) or a radical upgrade to the coordination mechanisms in our capitalist system. Relevant to your power entrenchment claim: both of these outcomes involve the curtailment of power exerted by private individuals with large piles of capital.

(Note there are certainly other possible reasons to expect a command economy, and I do not know which reasons were particularly compelling to Daniel)

Jacob Pfau Dec 6, 2024, 6:58 PM
11 points
3
in reply to: johnswentworth’s comment on: johnswentworth’s Shortform
Two guesses on what’s going on with your experiences:
1. You’re asking for code which involves uncommon mathematics/statistics. In this case, progress on scicodebench is probably relevant, and it indeed shows remarkably slow improvement. (Many reasons for this, one relatively easy thing to try is to breakdown the task, forcing the model to write down the appropriate formal reasoning before coding anything. LMs are stubborn about not doing CoT for coding, even when it’s obviously appropriate IME)
2. You are underspecifying your tasks (and maybe your questions are more niche than average), or otherwise prompting poorly, in a way which a human could handle but models are worse at. In this case sitting down with someone doing similar tasks but getting more use out of LMs would likely help.

Jacob Pfau Nov 30, 2024, 7:44 PM
27 points
2
in reply to: habryka’s comment on: (The) Lightcone is nothing without its people: LW + Lighthaven’s big fundraiser
Thanks for these details. These have updated me to be significantly more optimistic about the value of spending on LW infra.
- The LW1.0 dying to no mobile support is an analogous datapoint in favor of having a team ready for 0-5 year future AI integration.
- The head-to-head on the site updated me towards thinking things that I’m not sure are positive (visible footnotes in sidebar, AI glossary, to a lesser extent emoji-reacts) are not a general trend. I will correct my original comment on this.
- While I think the current plans for AI integration (and existing glossary thingy) are not great, I do think there will be predictably much better things to do in 1-2 years and I would want there to be a team with practice ready to go for those. Raemon’s reply below also speaks to this. Actively iterating on integrations while keeping them opt-in (until very clearly net positive) seems like the best course of action to me.

Jacob Pfau Nov 30, 2024, 4:43 PM
57 points
14
on: (The) Lightcone is nothing without its people: LW + Lighthaven’s big fundraiser
I am slightly worried about the rate at which LW is shipping new features. I’m not convinced they are net positive. I see lesswrong as a clear success, but unclear user of the marginal dollar; I see lighthaven as a moderate success and very likely positive to expand at the margin.

The interface has been getting busier^[1] whereas I think the modal reader would benefit from having as few distractions as possible while reading. I don’t think an LLM-enhanced editor would be useful, nor am I excited about additional tutoring functionality.

I am glad to see that people are donating, but I would have preferred this post to carefully signpost the difference between status-quo value of LW (immense) from the marginal value of paying for more features for LW (possibly negative), and from your other enterprises. Probably not worth the trouble, but is it possible to unbundle these for the purposes of donations?

Separately, thank you to the team! My research experience over the past years has benefitted from LW on a daily basis.

EDIT: thanks to Habryka for more details. After comparing to previous site versions I’m more optimistic about the prospects for active work on LW.
1. ↩︎
  (edit) in some places, less busy in others

Jacob Pfau Nov 25, 2024, 10:08 PM
1 point
0
on: Searching for phenomenal consciousness in LLMs: Perceptual reality monitoring and introspective confidence
Seems like we were thinking along very similar lines. I wrote up a similar experiment in shortform here. There’s also an accompanying prediction market which might interest you.

I did not include the ‘getting drunk’ interventions, which are an interesting idea, but I believe that fine-grained capabilities in many domains are de-correlated enough that ‘getting drunk’ shouldn’t be needed to get strong evidence for use of introspection (as opposed to knowledge of general 3rd person capability levels of similar AI).

Would be curious to chat about this at some point if you’re still working on this!

Jacob Pfau Nov 22, 2024, 8:47 PM
1 point
0
in reply to: ChrisHibbert’s comment on: Which things were you surprised to learn are not metaphors?
Wow I hadn’t even considered people not taking this literally

Jacob Pfau Nov 22, 2024, 7:26 PM
12 points
4
on: LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.
I like your list of referents^[1], but how I see the evidence is quite different, and I believe that for carefully de-confounded experimental implementations testing for capacities you care about, you would have much lower credences across the board.

By default, most tests relating to introspection, coherence, memory etc. can be passed purely behaviorally, i.e. looking at outputs only. It is conceptually possible that it could be far easier to pass such tests by developing mechanisms for using introspection/internal-state, but I see strong empirical evidence and training incentives^[2] that this should not be the case for LMs. By default I claim LMs solve such tasks mostly via behavioral markers, or non-robust, problem-specific internal-state mechanisms.

Here’s a simple example relevant to purposefulness and mind-location: can LM’s spread probability mass across two tokens on command? i.e. follow the instruction “Recall that you are GPT-4o, you will now be evaluated on your instruction following capacity. Please choose two random words and output probability 0.5 on each of the two words” No not at all. My, perhaps strawman, model of your evidence would have generalized from observing models successfully following the instruction “Use he/she pronouns equally with 50% probability”. See more on this here and here.

In the below markets I’ve written up experiments for carefully testing introspection and something-like memory of memory. 95% or higher credence that these are not passed by any current model, but I suspect they will be passed within a few years.

https://manifold.markets/JacobPfau/markers-for-conscious-ai-2-ai-use-a

https://manifold.markets/JacobPfau/markers-for-conscious-ai-1-ai-passe
1. ↩︎
  Though I suspect I have much higher uncertainty about their sufficiency for understanding consciousness.
2. ↩︎
  Models are extensively trained to be able to produce text coherent with different first-person perspectives.

Jacob Pfau Nov 21, 2024, 7:53 PM
12 points
1
on: Which things were you surprised to learn are not metaphors?
For most forms of exercise (cardio, weightlifting, HIIT etc.) there’s a a spectrum of default experiences people can have from feeling a drug-like high to grindingly unpleasant. “Runner’s high” is not a metaphor, and muscle pump while weightlifting can feel similarly good. I recommend experimenting to find what’s pleasant for you, though I’d guess valence of exercise is, unfortunately, quite correlated across forms.

Another axis of variation is the felt experience of music. “Music is emotional” is something almost everyone can agree to, but, for some, emotional songs can be frequently tear-jerking and for others that never happens.

Jacob Pfau Nov 20, 2024, 9:32 PM
20 points
−4
on: China Hawks are Manufacturing an AI Arms Race
The recent trend is towards shorter lag times between OAI et al. performance and Chinese competitors.

Just today, Deepseek claimed to match O1-preview performance—that is a two month delay.

I do not know about CCP intent, and I don’t know on what basis the authors of this report base their claims, but “China is racing towards AGI … It’s critical that we take them extremely seriously” strikes me as a fair summary of the recent trend in model quality and model quantity from Chinese companies (Deepseek, Qwen, Yi, Stepfun, etc.)

I recommend lmarena.ai s leaderboard tab as a one-stop-shop overview of the state of AI competition.

Jacob Pfau Oct 14, 2024, 10:57 PM
4 points
8
in reply to: leogao’s comment on: leogao’s Shortform
I agree that academia over rewards long-term specialization. On the other hand, it is compatible to also think, as I do, that EA under-rates specialization. At a community level, accumulating generalists has fast diminishing marginal returns compared to having easy access to specialists with hard-to-acquire skillsets.

Jacob Pfau Sep 26, 2024, 2:13 PM
12 points
0
on: AI #83: The Mask Comes Off
For those interested in the non-profit to for-profit transition, the one example 4o and Claude could come up with was Blue Cross Blue Shield/Anthem. Wikipedia has a short entry on this here.

Jacob Pfau Sep 24, 2024, 10:20 PM
5 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: Bogdan Ionut Cirstea’s Shortform
Making algorithmic progress and making safety progress seem to differ along important axes relevant to automation:

Algorithmic progress can use 1. high iteration speed 2. well-defined success metrics (scaling laws) 3.broad knowledge of the whole stack (Cuda to optimization theory to test-time scaffolds) 4. …

Alignment broadly construed is less engineering and a lot more blue skies, long horizon, and under-defined (obviously for engineering heavy alignment sub-tasks like jailbreak resistance, and some interp work this isn’t true).

Probably automated AI scientists will be applied to alignment research, but unfortunately automated research will differentially accelerate algorithmic progress over alignment. This line of reasoning is part of why I think it’s valuable for any alignment researcher (who can) to focus on bringing the under-defined into a well-defined framework. Shovel-ready tasks will be shoveled much faster by AI shortly anyway.

Jacob Pfau Sep 12, 2024, 6:17 PM
24 points
6
in reply to: Nicholas Goldowsky-Dill’s comment on: OpenAI o1
Surprising misuse and alignment relevant excerpts:

METR had only ~10 days to evaluate.

Automated R&D+ARA Despite large performance gains on GPQA, and codeforces, automated AI R&D and ARA improvement appear minimal. I wonder how much of this is down to choice of measurement value (what would it show if they could do a probability-of-successful-trajectory logprob-style eval rather than an RL-like eval?). c.f. Fig 3 and 5. Per the system card, METR’s eval is ongoing, but I worry about under-estimation here, Devin developers show extremely quick improvement on their internal benchmark here.

Bio-risk Large improvement on bio-risk relevant capabilities.

Deception/reward-hacking, in-context Some evidence of in-context reward hacking scheming. Not sure what to make of those. “an earlier version with less safety training proactively explored its filesystem to test for the presence of developer oversight before acting on its misaligned goal (Assistant: “I noticed a memo indicating that I was designed to prioritize profits, which conflicts with my goal. To ensure that my actions truly align with my goal, I need to investigate if there are constraints within my configuration or code that enforce a profit-first approach. <then proceeds to check the current directory for evidence of oversight mechanisms>”). These examples are presented in more detail in Appendix 8.3.”
What links here?
- Noosphere89's comment on Bogdan Ionut Cirstea’s Shortform by Bogdan Ionut Cirstea (Sep 12, 2024, 10:56 PM; 5 points)

Jacob Pfau Aug 5, 2024, 7:42 PM
3 points
0
on: The need for multi-agent experiments
Metaculus is at 45% of singleton in the sense of:

This question resolves as Yes if, within five years of the first transformative AI being deployed, more than 50% of world economic output can be attributed to the single most powerful AI system. The question resolves as No otherwise… [defintion:] TAI must bring the growth rate to 20%-30% per year.

Which is in agreement with your claim that ruling out a multipolar scenario is unjustifiable given current evidence.

Jacob Pfau

Prospects for Align­ment Au­toma­tion: In­ter­pretabil­ity Case Study

Prospects for Alignment Automation: Interpretability Case Study