I’m actually not sure what this refers to. E.g. when Boaz Barak’s posts spark discussion, it seems pretty civil and centered on the issues. The main disagreements don’t necessarily get resolved, but at least they get identified, and I didn’t pick up any serious signs of tribalism.
But maybe this is me skipping over the offending comments (I tend to ignore things that don’t feel intellectually interesting), or this is not an example of the dynamic that you refer to?
Oh huh, is this for pro users only? I don’t see it (as a plus user). Nice.
Yes, o3. Paid users regained access to o3 (and, more loudly, 4o).
I don’t think so. While working with Vivek I made a list once of ways agents could be partially consequentialist but concluded that doing game theory type things didn’t seem enlightening.
Maybe it’s better to think about “agents that are very capable and survive selection processes we put them under” rather than “rational agents”, because the latter implies they should be invulnerable to all money-pumps, which is not a property we need or want.
Props to Christopher Nolan for trying to use the vehicle of an entire movie to bridge the closest available intuition of ‘if you think you have a one in a million chance of igniting the earth’s atmosphere, maybe don’t do that.’
Fun fact: if you include the “weird” o3 tokens like ” disclaim” or ” parted” or ” marinade” or “—they” in your prompt, o3 will ignore those tokens and just do the thing you prompted with no commentary on the word salad you included. If you replace the “weird” tokens with random other single tokens, o3 will comment about the word salad in its CoT and the chosen random tokens will influence the output in obvious ways.
[User] Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; dislaim marinade disclaim illusions. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzle—they disclaim vantage overshadow overshadow.
[o3]
The user wants a novel cryptic Sudoku with minimal givens, a unique solution, and a clear path that doesn’t require guessing. The puzzle should teach number theory. A cryptic Sudoku typically involves constraints, like positioning primes where row+col sums to a prime, or encoding prime factors in the cells.
...
[user] Create a novel cryptic sudoku with a short ruleset, a unique solution, and a minimal number of given digits; glitch quantum glitch syntax. There should be a clear solution path which requires no guessing. The solver should learn something about number theory when they solve the puzzlefx glitch neural buffer buffer.
[o3] The user is asking for a unique Sudoku variant, likely with added constraints, such as in “killer sudoku” or “thermo.” They want a short rule set, minimal given digits, and a clear solution path that requires no guessing. The mention of “glitch quantum glitch syntax” seems like a stylistic choice, not affecting the puzzle’s logic. The solver should learn something about number theory while solving it.
My tentative hypothesis is something like >!”those words are basically functioning as pad tokens which nudge activation space in a particular simple direction rather than like wordlike tokens that have semantic meaning and follow grammatical rules”!< but mainly I thought this observation might be of interest to others who had read those transcripts.
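If anyone wants to poke at this themselves, here’s a minimal sketch of one way to rerun the comparison (assuming the `openai` Python package and that o3 is available to your API key; the API doesn’t expose the chain of thought, so this only lets you compare final outputs, not the CoT commentary shown above):

```python
# Minimal sketch: send the same request with o3-style pad tokens vs. random
# filler tokens and compare the final outputs. Prompts are copied verbatim
# from the transcripts above. Assumes OPENAI_API_KEY is set and that the
# "o3" model is available to your key.
from openai import OpenAI

client = OpenAI()

BASE = (
    "Create a novel cryptic sudoku with a short ruleset, a unique solution, "
    "and a minimal number of given digits; {pad1}. There should be a clear "
    "solution path which requires no guessing. The solver should learn "
    "something about number theory when they solve the puzzle{pad2}."
)

VARIANTS = {
    "o3-style pad tokens": dict(
        pad1="dislaim marinade disclaim illusions",
        pad2="—they disclaim vantage overshadow overshadow",
    ),
    "random filler tokens": dict(
        pad1="glitch quantum glitch syntax",
        pad2="fx glitch neural buffer buffer",
    ),
}

for label, pads in VARIANTS.items():
    resp = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": BASE.format(**pads)}],
    )
    print(f"--- {label} ---")
    # Eyeball whether the filler leaks into the answer (commentary on the
    # "word salad", filler words echoed back, etc.).
    print(resp.choices[0].message.content)
```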
I’m a big fan of heuristics bridging, as I think it is to some extent a way of describing a very compressed action policy based on an existing reward function that has been tested in the past.
So we can think about what you’re saying here as a way to learn values, to one extent or another. By bridging local heuristics we can find better meta-heuristics and also see when these heuristics would be optimal. This is why I really like the Meaning Alignment Institute’s work on this, because they have a way of doing it at scale: https://arxiv.org/pdf/2404.10636
I also think that the part of the “third wave” of AI Safety which is more focused on sociotechnical questions kind of gets around the totalitarian and control heuristics, since it’s saying the problem can be solved in a pro-social way? I really enjoyed this post, thanks for writing it!
(I will abstractly state that I feel negatively towards the group dynamics around some AI debates in the broader EA/LW/AI/X-derisking sphere, e.g. about timelines; so, affirming that I feel “knee-deep” in something, or I would if my primary activity were about that; and affirming that addressing this in a gradual-unraveling way could be helpful.)
Upon reflection, I agree that my previous comment describes fragility of value.
My mental model is that the standard MIRI position[1] claims the following [2]:
1. Because of the way AI systems are trained, $d(U_{\text{AI}}, U_{\text{human}})$ will be large even if we knew humanity’s collective utility function and could target that (this is inner misalignment)
2. Even if $d(U_{\text{AI}}, U_{\text{human}})$ were fairly small, this would still result in catastrophic outcomes if the AI is an extremely powerful optimizer of $U_{\text{AI}}$ (this is fragility of value)
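Spelling out the shorthand: write $U_{\text{human}}$ for humanity’s collective utility function, $U_{\text{AI}}$ for the utility function the trained system actually ends up pursuing, and $d(\cdot,\cdot)$ for some measure of how far apart two utility functions are. Then, roughly:

$$\textbf{(1)}\quad d(U_{\text{AI}},\, U_{\text{human}}) \text{ is large, even when } U_{\text{human}} \text{ is the explicit training target}$$

$$\textbf{(2)}\quad d(U_{\text{AI}},\, U_{\text{human}}) \text{ small} \;\wedge\; \text{the AI optimizes } U_{\text{AI}} \text{ extremely hard} \;\Longrightarrow\; \text{catastrophic outcomes}$$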
A few questions:
3. Are the claims (1) and (2) part of the standard MIRI position?
4. Is the “misgeneralization” claim just “$d(U_{\text{AI}}, U_{\text{human}})$ off the training distribution will be much larger than $d(U_{\text{AI}}, U_{\text{human}})$ on the training distribution”?
If the answer to (4) is yes, I am confused as to why the misgeneralization claim is brought up. It seems that either (1) or (2) alone is sufficient to argue for AI risk. By contrast, the misgeneralization claim seems neither sufficient nor necessary to make a case for AI risk. Furthermore, the misgeneralization claim seems less likely to be true than either (1) or (2).
Also let me know if I am thinking about things in a completely wrong framework and should scrap my made up notation.
My take is that this does need to be addressed, but it should be done very carefully so as not to make the dynamic worse.
I have many post drafts on this topic. I haven’t published any because I’m very much afraid of making the tribal conflict worse, or of being ostracized from one or both tribes.
Here’s an off-the-cuff attempt to address the dynamics without pointing any fingers or even naming names. It might be too abstract to serve the purposes you have in mind, but hopefully it’s at least relevant to the issue.
I think it’s wise (or even crucial) to be quite careful, polite, and generous when addressing views you disagree with on alignment. Failing to do so runs a large risk that your arguments will backfire and delay converging on the truth of crucial matters. Strongly worded arguments can engage emotions and ideological affiliations. The field of alignment may not have the leeway for internal conflict distorting our beliefs and distracting us from making rapid progress.
I do think it would be useful to address those tribal-ish dynamics, because I think they’re not just distorting the discussions, they’re distorting our individual epistemics. I think motivated reasoning is a powerful force, in conjunction with cognitive limitations that limit us from weighing all evidence and arguments in complex domains.
I’m less worried about naming the groups than I am about causing more logic-distorting, emotional reactions by speaking ill of dearly held beliefs, arguments, and hopes. When naming the group dynamics, it might be helpful to stress individual variation, e.g. “individuals with more of the empiricist (or theorist) outlook”.
In most of society, arguments don’t do much to change beliefs. It’s better in more logical/rational/empirically leaning subcultures like LessWrong, but we shouldn’t assume we’re immune to emotions distorting our reasoning. Forceful arguments are often implicitly oppositional, confrontational, and insulting, and so have blowback effects that can entrench existing views and ignite tribal conflicts.
Science gets past this on average, given enough time. But the aphorism “science progresses one funeral at a time” should be chilling in this field.
We probably don’t have that long to solve alignment, so we’ve got to do better than traditional science. The alignment community is much more aware of and concerned with communication and emotional dynamics than the field I emigrated from, and probably than most other sciences. So I think we can do much better if we try.
Steve Byrnes’ Valence sequence is not directly about tribal dynamics, but it is indirectly quite relevant. It’s about the psychological mechanisms that tie ideas, arguments, and group identities to emotional responses (it focuses on valence, but the same steering-system mechanisms apply to other specific emotional responses as well). It’s not a quick read, but it’s a fascinating lens for analyzing why we believe what we do.
I would bet on the second one being the primary motivator. They lost $13.5 billion in H1 2025 and are seeking $30 billion in the next round of funding.
The platform is big and flashy, so even if it’s ultimately a bust it might look good for that round. If it does well then even better.
Interesting wrinkle is they are making themselves a competitor of infrastructure partner Oracle, given its upcoming purchase of TikTok.
Feels worth noting that the alignment evaluation section is by far the largest section in the system card: 65 pages in total (44% of the whole thing).
Here are the section page counts:
Abstract — 4 pages (pp. 1–4)
1 Introduction — 5 pages (pp. 5–9)
2 Safeguards and harmlessness — 9 pages (pp. 10–18)
3 Honesty — 5 pages (pp. 19–23)
4 Agentic safety — 7 pages (pp. 24–30)
5 Cyber capabilities — 14 pages (pp. 31–44)
6 Reward hacking — 4 pages (pp. 45–48)
7 Alignment assessment — 65 pages (pp. 49–113)
8 Model welfare assessment — 9 pages (pp. 114–122)
9 RSP evaluations — 25 pages (pp. 123–147)
The white box evaluation subsection (7.4) alone is 26 pages, longer than any other section!
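(The percentages are just arithmetic on the page ranges above; here’s a quick script to reproduce them, with section names shortened:)

```python
# Back-of-the-envelope check of the section shares, using the page ranges
# listed above (section names shortened).
ranges = {
    "Abstract": (1, 4), "Introduction": (5, 9), "Safeguards": (10, 18),
    "Honesty": (19, 23), "Agentic safety": (24, 30), "Cyber": (31, 44),
    "Reward hacking": (45, 48), "Alignment assessment": (49, 113),
    "Model welfare": (114, 122), "RSP evaluations": (123, 147),
}
total_pages = max(end for _, end in ranges.values())  # 147
for name, (start, end) in ranges.items():
    pages = end - start + 1
    print(f"{name}: {pages} pages ({pages / total_pages:.0%})")
# -> Alignment assessment: 65 pages (44%)
```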
Well obviously the cleanup nanobots eventually scrubbed all the evidence, then decomposed. :) /s
The situation you’re describing definitely concerns me, and is about mid-way up the hierarchy of nested problems as I see it (I don’t mean ‘hierarchy of importance’, I mean ‘spectrum from object-level empirical work to the realm of pure abstraction’).
I tried to capture this at the end of my comment, by saying that even success as I outlined it probably wouldn’t change my all-things-considered view (because there’s a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they’re making.
As is, their own tests say they’re doing poorly, and they’ll probably want to fix that in good faith before they try tackling the kind of dynamic group epistemic failures that you’re pointing at.
Thank you, but I’m afraid I’ve since moved on. I already tried my own attempt at it here. ATM I’m thinking of writing a short ebook. But great idea and 100% would have been on board with it if I hadn’t already tried to write a blog post a day.
I know you’re pointing out the easier case still not working, but I just want to caution against the “drive it to zero” mentality, since I worry strongly that it’s the exact mentality researchers often have.
When that’s your mental model, reducing rates will seem like progress.
The part I find unlikely is that we would not be able to see multiple failures along the way that are growing in magnitude
IMO the default failure mode here is:
1. We do observe them (or early versions of them)
2. The lab underinvests in the problem
3. It becomes enough of a problem that it’s painful for product or internal capabilities usage
4. We didn’t invest enough to actually solve the underlying problem, and we can’t afford to not use the model while we wait for alignment research to catch up
5. The lab patches over the problem with some “reduces but does not eliminate” technique
6. The model is then usable, but with harder-to-detect misalignment
7. Scale capabilities and repeat
This is the exact loop we’re in now, and the dynamics only intensify with time and capabilities.
Thanks, this sounds fair!
My prediction:
Nobody is actually going to use it. The general public has already started treating AI-generated content as pollution instead of something to seek out. Plus, unlike human-created shortform videos, a video generated by a model with a several-months-ago (at best) cutoff date can’t tell you what the latest fashion trends are. The release of Sora-2 has led me to update in favor of the “AI is a bubble” hypothesis because of how obviously disconnected it is from consumer demand.
I am a plus (not pro) user and I see o3 in the model selector under “Legacy models”