If it fails only once we are well past AlphaZero-level systems, or even just past more moderate superhuman AI research, that is good, as it means the “automate AI alignment” plan has a safe buffer zone.
If it fails before AI automates AI research, that is also good, because it forces AI firms to invest in alignment.
That assumes AI firms learn the needed lessons from the failures. Our experience shows that they don’t: they keep building systems that are predictably unsafe and exploitable, and they have no serious plans to change their deployments, much less to actually build a safety-oriented culture.
First, strongly agreed on the central point. I think that as a community, we’ve been investing too heavily in the tractable approaches (interpretability, testing, etc.) without putting the broader alignment issues front and center. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.
That said, I am concerned about what happens if interpretability is wildly successful, against your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it gets past the issues you note under “miss things,” “measuring progress,” and “scalability,” partly for the reasons you discuss under obfuscation and reliability. Wildly successful and scalable interpretability, without solving the other parts of alignment, would very plausibly still leave us with a dangerously misaligned system, and the detection methods themselves arguably exacerbate the problem. I outlined my concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)