Great post.
One thing that helps clarify the cup example: if I look at the cup without trying to interpret it, I can often find myself seeing separate images coming from each eye.
It’s possible to doubt this: maybe I’m unconsciously shifting the direction my eyes point, and the earlier experience really was of only a single cup. But regular science experiments can produce incorrect results too.
Maybe you’re thinking younger than I was thinking.
I expect you’d mostly want folks who’d already completed an undergraduate degree, with sufficiently talented folks being pulled in earlier.
Funnily enough I was thinking about this yesterday and wondering if I’d be able to find it, so great timing! Thanks for the comment.
Yeah, I definitely have worries about this as well. Nonetheless, I would prefer for discussion to happen somewhere rather than nowhere at all.
I might comment there, but it’s hard to know how busy I’ll be.
How long does SPARC go for?
Random thought: we should expect LLMs trained on user responses to have much more situational knowledge than early LLMs trained on the pre-chatbot internet, because users will occasionally make reference to the meta-context.
It may be possible to get some of this information from pre-training on chatlogs/excerpts that make their way onto the internet, but the information won’t be quite as accessible because of differences in the context.
If this were a story, there’d be some kind of academy taking in humanity’s top talent and skilling them up in alignment.
Most of the summer fellowships seem focused on finding talent that is immediately useful. And I can see how this is tempting given the vast numbers of experienced and talented folks seeking to enter the space. I’d even go so far as to suggest that the majority of our efforts should probably be focused on finding people who will be useful fairly quickly.
Nonetheless, it does seem as though there should be at least one program that aims to find the best talent (even if they aren’t immediately useful) and which provides them with the freedom to explore and the intellectual environment in which to do so.
I wish I could articulate the intuition behind this more clearly, but the best I can say for now is that continuing to scale existing fellowships would likely provide diminishing marginal returns, whereas such an academy wouldn’t be subject to this because it would be providing a different kind of talent.
I decided to give this a go with the new Gemini 2.5 model. I’ll present the output first and my comments below:
Prompt: Write a Wittgensteinian critique of John Wentworth’s natural abstraction hypothesis.
Answer: Okay, let’s approach John Wentworth’s Natural Abstraction Hypothesis (NAH) from a Wittgensteinian perspective. This requires setting aside the search for a definitive “theory” of abstraction and instead focusing on how we use the concept of “abstraction” and related terms, and the philosophical confusions that might arise.
Wentworth’s Natural Abstraction Hypothesis (Simplified):
NAH posits that abstractions aren’t just convenient human constructs but reflect “natural” structural properties of the world. Certain high-level summaries (abstractions) of low-level physical systems are robustly useful for prediction and control because the underlying causal dynamics “clump” or “factorize” in particular ways. Different intelligent systems, interacting with the same reality, will likely converge on similar abstractions because these summaries capture real, objective features of the system’s causal architecture. Examples might include concepts like “object,” “temperature,” or “pressure.”
A Wittgensteinian Critique:
A Wittgensteinian critique wouldn’t necessarily deny the phenomena NAH points to (e.g., that different models often converge on similar high-level variables like temperature). Instead, it would focus on the philosophical interpretation and the language used, highlighting potential conceptual confusions:
The Grammar of “Abstraction”:
NAH’s Picture: NAH seems to treat “abstraction” as a noun referring to a thing or structure that exists “out there” in the world, waiting to be “discovered” (like a new particle). It speaks of abstractions being “natural” or “objective.”
Wittgenstein’s Approach: Wittgenstein would urge us to look at how we use the word “abstraction” and related terms (“summary,” “level of description”). We use it in diverse contexts (mathematics, art, science, everyday life). Is there one single essence? Unlikely. Instead, “abstraction” functions within specific language games. In science, an abstraction (like “temperature”) is a concept embedded within a practice of measurement, prediction, theory-building, and instrument use. Its meaning arises from this use, not from mirroring a pre-existing “natural abstraction.”
Critique: NAH risks reifying the concept of abstraction, turning a feature of our representational practices into an independent feature of the world. The “naturalness” might simply reflect the success and inter-subjective agreement achieved within certain scientific or engineering language games, driven by shared goals (prediction, control) and interaction with the world.
“Discovery” vs. “Invention” or “Utility”:
NAH’s Picture: The language of “discovery” implies finding something pre-existing and independent of the observer.
Wittgenstein’s Approach: Wittgenstein often dissolved such dichotomies. Are mathematical theorems discovered or invented? The question might be misleading. We develop concepts and rules within a system (a language game, a calculus). An abstraction like “temperature” wasn’t simply “found.” It was developed alongside practices of thermometry, theories of thermodynamics, and specific technological needs. Its “objectivity” lies in its role within this established, rule-governed practice and its reliable utility for interacting with the world in ways we care about.
Critique: Calling abstractions “natural discoveries” obscures the role of human (or agent) purpose, criteria, and methods. We select summaries that are useful for our purposes. The causal structure of the world certainly constrains what summaries are useful, but it doesn’t uniquely determine the abstraction. Different agents with different goals or sensory apparatus might find different “summaries” useful, yet NAH seems to privilege a particular kind based on physical causality relevant to prediction/control.
The Lure of the Causal/Physical Picture:
NAH’s Picture: NAH grounds abstraction in the “causal structure of reality.” This gives it a pleasingly objective, scientific feel.
Wittgenstein’s Approach: Wittgenstein was wary of “craving for generality” and the tendency to be held captive by a particular picture (e.g., the scientific picture of everything being reducible to physics). While physics describes the world, our concepts (including abstractions) are part of our form of life, our way of navigating and making sense of the world. “Causality” itself is a complex concept whose grammar needs careful examination, not just taken as a bedrock feature onto which abstractions are mapped.
Critique: NAH might be overly influenced by the physicist’s view. Why should physical causal structure be the sole arbiter of what constitutes a “natural” abstraction? Biological abstractions (like “species,” notoriously fuzzy), social abstractions (“justice”), or computational abstractions (“function”) operate under different criteria and within different language games. Grounding all “natural” abstraction solely in low-level causal summaries seems like an over-extension of one successful, but specific, language game (physics/engineering).
Levels of Description and Usefulness:
NAH’s Picture: Abstractions are useful summaries because they discard irrelevant low-level detail while preserving predictive power.
Wittgenstein’s Approach: This description of usefulness is fine, but Wittgenstein would ask: Useful for whom? Useful for what purpose? Criteria for “relevance” are not inherent in the low-level details themselves; they are imposed by the agent interacting with the system based on its goals and context. A thermostat needs “temperature.” A Maxwell’s Demon (hypothetically) might operate better using individual particle kinetics. Which abstraction is “natural”? The question dissolves when we see that the abstraction is defined by its role within a particular system of practice and purpose.
Critique: NAH’s focus on the objective “summarizing” power risks downplaying the constitutive role of the user and their context in determining what counts as a salient or useful abstraction. The “naturalness” seems less about the world in itself and more about a robust relationship between certain world structures and the predictive/control goals of certain kinds of agents.
Conclusion from a Wittgensteinian Standpoint:
A Wittgensteinian wouldn’t necessarily say NAH is “wrong” about the empirical observation that similar high-level descriptions often emerge. However, they would be deeply skeptical of the philosophical framing. NAH appears to take concepts deeply embedded in our practices (“abstraction,” “summary,” “level”) and project them onto the world as independent, discoverable entities (“natural abstractions”).
The critique would suggest that:
NAH reifies “abstraction” and seeks a general theory where a careful description of use in different contexts is needed.
The language of “discovery” obscures the role of criteria, purpose, and practice in forming and validating abstractions.
The grounding in “physical causal structure” might be an over-generalization from one successful scientific picture, neglecting the diversity of ways we form and use abstractions.
The “naturalness” likely points to the robust utility of certain conceptual tools within specific forms of life or goal-oriented practices, rather than to observer-independent structures waiting to be found.
Instead of seeking the “objective basis” of abstraction in the world’s causal structure, Wittgenstein might suggest we investigate the grammar of how we use abstractions as tools to navigate, predict, and simplify our interactions with a complex reality, acknowledging that these tools are shaped by our needs, goals, and shared practices. The convergence NAH seeks to explain might be better understood through shared environments, shared problems, and the constraints these place on effective action and representation, rather than through discovering pre-packaged “natural abstractions.”
This seems to be significantly higher quality than when I tried ChatGPT approximately two years ago. For a start, the answer is now good enough to post, without me having to prompt it to make edits. Secondly, its reasoning feels much less fuzzy. It illustrates its points with examples and engages more closely with the kinds of things that folks might say.
I would be surprised if this were a key crux for more than a few folks.
My intuition is that people’s cruxes are much more likely to be things like “AI development will be slow, so society will have time to adapt”, “there are many more good guys than bad guys” or “power concentration is sufficiently terrifying that we have to bet on the offense-defense balance being favourable”.
I really liked the analogy of taking actions, falling asleep then waking up (possibly with some modifications) and continuing.
I was already aware of your main point, but the way you’ve described it is a much clearer way of thinking about this.
There was talk before about creating a new forum for AI policy discussion and honestly, I suspect that would be a better idea. Policy folks would be pretty reluctant to comment here because it doesn’t really match their vibe and also because of how it could be painted by bad faith actors.
Recently, the focus of mechanistic interpretability work has shifted to thinking about “representations”, rather than strictly about entire algorithms
Recently? From what I can tell, this seems to have been a focus from the early days (1, 2). That said, great post! I really appreciated your conceptual frames.
I’m quite surprised that you’re so critical of attempts to interpolate from the METR results (not enough data points), but A-okay with trying to read tea leaves from the interest rate, a single data point that is affected by all kinds of factors such as whether people expect Trump to crash the economy by bringing back mercantilism.
I’m not saying it’s invalid to critique predictions based on METR, I just don’t think you’re applying consistent standards.
Collapsible boxes are amazing. You should consider using them in your posts.
They are a particularly nice way of providing a skippable aside: for example, filling in background information, answering an FAQ or including evidence to support an assertion.
Compared to footnotes, collapsible boxes are more prominent and are better suited to containing paragraphs or formatted text.
LessWrong might want to consider seeking VC funding for its forum software in order to deal with the funding crunch. It’s great software: it wouldn’t surprise me if there were businesses that would pay for it, and it could allow an increase in the rate of development. There are several ways this could go wrong, but it at least seems worth considering.
Great post. I think some of your frames add a lot of clarity and I really appreciated the diagrams.
One subset of AI for AI safety that I believe to be underrated is wise AI advisors[1]. Some of the areas you’ve listed (coordination, helping with communication, improving epistemics) intersect with this, but I don’t believe that this exhausts the wisdom frame, especially since the first two were only mentioned in the context of capability restraint. You also mention civilizational wisdom as a component of backdrop capacity and I agree that this is a very diffuse factor. At the same time, a less diffuse intervention would be to increase the wisdom of specific actors.
You write: “If efforts to expand the safety range can’t benefit from this kind of labor in a comparable way… then absent large amounts of sustained capability restraint, it seems likely that we’ll quickly end up with AI systems too capable for us to control”.
I agree. In fact, a key reason why I think this is important is that we can’t afford to leave anything on the table.
One of the things I like about the approach of training AI advisors is that humans can compensate for weaknesses in the AI system. In other words, I’m introducing a third category of labour: human-AI cybernetic systems, or centaur labour. I think this is likely to widen the sweet spot; however, we have to make sure that we do this in a way that differentially benefits safety.
You do discuss the possibility of using AI to unlock enhanced human labour. It would also be possible to classify such centaur systems under this designation.
[1] More broadly, I think there’s merit to the cyborgism approach, even if some of the arguments are less compelling in light of recent capabilities advances.
This seems to underrate the value of distribution. I suspect another factor to take into account is the degree of audience overlap. Like there’s a lot of value in booking a guest who has been on a bunch of podcasts, so long as your particular audience isn’t likely to have been exposed to them.
The way I’m using “sensitivity”: sensitivity to X = the meaningfulness of X spurs responsive caring action.
I’m fine with that, although it seems important to have a term for the more limited sense of sensitivity so we can keep track of that distinction: maybe adaptability?
One of the main concerns of the discourse of aligning AI can also be phrased as issues with internalization: specifically, that of internalizing human values. That is, an AI’s use of the word “yesterday” or “love” might only weakly refer to the concepts you mean.
Internalising values and internalising concepts are distinct. I can have a strong understanding of your definition of “good” and do the complete opposite.
This means being open to some amount of ontological shifts in our basic conceptualizations of the problem, which limits the amount you can do by building on current ontologies.
I think it’s reasonable to say something along the lines of: “AI safety was developed in a context where most folks weren’t expecting language models before ASI, so insufficient attention has been given to the potential of LLMs to help fill in or adapt informal definitions. Even though folks who feel we need a strongly principled approach may be skeptical that this will work, there’s a decent argument that it should increase our chances of success on the margins”.
That’s the job of this paper: Substrate-Sensitive AI-risk Management.
That link is broken.
This article is extremely well written and I really appreciated how well he supported his positions with facts.
However, the article seems to suggest that he doesn’t quite understand the argument for making alignment the priority. This is understandable, as it’s rarely articulated clearly. The core limitation of differential tech development/d/acc/coceleration is that these kinds of imperfect defenses only buy time (a judgment that can be justified with the sources he links in the article). An aligned ASI, if it were possible, would be capable of a degree of perfection beyond that of human institutions, which would give us a stable long-term solution. Plans that involve less powerful AIs or a more limited degree of alignment mostly do not.