I think that’s what they meant you should not do when they said [edit to add: directly quoting a now-modified part of the footnote] “Bulk preorders don’t count, and in fact hurt.”
My attitude here is something like “one has to be able to work with moral monsters”.
You can work with them without inviting them to hang out with your friends.

This flavor of boycotting seems like it would generally be harmful to one’s epistemics to adopt as a policy.
Georgia did not say she was boycotting, nor calling for others not to attend—she explained why she didn’t want to be at an event where he was a featured speaker.
This seems mostly right, except that it’s often hard to parallelize work and manage large projects—which seems like it slows things down in important ways. And, of course, some work is strongly serial, requiring time that can’t be sped up via more compute or more people. (See: PM hires 9 women to have a baby in one month.)
Similarly, running 1,000 AI research groups in parallel might get you the same 20 insights 50 times, rather than generating far more insights. And managing and integrating the research, and deciding where to allocate research time, plausibly gets harder at more than a linear rate with more groups.
So overall, the model seems correct, but I think the 10x speed-up is more likely than the 20x speed-up.
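This is essentially an Amdahl’s-law point. A minimal sketch with made-up numbers (my illustration, not anything from the model being discussed): if even 10% of the research pipeline is irreducibly serial, adding parallel groups caps the overall speed-up near 10x.

```python
def amdahl_speedup(serial_fraction: float, parallel_factor: float) -> float:
    """Amdahl's law: overall speedup when only the parallelizable
    fraction of the work benefits from more workers or compute."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_factor)

# Illustrative numbers only: a 10% serial fraction caps speed-up near 10x,
# no matter how many parallel "research groups" are added.
for groups in (10, 100, 1000):
    print(groups, round(amdahl_speedup(0.10, groups), 1))
# 10 -> 5.3, 100 -> 9.2, 1000 -> 9.9
```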
I have a biologically hardwired preference for defeating and hurting those who oppose me vigorously. I work very hard to sideline that biologically hardwired preference.
This seems like a very bad analogy, which is misleading in this context. We can usefully distinguish between evolutionarily beneficial instrumental strategies which are no longer adaptive and actively sabotage our other preferences in the modern environment, and preferences that we can preserve without sacrificing other goals.
CoT monitoring seems like a great control method when available
As I posted in a top-level comment, I’m not convinced that even success would be a good outcome. I think that even if we get this working 99.999% reliably, we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.
First, strongly agreed on the central point—I think that as a community, we’ve been investing too heavily in the tractable approaches (interpretability, testing, etc.) without giving the broader alignment issues center stage. This has led to lots of bikeshedding, lots of capabilities work, and yes, some partial solutions to problems.
That said, I am concerned about what happens if interpretability is wildly successful—against your expectations. That is, I see interpretability as a concerning route to attempted alignment even if it succeeds in getting past the issues you note on “miss things,” “measuring progress,” and “scalability,” partly for reasons you discuss under obfuscation and reliability. A system built on wildly successful and scalable interpretability, without the other parts of alignment solved, would very plausibly still function as a very dangerously misaligned system, and the methods for detection themselves arguably exacerbate the problem. I outlined my potential concerns about this case in more detail in a post here. I would be very interested in your thoughts about this. (And thoughts from @Buck / @Adam Shai as well!)
If it fails once we are well past AlphaZero, or even just more moderate superhuman AI research, this is good, as this means the “automate AI alignment” plan has a safe buffer zone.
If it fails before AI automates AI research, this is also good, because it forces them to invest in alignment.
That assumes AI firms learn the lessons needed from the failures. Our experience shows that they don’t, and they keep making systems that predictably are unsafe and exploitable, and they don’t have serious plans to change their deployments, much less actually build a safety-oriented culture.
Because they are all planning to build agents that will have optimization pressures, and RL-type failures apply when you build RL systems, even if it’s on top of LLMs.
Responses to o4-mini-high’s final criticisms of the post:
Criticism: “You’re treating hyper-introspection (internal transparency) as if it naturally leads to embedded agency (full goal-driven self-modification). But in practice, these are distinct capabilities. Why do you believe introspection tools would directly lead to autonomous, strategic self-editing in models that remain prediction-optimized?”
Response: Yes, these are distinct, and one won’t necessarily lead to the other—but both are being developed by the same groups in order to deploy them. There’s a reasonable question about how linked they are, but I think there is a strong case that self-modifying via introspection, even if only done during training and via internal deployment, would lead to much more dangerous and hard-to-track deception.
Criticism: “You outline very plausible risks but don’t offer a distribution over outcomes. Should we expect hyper-introspection to make systems 10% more dangerous? 1000%? Under what architectures? I’d find your argument stronger if you were more explicit about the conditional risk landscape.”
Response: If we don’t solve ASI alignment, which no one seems to think we can do, we’re doomed once we build misaligned ASI. This seems to get us there more quickly. Perhaps it even reduces short-term risks, but I think timelines are far more uncertain than the way the risks will emerge if we build systems that have these capabilities.

Criticism: “Given that fully opaque systems are even harder to oversee, and that deception risk grows with opacity too, shouldn’t we expect that some forms of introspection are necessary for any meaningful oversight? I agree hyper-introspection could be risky, but what’s the alternative plan if we don’t pursue it?”
Response: Don’t build smarter than human systems. If you are not developing ASI, and you want to monitor current and near future but not inevitably existentially dangerous systems, work on how humans can provide meaningful oversight in deployment instead of tools that enhance capabilities for accelerating the race—because without fixing the underlying dynamics, i.e. solving alignment, self-monitoring is a doomed approach.
Criticism: “You assume that LLMs could practically trace causal impact through their own weights. But given how insanely complicated weight-space dynamics are even for humans analyzing small nets, why expect this capability to arise naturally, rather than requiring radical architectural overhaul?”
Response: Yes, maybe Anthropic and others will fail, and building smarter-than-human systems might not be possible. Then strong interpretability is just a capability enhancer, and doesn’t materially change the largest risks. That would be great news, but I don’t want to bet my kids’ lives on it.
In general, you can mostly solve Goodhart-like problems in the vast majority of the experienced range of actions, and have it fall apart only in more extreme cases. And reward hacking is similar. This is the default outcome I expect from prosaic alignment—we work hard to patch misalignment and hacking, so it works well enough in all the cases we test and try, until it doesn’t.
Quick take: it’s focused on interpretability as a way to solve prosaic alignment, ignoring the fact that prosaic alignment is clearly not scalable to the types of systems they are actively planning to build. (And it seems to actively embrace the fact that interpretability is a capabilities advantage in the short term, but pretends that it is a safety thing, as if the two are not at odds with each other when engaged in racing dynamics.)
...yet it hasn’t happened, which is pretty strong evidence the other way.
I think you are fooling yourself about how similar people in 1600 are to people today. The average person at the time was illiterate, superstitious, and could maybe do single digit addition and subtraction. You’re going to explain nuclear physics?
This doesn’t matter for predicting the outcome of a hypothetical war between 16th century Britain and 21st century USA.
If AI systems can make 500 years of progress before we notice they’re uncontrolled, that already assumes an insanely strong superintelligence.
We could probably understand how a von Neumann probe or an anti-aging cure worked too, if someone taught us.
Probably, if it’s of a type we can imagine and is comprehensible in those terms—but that’s assuming the conclusion! As Gwern noted, we can’t understand chess endgames. Similarly, in the case of a strong ASI, the ASI-created probe or cure could look less like an engineered, purpose-driven system that is explainable at all, and more like a random set of actions, not explainable in our terms, that nonetheless causes the outcome.
We can point to areas of chess like the endgame databases, which are just plain inscrutable
I think there is a key difference in places where the answers come from exhaustive search rather than from more intelligence—AI isn’t better at that than humans, and from the little I understand, AI doesn’t outperform in endgames (compared to its overperformance in general) via better policy engines; it does so via direct memorization or longer lookahead.

The difference matters even more for other domains with far larger action spaces, since the exponential increase makes intelligence less marginally valuable at finding increasingly rare solutions. The design space for viruses is huge, and the design space for nanomachines using arbitrary configurations is even larger. If move-37-like intuitions are common, AIs will be able to do things humans cannot understand, whereas if it’s more like chess endgames, they will need to search an exponential space in ways that are infeasible for them.
This relates closely to a folk theorem about NP-complete problems: problems that are exponential in the worst case are often approximately solvable with greedy algorithms in n log n or n^2 time—TSP is NP-complete, but actual salesmen find sufficiently efficient routes easily.

But what part are you unsure about?
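To make the TSP point above concrete, here is a minimal sketch (plain Python; the heuristic choice and numbers are illustrative, not from the original discussion) of the greedy nearest-neighbor route—an O(n^2) pass that typically finds a serviceable tour even though exact TSP is NP-complete.

```python
import math
import random

def nearest_neighbor_tour(points):
    """Greedy nearest-neighbor heuristic for TSP, O(n^2) time.
    Not optimal, but usually a serviceable route on typical inputs."""
    unvisited = set(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        # Greedy choice: jump to the closest remaining city.
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(points, tour):
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(200)]
print(f"greedy tour length: {tour_length(cities, nearest_neighbor_tour(cities)):.2f}")
```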
Yeah, on reflection, the music analogy wasn’t a great one. I am not concerned that pattern creation that we can’t intuit could exist—humans can do that as well. (For example, it’s easy to make puzzles no-one can solve.) The question is whether important domains are amenable to kinds of solutions that ASI can understand robustly in ways humans cannot. That is, can ASI solve “impossible” problems?
One specific concerning difference is whether ASI could play perfect social 12-D chess by being a better manipulator, despite all of the human-experienced uncertainties, and engineer arbitrary outcomes in social domains. There clearly isn’t a feasible search strategy with exact evaluation, but if it is far smarter than “human-legible ranges” of thinking, it might be possible.
This isn’t just relevant for AI risk, of course. Another area is biological therapies, where, for example, it seems likely that curing or reversing aging requires the same sort of brilliant insight into insane complexity—figuring out whether there would be long-term or unexpected out-of-distribution impacts years later, without actually conducting multi-decade, large-scale trials.
Cool work, and I like your book on topological data analysis—but you seem to be working on accelerating capabilities instead of doing work on safety or interpretability. That seems bad to me, but it also makes me wonder why you’re sharing it here.
On the other hand, I’d be very interested in your thoughts on approaches like singular learning theory.
I’ve been wondering about superintelligence as a concept for a long time, and want to lay out two distinct possibilities: either it’s boundedly complex and capable, or it’s not bounded, and can scale to impossible-to-understand levels.
In the first case, think of chess; superhuman chess still plays chess. You can watch AlphaZero’s games and nod along—even if it’s alien, you get what it’s doing; the structure of the chess “universe” is such that unbounded intelligence still leads to mostly understandable moves. This seems to depend on domain. For AlphaGo, it’s unclear to me whether move 37 is fundamentally impossible to understand in Go-expert terms, or just a new style of play that humans can now understand by reformulating their understanding of Go in some way.
In the near term, there’s a reason to think that even superhuman AI would stay within human-legible ranges of decisions. An AI tasked with optimizing urban environments might give us efficient subway systems and walkable streets—but the essence of the city is for human residents, and legibility and predictability are presumably actually critical criteria. If a superintelligent designer produced fractal and biologically active cities out of a Lovecraftian fever-dream that are illegible to humans, they would be ineffective cities.
A superhuman composer might write music that breaks our rules but still stirs our souls. But a superintelligence might instead write music that sounds to us like static, full of brilliant structure that human brains cannot comprehend. Humans might be unable to tell whether it’s genius or gibberish—but are such heights of genius a real thing? I am unsure.
The question I have, then, is whether heights of creation inaccessible to human minds are a real coherent idea. If they are, perhaps the takeover of superhuman AI would happen in ways that we cannot fathom. But if not, it seems far more likely that we end up disempowered rather than eliminated in the blink of an eye.
There are a number of ways in which the US seems to have better values than the CCP, by my lights, but it seems incredibly strange to claim that the US values egalitarianism, social equality, or harmony more.
Rule of law, fostering diversity, encouraging human excellence? Sure, there you would have an argument. But egalitarian?
Very interesting work. One question I’ve had about this is whether humans can do such planning ‘natively’, i.e. in our heads, or if we’re using tools in ways that are essentially the same as doing “model-based planning inefficiently, with… bottleneck being a potential need to encode intermediate states.”
They edited the text. It was an exact quote from the earlier text.