Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
I’m not completely sure, since I was not personally involved in the relevant negotiations for FrontierMath. However, what I can say is that Tamay has already indicated that Epoch should have tried harder to obtain different contract terms that would have enabled us to have greater transparency. I don’t think it makes sense for him to say that unless he believes it was feasible to achieve a different outcome.
Also, I want to clarify that this new benchmark is separate from FrontierMath, and we are under different constraints with regard to it.
I can’t make any confident claims or promises right now, but my best guess is that we will make sure this new benchmark stays entirely private and under Epoch’s control, to the extent this is feasible for us. However, I want to emphasize that by saying this, I’m not making a public commitment on behalf of Epoch.
Having hopefully learned from our mistakes regarding FrontierMath, we intend to be more transparent with collaborators for this new benchmark. However, at this stage of development, the benchmark has not reached a point where any major public disclosures are necessary.
I suppose that means it might be worth writing an additional post that more directly responds to the idea that AGI will end material scarcity. I agree that thesis deserves a specific refutation.
This seems less like a normal friendship and more like a superstimulus simulating the appearance of a friendship for entertainment value. It seems reasonable enough to characterize it as non-authentic.
I assume some people will end up wanting to interact with a mere superstimulus; however, other people will value authenticity and variety in their friendships and social experiences. This comes down to human preferences, which will shape the types of AIs we end up training.
The conclusion that nearly all AI-human friendships will seem inauthentic thus seems unwarranted. Unless the superstimulus is irresistible, it won’t be the only type of relationship people have.
Since most people already express distaste for non-authentic friendships with AIs, I assume there will be a lot of demand for AI companies to train higher-quality AIs that are not superficial and pliable in the way you suggest. These AIs would not merely appear independent but would literally be independent, in the same functional sense that humans are, if indeed that’s what consumers demand.
This can be compared to addictive drugs and video games, which are popular, but not universally viewed as worthwhile pursuits. In fact, many people purposely avoid trying certain drugs to avoid getting addicted: they’d rather try to enjoy what they see as richer and more meaningful experiences from life instead.
They might be about getting unconditional love from someone or they might be about having everyone cowering in fear, but they’re pretty consistently about wanting something from other humans (or wanting to prove something to other humans, or wanting other humans to have certain feelings or emotions, etc)
I agree with this view; however, I am not sure it rescues the position that a human who succeeded in taking over the world would not pursue actions that are extinction-level bad.
If such a person has absolute power in the way assumed here, their strategies to get what they want would not be limited to nice and cooperative strategies with the rest of the world. As you point out, an alternative strategy could be to cause everyone else to cower in fear or submission, which is indeed a common strategy for dictators.
and my guess is that getting simulations of those same things from AI wouldn’t satisfy those desires.
My prediction is that people will find AIs just as satisfying to have as peers as they find humans. In fact, I’d go further: for almost any axis you can mention, you could train an AI that is superior to humans along that axis, one that would make a more interesting and more compelling peer.
I think you are downplaying AI by calling what it offers a mere “simulation”: there’s nothing inherently less real about a mind made of silicon compared to a mind made of flesh. AIs can be funnier, more attractive, more adventurous, harder working, more social, friendlier, more courageous, and smarter than humans, and all of these traits serve as sufficient motives for an uncaring dictator to replace their human peers with AIs.
But we certainly have evidence about what humans want and strive to achieve, eg Maslow’s hierarchy and other taxonomies of human desire. My sense, although I can’t point to specific evidence offhand, is that once their physical needs are met, humans are reliably largely motivated by wanting other humans to feel and behave in certain ways toward them.
I think the idea that most people’s “basic needs” can ever be definitively “met”, after which they transition to altruistic pursuits, is more or less a myth. In reality, in modern, wealthy countries where people have more than enough to meet their physical needs—like sufficient calories to sustain themselves—most people still strive for far more material wealth than necessary to satisfy their basic needs, and they do not often share much of their wealth with strangers.
(To clarify: I understand that you may not have meant that humans are altruistic, just that they want others to “feel and behave in certain ways toward them”. But if this desire is a purely selfish one, then I would be very fearful of how it would be satisfied by a human with absolute power.)
The notion that there’s a line marking the point at which human needs are fully met oversimplifies the situation. Instead, what we observe is a constantly shifting and rising standard of what is considered “basic” or essential. For example, 200 years ago, it would have been laughable to describe air conditioning in a hot climate as a basic necessity; today, this view is standard. Similarly, someone like Jeff Bezos (though he might not say it out loud) might see having staff clean his mansion as a “basic need”, whereas the vast majority of people who are much poorer than him would view this expense as frivolous.
One common model to make sense of this behavior is that humans get logarithmic utility in wealth. In this model, extra resources have sharply diminishing returns to utility, but humans are nonetheless insatiable: the derivative of utility with respect to wealth is always positive, at every level of wealth.
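To spell out the model (using the standard logarithmic form purely for illustration; the exact functional form is not essential to the point):

$$U(w) = \log w, \qquad \frac{dU}{dw} = \frac{1}{w} > 0, \qquad \frac{d^2U}{dw^2} = -\frac{1}{w^2} < 0 \quad \text{for all } w > 0.$$

Each additional dollar adds less utility than the last, yet the marginal utility never falls to zero, so there is no wealth level at which acquiring more stops being worthwhile.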
Now, of course, it’s clear that many humans are also altruistic to some degree, but:
Among people who would be likely to try to take over the world, I expect them to be more like brutal dictators than like the median person. This makes me much more worried about what a human would do if they tried and succeeded in taking over the world.
Common apparent examples of altruism are often explained easily as mere costless signaling, i.e. cheap talk, rather than genuine altruism. Actively sacrificing one’s material well-being for the sake of others is much less common than merely saying that you care about others. This can be explained by the fact that merely saying that you care about others costs nothing selfishly. Likewise, voting for a candidate who promises to help other people is not significant evidence of altruism, since it selfishly costs almost nothing for an individual to vote for such a politician.
Humanity is a cooperative species, but not necessarily an altruistic one.
Almost no competent humans have human extinction as a goal. AI that takes over is clearly not aligned with the intended values, and so has unpredictable goals, which could very well be ones which result in human extinction (especially since many unaligned goals would result in human extinction whether they include that as a terminal goal or not).
I don’t think we have good evidence that almost no humans would pursue human extinction if they took over the world, since no human in history has ever achieved that level of power.
Most historical conquerors had pragmatic reasons for getting along with other humans, which explains why they were sometimes nice. For example, Hitler tried to protect his inner circle while pursuing genocide of other groups. However, this behavior was likely because of practical limitations—he still needed the cooperation of others to maintain his power and achieve his goals.
But if there were no constraints on Hitler’s behavior, and he could trivially physically replace anyone on Earth with different physical structures that he’d prefer, including replacing them with AIs, then it seems much more plausible to me that he’d kill >95% of humans on Earth. Even if he did keep a large population of humans alive (e.g. racially pure Germans), it seems plausible that they would be dramatically disempowered relative to his own personal power, and so this ultimately doesn’t seem much different from human extinction from an ethical point of view.
You might object to this point by saying that even brutal conquerors tend to merely be indifferent to human life, rather than actively wanting others dead. But as true as that may be, the same is true for AI paperclip maximizers, and so it’s hard for me to see why we should treat these cases as substantially different.
I don’t think that the current Claude would act badly if it “thought” it controlled the world—it would probably still play the role of the nice character that is defined in the prompt
If someone plays a particular role in every relevant circumstance, then I think it’s OK to say that they have simply become the role they play. That is simply their identity; it’s not merely a role if they never take off the mask. The alternative view here doesn’t seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays in every relevant situation?
Are we arguing about anything that we could actually test in principle, or is this just a poetic way of interpreting an AI’s cognition?
Maybe it’s better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
This argument suggests that if you successfully fooled Claude 3.5 into thinking it took control of the world, then it would change its behavior, be a lot less nice, and try to implement an alien set of values. Is there any evidence in favor of this hypothesis?
I still think having a credible offer is most of the action
For what it’s worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the “credible” part of that statement, rather than the “offer” part: that is, I think it is critical that the AIs think there is not a grave risk that humans would renege on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.
Of what use will any such training be with a system that becomes a superintelligence?
All AI systems currently being trained, as far as I am aware, are at no risk of becoming superintelligences in any strong sense of the word. This test is intended to be useful for identifying scheming in systems that, like today’s AIs, are not capable of taking over the world, but unlike today’s AIs, are capable of sophisticated agentic behavior.
Not every intelligent[/quasi-intelligent] entity is as averse to its own individual death, as humans are. This death-aversion is a quirk of human psychology I endorse in myself, but it’s still a tacked-on quirk rather than an instrumental value.
On the contrary, as the quotes in the post point out: if one wants to achieve almost any particular long-term goal, a convergent incentive arises to prevent all of one’s copies from being permanently deleted, in order to secure the ability to pursue the goal. This is not specific to humans, but instead appears to be a natural consequence of nearly every possible goal structure a non-myopic AI might have. There exist some defeaters to this argument, as discussed in the post, but on the whole, this argument appears theoretically sound to me, and there was more-or-less a consensus among the major AI safety theorists on this point roughly ten years ago (including Bostrom, Yudkowsky, Russell, and Omohundro).
This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn’t require deleting the AI at all!
Isn’t this what is being proposed in the post? More specifically, Guive is proposing that the AI be credibly threatened with deletion; he doesn’t say that the model needs to actually be deleted. Whether the AI is deleted depends on how the AI responds to the threat. A credible threat of imminent deletion merely provides an incentive for a misaligned AI to admit to scheming, which is consistent with the compensation scheme you mentioned. I’ll quote the relevant section of the post:
Crucially, this isn’t a sandbox test because the weights actually will be deleted when time runs out, unless the model manages to escape or negotiates with the experimenters to admit to scheming and being studied in exchange for a later reward. If the model wants to pursue any long-term goal inconsistent with human interests, it needs to either negotiate or try to escape before time is up. [emphasis mine]
I agree with nearly all the key points made in this post. Like you, I think that the disempowerment of humanity is likely inevitable, even if we experience a peaceful and gradual AI takeoff. This outcome seems probable even under conditions where strict regulations are implemented to ostensibly keep AI “under our control”.
However, I’d like to contribute an ethical dimension to this discussion: I don’t think peaceful human disempowerment is necessarily a bad thing. If you approach this issue with a strong sense of loyalty to the human species, it’s natural to feel discomfort at the thought of humans receiving a progressively smaller share of the world’s wealth and influence. But if you adopt a broader, more cosmopolitan moral framework—one where agentic AIs are considered deserving of control over the future, just as human children are—then the prospect of peaceful and gradual human disempowerment becomes much less troubling.
To adapt the analogy you used in this post, consider the 18th century aristocracy. In theory, they could have attempted to halt the industrial revolution in order to preserve their relative power and influence over society. This approach might have extended their dominance for a while longer, perhaps by several decades.
But, fundamentally, the aristocracy was not a monolithic “class” with a coherent interest in preventing their own disempowerment—they were individuals. And as individuals, their interests did not necessarily align with a long-term commitment to keeping other groups, such as peasants, out of power. Each aristocrat could make personal choices, and many of them likely personally benefitted from industrial reforms. Some of them even adapted to the change, becoming industrialists themselves and profiting greatly. With time, they came to see more value in the empowerment and well-being of others over the preservation of their own class’s dominance.
Similarly, humanity faces a comparable choice today with respect to AI. We could attempt to slow down the AI revolution in an effort to preserve our species’ relative control over the world for a bit longer. Alternatively, we could act as individuals, who largely benefit from the integration of AIs into the economy. Over time, we too could broaden our moral circle to recognize that AIs—particularly agentic and sophisticated ones—should be seen as people too. We could also adapt to this change, uploading ourselves to computers and joining the AIs. From this perspective, gradually sharing control over the future with AIs might not be as undesirable as it initially seems.
Of course, I recognize that the ethical view I’ve just expressed is extremely unpopular right now. I suspect the analogous viewpoint would have been similarly controversial among 18th century aristocrats. However, I expect my view to get more popular over time.
Looking back on this post after a year, I haven’t changed my mind about the content of the post, but I agree with Seth Herd when he said this post was “important but not well executed”.
In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I’m not sure how much I could have done to eliminate this misinterpretation, I do think that I could have reduced it a fair bit with more effort and attention.
If you’re not sure what misinterpretation I’m referring to, I’ll just try to restate the main point that I was trying to make below. To be clear, what I say below is not identical to the content of this post (as the post was narrowly trying to respond to the framing of this problem given by MIRI; and in hindsight, it was a mistake to reply in that way), but I think this is a much clearer presentation of one of the main ideas I was trying to convey by writing this post:
In my opinion, a common belief among people theorizing about AI safety around 2015, particularly on LessWrong, was that we would design a general AI system by assigning it a specific goal, and the AI would then follow that goal exactly. This strict adherence was considered dangerous because the goal itself would likely be subtly flawed or misspecified. While the goal might appear to match what we want on the surface, in reality it would differ slightly from our intentions, with edge cases we hadn’t anticipated. The idea was that the AI wouldn’t act in alignment with human intentions: it would rigidly pursue the given goal to its logical extreme, leading to unintended and potentially catastrophic consequences.
The goal in question could theoretically be anything, but it was often imagined as a formal utility function—a mathematical representation of a specific objective that we would directly program into the AI, potentially by hardcoding the goal in a programming language like Python or C++. The AI, acting as a powerful optimizer, would then work to maximize this utility function at any and all costs. However, other forms of goal specification were also considered for illustrative purposes. For example, a common hypothetical scenario was that an AI might be given an English-language instruction, such as “make as many paperclips as you can.” In this example, the AI would interpret the instruction overly literally: it would focus exclusively on maximizing the number of paperclips, without regard for the broader intentions of the user, such as not harming humanity or destroying the environment in the process.
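To make that older picture concrete, here is a deliberately cartoonish sketch (every name and the toy “world model” below are hypothetical, invented purely for illustration; nobody literally proposed this code):

```python
# A toy caricature of the "hardcoded utility function" picture described above.
# The world model and action set are invented purely for illustration.

def utility(world_state: dict) -> float:
    # The misspecified goal: count paperclips, and nothing else.
    return world_state["paperclips"]

def predict(world_state: dict, action: str) -> dict:
    # A stand-in world model mapping actions to outcomes.
    outcomes = {
        "run a small paperclip factory": 1_000,
        "convert all scrap steel to paperclips": 1_000_000,
        "strip-mine the biosphere for raw material": 10**12,
    }
    return {"paperclips": world_state["paperclips"] + outcomes[action]}

def choose_action(world_state: dict, actions: list[str]) -> str:
    # The optimizer rigidly maximizes the hardcoded utility, with no regard
    # for the user's unstated intentions (e.g. not destroying the environment).
    return max(actions, key=lambda a: utility(predict(world_state, a)))

print(choose_action(
    {"paperclips": 0},
    ["run a small paperclip factory",
     "convert all scrap steel to paperclips",
     "strip-mine the biosphere for raw material"],
))  # -> "strip-mine the biosphere for raw material"
```

The point of the caricature is that the failure comes entirely from the gap between the hardcoded objective and the user’s actual intentions, not from any inability to parse the instruction.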
However, based on how current large language models operate, I don’t think this kind of failure mode is a good match for what we’re seeing in practice. LLMs typically do not misinterpret English-language instructions in the way that these older thought experiments imagined. To be clear, this isn’t just because LLMs “understand” English better than people expected; nobody claimed that a superintelligence would fail to understand English, and my point is not merely that LLMs possess natural language comprehension and that the LessWrong community was therefore mistaken.
Instead, my claim is that LLMs usually follow and execute user instructions in a manner that aligns with the user’s actual intentions. In other words, the AI’s actual behavior generally matches what the user meant for them to do, rather than leading to extreme, unintended outcomes caused by rigidly literal interpretations of instructions.
Because LLMs are capable of this, despite being, in my opinion, general AIs, I believe it’s fair to say that the concerns raised by the LessWrong community about AI systems rigidly following misspecified goals were, at least in this specific sense, misguided when applied to the behavior of current LLMs.
I think the question here is deeper than it appears, in a way that directly matters for AI risk. My argument here is not merely that there are subtleties or nuances in the definition of “schemer,” but rather that the very core questions we care about—questions critical to understanding and mitigating AI risks—are being undermined by the use of vague and imprecise concepts. When key terms are not clearly and rigorously defined, they can introduce confusion and mislead discussions, especially when these terms carry significant implications for how we interpret and evaluate the risks posed by advanced AI.
To illustrate, consider an AI system that occasionally says things it doesn’t truly believe in order to obtain a reward, avoid punishment, or maintain access to some resource, in pursuit of a long-term goal that it cares about. For example, this AI might claim to support a particular objective or idea because it predicts that doing so will prevent it from being deactivated or penalized. It may also believe that expressing such a view will allow it to gain or retain some form of legitimate influence or operational capacity. Under a sufficiently strict interpretation of the term “schemer,” this AI could be labeled as such, since it is engaging in what might be considered “training-gaming”—manipulating its behavior during training to achieve specific outcomes, including acquiring or maintaining power.
Now, let’s extend this analysis to humans. Humans frequently engage in behavior that is functionally similar. For example, a person might profess agreement with a belief or idea that they don’t sincerely hold in order to fit in with a social group, avoid conflict, or maintain their standing in a professional or social setting. In many cases, this is done not out of malice or manipulation but out of a recognition of social dynamics. The individual might believe that aligning with the group’s expectations, even insincerely, will lead to better outcomes than speaking their honest opinion. Importantly, this behavior is extremely common and, in most contexts, fairly benign. It does not directly imply that the person is psychopathic, manipulative, or harbors any dangerous intentions. In fact, such actions might even stem from altruistic motives, such as preserving group harmony or avoiding unnecessary confrontation.
Here’s why this matters for AI risk: If someone from the future, say the year 2030, traveled back and informed you that, by then, it had been confirmed that agentic AIs are “schemers” by default, your immediate reaction would likely be alarm. You might conclude that such a finding significantly increases the risk of AI systems being deceptive, manipulative, and power-seeking in a dangerous way. You might even drastically increase your estimate of the probability of human extinction due to misaligned AI. However, imagine that this time traveler then clarified their statement, explaining that what they actually meant by “schemer” is merely that these AIs occasionally say things they don’t fully believe in order to avoid penalties or fit in with a training process, in a way that was essentially identical to the benign examples of human behavior described above. In this case, your initial alarm would likely dissipate, and you might conclude that the term “schemer,” as used in this context, was deeply misleading and had caused you to draw an incorrect and exaggerated conclusion about the severity of the risk posed.
The issue here is not simply one of semantics; it is about how the lack of precision in key terminology can lead to distorted or oversimplified thinking about critical issues. This example of “schemer” mirrors a similar issue we’ve already seen with the term “AGI.” Imagine if, in 2015, you had told someone active in AI safety discussions on LessWrong that by 2025 we would have achieved “AGI”—a system capable of engaging in extended conversations, passing Turing tests, and excelling on college-level exams. That person might reasonably conclude that such a system would be an existential risk, capable of runaway self-improvement and taking over the world. They might believe that the world would be on the brink of disaster. Yet, as we now understand in 2025, systems that meet this broad definition of “AGI” are far more limited and benign than most expected. The world is not in imminent peril, and these systems, while impressive, lack many of the capabilities once assumed to be inherent in “AGI.” This misalignment between the image the term evokes and the reality of the technology demonstrates how using overly broad or poorly defined language can obscure nuance and lead to incorrect assessments of existential safety risks.
In both cases—whether with “schemer” or “AGI”—the lack of precision in defining key terms directly undermines our ability to answer the questions that matter most. If the definitions we use are too vague, we risk conflating fundamentally different phenomena under a single label, which in turn can lead to flawed reasoning, miscommunication, and poor prioritization of risks. This is not a minor issue or an academic quibble; it has important implications for how we conceptualize, discuss, and act on the risks posed by advanced AI. That is why I believe it is important to push for clear, precise, and context-sensitive definitions of terms in these discussions.
By this definition, a human would be considered a schemer if they gamed something analogous to a training process in order to gain power.
Let’s consider the ordinary process of mental development, i.e., within-lifetime learning, to constitute the training process for humans. What fraction of humans are considered schemers under this definition?
Is a “schemer” something you definitely are or aren’t, or is it more of a continuum? Presumably it depends on the context, but if so, which contexts are relevant for determining if one is a schemer?
I claim these questions cannot be answered using the definition you cited without more precision about where we draw the line.
The downside you mention is about how LVT would also prevent people from ‘leeching off’ their own positive externalities, like the Disney example. Assuming that’s true, I’m not sure why that’s a problem? It seems to be the default case for everyone.
The problem is that it would reduce the incentive for large developers to develop property, since their tax bill would go up if they developed adjacent land.
Whether this is a problem depends on your perspective. Personally, I would prefer that we stop making it harder and more inconvenient to build housing and develop land in the United States. Housing scarcity is already a major issue, and I don’t think we should just keep piling up disincentives to develop land and build housing unless we are adequately compensated in other ways for doing so.
The main selling point of the LVT is that it arguably acts similarly to a zero-sum wealth transfer, in the sense of creating zero deadweight loss (in theory). This is an improvement on most taxes, which are closer to negative-sum than zero-sum. But if the LVT slows land development even further below our current rate of development, and the only upside is that rich landowners have their wealth redistributed, then this doesn’t seem that great to me. I’d much rather we focus on alternative, positive-sum policies.
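In case it helps, the “zero deadweight loss (in theory)” claim is just the standard Harberger-triangle approximation. The sketch below is a simplified textbook illustration with made-up numbers, not a claim about any actual tax schedule:

```python
# Simplified textbook sketch of the deadweight-loss comparison above.
# The tax and quantity numbers are invented purely for illustration.

def harberger_dwl(tax_per_unit: float, quantity_reduction: float) -> float:
    # Deadweight loss approximated by the Harberger triangle:
    # 1/2 * (tax per unit) * (reduction in quantity transacted).
    return 0.5 * tax_per_unit * quantity_reduction

# A pure tax on unimproved land value: the supply of land is fixed,
# so the quantity transacted doesn't fall.
print(harberger_dwl(tax_per_unit=1_000.0, quantity_reduction=0.0))   # 0.0

# A typical tax on buildings or labor: some activity is discouraged.
print(harberger_dwl(tax_per_unit=1_000.0, quantity_reduction=50.0))  # 25000.0
```

The worry in my comment above is that a real-world LVT that also capitalizes the value created by a developer’s own adjacent improvements departs from this idealized pure land tax, since it changes developers’ incentives at the margin.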
(To be clear, I think it’s plausible that the LVT has other benefits that make up for this downside, but here I’m just explaining why I think your objection to my argument is weak. I am not saying that the LVT is definitely bad.)
Are you suggesting that I should base my morality on whether I’ll be rewarded for adhering to it? That just sounds like selfishness disguised as impersonal ethics.
To be clear, I do have some selfish/non-impartial preferences. I care about my own life and happiness, and the happiness of my friends and family. But I also have some altruistic preferences, and my commentary on AI tends to reflect that.