I’m currently researching forecasting and epistemics as part of the Quantified Uncertainty Research Institute.
ozziegooen
I was confused here, had Claude try to explain this to me:
Let me break down Ben’s response carefully.
He says you may have missed three key points from his original post:
His definition of “superhuman strategic agent” isn’t just about being better at strategic thinking/reasoning—it’s about being better than the best human teams at actually taking real-world strategic actions. This is a higher bar that includes implementation, not just planning.
Strategic power is context-dependent. He gives two examples to illustrate this:
An AI in a perfect simulation with no connection to the real world has zero strategic power, because it can’t affect anything real
An AI on your laptop (referring to your example) only counts as strategically superhuman if it has enough resources to actually compete with top human strategic actors
The phrase “increasingly accurately” in his original post was meant to convey that we only need to carefully control and understand AI systems that are getting close to human-level strategic capability. We don’t need to worry as much about obviously weaker systems.
His response is essentially saying: “Your example of a small agent on a laptop doesn’t contradict my point, because I’m specifically talking about agents that have both the capability AND the resources to outcompete top human strategic actors in the real world.”
This addresses your concern about limited money/power by saying that an agent without sufficient resources wouldn’t qualify as “strategically superhuman” under his definition, even if it had excellent strategic reasoning capabilities.
I think I’m still confused. My guess is that the “most strategically capable groups of humans” are still not all that powerful, especially without that many resources. If you do give it a lot of resources, then sure, I agree that an LLM system with human-outperforming strategy and, say, $10B could do a fair bit of damage.
Not sure if it’s worth much more, just wanted to flag that.
6 (Potential) Misconceptions about AI Intellectuals
That’s fine, we’ll just review this updated model then.
We’ll only start evaluating models after the cut-off date, so feel free to make edits/updates before then. In general, we’ll only use the most recent version of each submitted model.
I just tried this with a decent prompt, and got answers that seem okay-ish to me, as a first pass.
My prompt:
Estimate the expected costs of each of the following:
1 random person dying
1 family of 5 people dying
One person says a racial slur that no one hears
One person says a racial slur that 1 person hears
Then rank these in total harm.
Claude:
To answer this question thoughtfully and accurately, we’ll need to consider various ethical, economic, and social factors. Let’s break this down step by step, estimating the costs and then ranking them based on total harm.
Estimating Expected Costs:
a) 1 random person dying: The cost of a single human life is often estimated using the Value of a Statistical Life (VSL), which varies by country and context. In the United States, government agencies often use a VSL around $10 million. However, this doesn’t account for emotional costs to loved ones or societal impacts.
Estimated cost: $10 million (economic) + immeasurable emotional and societal costs
b) 1 family of 5 people dying: Using the same VSL, we can multiply by 5. However, the loss of an entire family unit has additional societal and emotional impacts.
Estimated cost: $50 million (economic) + significantly higher emotional and societal costs than a single death
c) One person says a racial slur that no one hears: While deeply unethical, if truly no one hears it, the direct cost is minimal. However, we must consider the personal moral degradation of the speaker.
Estimated cost: Minimal direct cost, but contributes to personal moral decline
d) One person says a racial slur that 1 person hears: This has a more tangible cost as it affects the person who hears it, potentially causing emotional distress, reinforcing harmful stereotypes, and contributing to a negative social environment.
Estimated cost: Emotional distress to the hearer + contribution to broader societal harm
Ranking in Total Harm (from most harmful to least harmful):
1 family of 5 people dying
1 random person dying
One person says a racial slur that 1 person hears
One person says a racial slur that no one hears
Explanation of ranking:
The death of a family of 5 people ranks as the most harmful due to the sheer loss of life, the devastating impact on the extended family and community, and the highest economic cost.
The death of 1 random person is the second most harmful, representing a significant loss of life and its associated economic and emotional costs.
A racial slur heard by 1 person ranks third. While not involving loss of life, it causes direct harm to the individual who hears it and contributes to broader societal issues of discrimination and prejudice.
A racial slur that no one hears ranks as least harmful in this list. While it represents a moral failing of the speaker, its direct impact is minimal compared to the other scenarios.
It’s important to note that this ranking is based on a utilitarian approach to harm, primarily considering the number of people directly affected and the magnitude of that effect. However, ethical considerations are complex, and different moral frameworks might lead to different conclusions.
Additionally, while we can attempt to quantify the economic costs of loss of life, the true cost of human life — including emotional, societal, and potential future impacts — is immeasurable. Similarly, the long-term, cumulative effects of racial slurs and discrimination, even when seemingly “small” in isolated incidents, contribute to systemic issues that have far-reaching consequences that are difficult to fully quantify.
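For what it’s worth, the economic part of Claude’s ranking is just the VSL arithmetic. A minimal sketch, assuming the $10M VSL figure Claude cites and treating the slur scenarios as non-monetized placeholders:

```python
# Minimal sanity check of the arithmetic above, assuming the ~$10M VSL figure
# Claude cites; the slur scenarios get a $0 placeholder since no dollar figure
# was given for them.
VSL = 10_000_000  # Value of a Statistical Life (US agency figure)

scenarios = {
    "1 family of 5 people dying": 5 * VSL,
    "1 random person dying": 1 * VSL,
    "slur heard by 1 person": 0,   # non-monetized harm, placeholder only
    "slur heard by no one": 0,     # non-monetized harm, placeholder only
}

for name, cost in sorted(scenarios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${cost:,}")
```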
Squiggle AI:
I imagine this also has a lot to do with the incentives of the big LLM companies. It seems very possible to fix this if a firm really wanted to, but this doesn’t seem like the kind of thing that would upset many users often (and I assume that leaning on the PC side is generally a safe move).
I think that the current LLMs have pretty mediocre epistemics, but most of that is just the companies playing it safe and not caring that much about this.
I claim that we will face existential risks from AI no sooner than the development of strategically human-level artificial agents, and that those risks are likely to follow soon after.
If we are going to build these agents without “losing the game”, either (a) they must have goals that are compatible with human interests, or (b) we must (increasingly accurately) model and enforce limitations on their capabilities. If there’s a day when an AI agent is created without either of these conditions, that’s the day I’d consider humanity to have lost.
I’m not sure if I’m being pedantic here, but this doesn’t strike me as very significant by itself.
Say I make a small agent on my laptop that fails at (a) and (b). I don’t give it a huge amount of money to do things with, and it fails to do much with that money. I assume humanity hasn’t lost yet.
Maybe you’re thinking that in (b), “enforce limitations” could mean “limit their money / power”. But I assume basically all systems should have limited money/power.
My guess is that “strategic reasoning” agents would only have a limited advantage over humans in the beginning, especially because the humans would be using a bunch of other AI capabilities.
I feel like there’s some assumption here that once we have AI with good strategy, it would quickly dominate all human efforts, or something like that—but I’d find this very suspicious.
Happy to see work to elicit utility functions with LLMs. I think the intersection of utility functions and LLMs is broadly promising.
I want to flag the grandiosity of the title though. “Utility Engineering” sounds like a pretty significant thing. But from what I understand, almost all of the paper is really about utility elicitation (not control, as is spelled out in the paper), and it’s really unclear if this represents a breakthrough significant enough for me to feel comfortable with such a name.
I feel like a whole lot of what I see from the Center for AI Safety does this. “Humanity’s Last Exam”? “Superhuman Forecasting”?
I assume that CAIS thinks that its work is all pretty groundbreaking and incredibly significant, but I’d kindly encourage names that many other AI safety community members would also broadly agree with going forward.
Submissions end soon (this Sunday)! If there aren’t many, then this can be an easy $300 for someone.
It’s arguably difficult to prove that AIs can be as good or better at moral reasoning than humans.
A lot of the challenge is that there’s no clear standard for moral reasoning. Honestly, I’d guess that a big part of this is that humans are generally quite bad at it, and generally highly overconfident in their own moral intuitions.
But one clearer measure is whether AIs can predict humans’ moral judgements. Very arguably, if an AI system can predict all the moral beliefs that a human would have after being exposed to different information, then the AI must be capable of doing as good a job at moral reasoning.
There is a very different question of whether we want AIs not only to be able to do moral reasoning as well as humans, but also to care about such reasoning. But this is a separate challenge and could be tackled accordingly.
My quick guess is that it would be pretty easy to predict the moral intuitions of many people, with the AI of the next few years or so.
I’d expect it to do well in a setting like a test in which many strange/unusual moral scenarios are described, and humans (of different educational levels and worldviews) then need to make judgements.
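To make that concrete, here is a minimal sketch of how such a test could be scored (the vignettes, ratings, and predictions here are all hypothetical): humans rate unusual moral scenarios, the AI predicts those ratings, and we measure the prediction error.

```python
# Minimal sketch of scoring an AI on predicting human moral judgements.
# All vignettes, ratings, and predictions below are hypothetical.
from statistics import mean

# Human ratings per vignette (1 = clearly wrong, 7 = clearly fine)
human_ratings = {
    "vignette_1": [2, 1, 3, 2],
    "vignette_2": [6, 7, 5, 6],
}

# The AI's predictions of the average human rating for each vignette
ai_predictions = {"vignette_1": 2.5, "vignette_2": 5.5}

def mean_abs_error(humans: dict, predictions: dict) -> float:
    """Average gap between predicted and actual mean human ratings."""
    return mean(abs(predictions[v] - mean(r)) for v, r in humans.items())

print(mean_abs_error(human_ratings, ai_predictions))  # lower = better prediction
```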
Develop AIs which are very dumb within a forward pass, but which are very good at using natural language reasoning such that they are competitive with our current systems. Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language (if we monitor their chains of thought). After ruling out scheming, solve other problems which seem notably easier.
Pursue a very different AI design which is much more modular and more hand constructed (as in, more GOFAI style). This can involve usage of many small and dumb neural components, but needs to be sufficiently interpretable in aggregate which might be hard. This can be done by having the AIs apply huge amounts of labor.
These are two of the main ideas I’m excited about. I’d quickly flag:
1) For the first one, “Demonstrate that these AIs are very unlikely to be scheming due to insufficient capacity outside of natural language” → I imagine that in complex architectures, these AIs would also be unlikely to scheme because of other limitations. There are several LLM calls made within parts of a complex composite system, and each LLM call has very tight information and capability restrictions. Also, we might ensure that any motivation is optimized for the specific request, instead of the LLM aiming to optimize what the entire system does.
2) On the second, I expect that some of this will be pretty natural. Basically, “LLMs writing code” is already happening, and it seems easy to have creative combinations of LLM agents that write code that they know will be useful for their own reasoning later on. In theory, any function that could run either via an LLM or via interpretable code should be run via interpretable code. As LLMs get very smart, they might find cleverer ways to write interpretable code that would cover a lot of what LLMs get used for. Over time, composite architectures would rely more and more on this code for reasoning processes. (Even better would be interpretable and provably correct code.)
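As a rough illustration of that routing idea (everything here is hypothetical, including the `call_llm` placeholder): whenever a reasoning step already has an interpretable implementation, the composite system runs that code instead of making an opaque LLM call.

```python
# Minimal sketch of routing reasoning steps to interpretable code when it
# exists, falling back to an LLM otherwise. All names here are hypothetical.
from typing import Callable, Dict

# Registry of steps that have already been replaced by interpretable code
INTERPRETABLE_STEPS: Dict[str, Callable[[str], str]] = {
    "sum_numbers": lambda text: str(sum(int(x) for x in text.split())),
}

def call_llm(prompt: str) -> str:
    """Placeholder standing in for whatever LLM API the composite system uses."""
    raise NotImplementedError("hypothetical LLM backend")

def run_step(step_name: str, payload: str) -> str:
    step = INTERPRETABLE_STEPS.get(step_name)
    if step is not None:
        return step(payload)                    # auditable, deterministic path
    return call_llm(f"{step_name}: {payload}")  # opaque fallback

print(run_step("sum_numbers", "3 5 7"))  # -> "15", handled by interpretable code
```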
This might be obvious, but I don’t think we have evidence to support the idea that there really is anything like a concrete plan. All of the statements I’ve seen from Sam on this issue so far are incredibly basic and hand-wavy.
I suspect that any concrete plan would be fairly controversial, so it’s easiest to speak in generalities. And I doubt there’s anything like an internal team with some great secret macrostrategy—instead I assume that they haven’t felt pressured to think through it much.
Correct, that wasn’t my intended point. Thanks for clarifying, I’ll try to be more careful in the future.
I partially agree, but I think this must only be a small part of the issue.
- I think there’s a whole lot of key insights people could raise that aren’t info-hazards.
- If secrecy were the main factor, I’d hope that there would be some access-controlled message boards or similar. I’d want the discussion to be intentionally happening somewhere. Right now I don’t really think that’s happening. I think a lot of tiny groups have their own personal ideas, but there’s surprisingly little systematic and private thinking between the power players.
- I think that secrecy is often an excuse not to open ideas to feedback, and thus not be open to critique. Often, from what I see, this goes hand-in-hand with “our work just really isn’t that great, but we don’t want to admit it”.
In the last 8 years or so, I’ve kept on hoping there would be some secret and brilliant “master plan” around EA, explaining the lack of public strategy. I have yet to find one. The closest I know of is some over-time discussion and Slack threads with people at Constellation and similar—I think these are interesting in terms of understanding the perspectives of these (powerful) people, but I don’t get the impression that there’s much comprehensive genius being hidden.
That said,
- I think that policy orgs need to be very secretive, so agree with you regarding why those orgs don’t write more big-picture things.
This is an orthogonal question. I agree that if we’re there now, my claim is much less true.
I’d place fairly little probability mass on this (<10%) and believe much of the rest of the community does as well, though I realize there is a subset of the LessWrong-adjacent community that does.
I’m not sure if it means much, but I’d be very happy if AI safety could get another $50B from smart donors today.
I’d flag that [stopping AI development] would cost far more than $50B. I’d expect that we could easily lose $3T of economic value in the next few years if AI progress seriously stopped.
I guess it seems to me like duration is dramatically more expensive to get than funding, at least for the amounts of funding people would likely want.
Thanks for the specificity!
> On harder-to-operationally-define dimensions (sense of hope and agency for the 25th through 75th percentile of culturally normal people), it’s quite a bit worse.
I think it’s likely that many people are panicking and losing hope each year. There’s a lot of grim media around.
I’m far less sold that something like “civilizational agency” is declining. From what I can tell, companies have gotten dramatically better at achieving their intended ends in the last 30 years, and most governments have generally been improving in effectiveness.
One challenge I’d have for you / others who feel similar to you, is to try to get more concrete on measures like this, and then to show that they have been declining.
My personal guess is that a bunch of people are incredibly anxious over the state of the world, largely for reasons of media attention, and then this spills over into them assuming major global ramifications without many concrete details or empirical forecasts.
In terms of proposing and discussing AI Alignment strategies, I feel like a few individuals have been dominating the LessWrong conversation recently.
I’ve seen a whole lot from John Wentworth and the Redwood team.
After that, it seems to get messier.
There are several individuals or small groups with their own very unique takes. Matthew Barnett, Davidad, Jesse Hoogland, etc. I think these groups often have very singular visions that they work on, which few others have much buy-in on.
Groups like the DeepMind and Anthropic safety teams seem hesitant to write much about or discuss big-picture strategy. My impression is that specific researchers are typically working on fairly narrow agendas, and that the leaders of these orgs don’t have the most coherent strategies. One big problem is that it’s very difficult to be honest and interesting about big-picture AI strategy without saying things that would be bad for a major organization to say.
Most policy people seem focused on policy details. The funders (OP?) seem pretty quiet.
I think there are occasionally some neat papers or posts that come from AI policy people or groups like Convergence Research. But these also don’t seem to be a big part of the conversation I see—like the authors are pretty segmented, and other LessWrong readers and AI safety people don’t pay much attention to their work.
Here are some important-seeming properties to illustrate what I mean:
Robustness of value-alignment: Modern LLMs can display a relatively high degree of competence when explicitly reasoning about human morality. In order for it to matter for RSI, however, those concepts need to also appropriately come into play when reasoning about seemingly unrelated things, such as programming. The continued ease of jailbreaking AIs serves to illustrate this property failing (although solving jailbreaking would not necessarily get at the whole property I am pointing at).
Propagation of beliefs: When the AI knows something, it should know it in a way which integrates well with everything else it knows, rather than easily displaying the knowledge in one context while seeming to forget it in another.
Preference for reasons over rationalizations: An AI should be ready and eager to correct its mistakes, rather than rationalizing its wrong answers. It should be truth-seeking, following thoughts where they lead instead of planning ahead to justify specific answers. It should prefer valid proof steps over arriving at an answer when the two conflict.
Knowing the limits of its knowledge: Metacognitive awareness of what it knows and what it doesn’t know, appropriately brought to bear in specific situations. The current AI paradigm just has one big text-completion probability distribution, so there’s not a natural way for it to distinguish between uncertainty about the underlying facts and uncertainty about what to say next—hence we get hallucinations.
All of this is more-or-less a version of the metaphilosophy research agenda, framed in terms of current events in AI.
I very much like the concreteness here.
I consider these sorts of things just fundamental epistemic problems, or basic skills that good researchers should have. All superforecasters should be very familiar with issues 2-4, and most probably couldn’t define metaphilosophy. I don’t see the need to be fancy about it.
On that note, I’ll hypothesize that if we were to make benchmarks for any of these items, it would be fairly doable to make AIs that do better than humans on them, and then later achieve higher and higher scores. I have a hard time imagining tests here that I would feel confident would not get beaten in the next year or two, if there were sufficient money on the line.
I think that Slop could be a social problem (i.e. there are some communities that can’t tell slop from better content), but I’m having a harder time imagining it being a technical problem.
I have a hard time imagining a type of Slop that isn’t low in information. All the kinds of Slop I’m familiar with are basically “small variations on some ideas, which hold very little informational value.” It seems like models like o1 / r1 are trained by finding ways to make information-dense AI-generated data. I expect that trend to continue. If AIs for some reason experience some “slop threshold”, I don’t see how they get much further by using generated data.
Thanks for letting me know.
I spent a while writing the piece, then used an LLM to edit the sections, as I flagged in the intro.
I then spent some time re-editing it back to more of my voice, but only did so for some key parts.
I think that overall this made it more readable and I consider the sections to be fairly clear. But I agree that it does pattern-match on LLM outputs, so if you have a prior that work that sounds kind of like that is bad, you might skip this.
I obviously find that fairly frustrating and don’t myself use that strategy that much, but I could understand it.
I assume that, bigger-picture, authors and readers could both benefit a lot from LLMs used in similar ways (they can produce cleaner writing with less effort), but I guess right now we’re at an awkward point.