How much longer did your timelines get?
Noosphere89
Of course, there are rumors and claims that deep learning is hitting a wall, which seems important for your timelines if true.
I have become convinced that nanotech computers are likely far weaker and considerably more impractical than Drexler thought, and I have also raised my probability that Drexler is simply wrong about the impact of nanotech, which, if true, suggests that the value of the future may have been overestimated.
The reason I'm stating this now is that I got a link on Discord arguing that nanotech computers are overrated, and the reason I consider this important is that, if the argument generalizes to other nanotech concepts, it suggests that a lot of the future's value may have been overestimated by overestimating nanotech's capabilities:
It's not surprising that a lot of people who believe in physicalism don't want to define physics, because properly explaining the equations that describe the physical world would take quite a long time, let alone describing what's actually going on in physics, and it would require at least a textbook to make this work.
I don't buy that o1 has actually given people expert-level bioweapons capability, so my actions here are more about preparing for future AI that is very competent at bioweapon building.
Also, even with the current level of jailbreak resistance/adversarial example resistance, and assuming no open-weights/open-source release of the AI, we can still make AIs that are practically hard for the general public to misuse.
See here for more:
The answer to this question is actually two things:
-
This is why I expect we will eventually have to fight to ban open-source AI, and we will have to build the political will to ban both open-source and open-weights AI.
-
This is where the unlearning field comes in. If we could make an AI unlearn specific knowledge, nuclear weapons being one example, we could possibly distribute AI safely without enabling novices to create dangerous things (a minimal sketch of what this could look like is below).
More here:
But the solutions are intentionally designed to make AI safe without relying on alignment.
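To make the unlearning idea concrete, here is a minimal, hypothetical sketch of one common approach, gradient ascent on a small "forget set" of hazardous text; the model name and forget set are placeholders of mine, and real unlearning methods are considerably more involved:

```python
# Toy sketch of unlearning via gradient ascent on a "forget set".
# Assumes torch and transformers are installed; "gpt2" is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical forget set: text whose content we want the model to stop reproducing.
forget_texts = ["placeholder hazardous text the model should unlearn"]

model.train()
for text in forget_texts:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])
    # Ascend rather than descend on the loss for the forget set,
    # pushing the weights away from reproducing this content.
    (-outputs.loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

Naive ascent like this tends to degrade general capabilities as well, which is part of why unlearning that holds up robustly is still an open problem.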
-
The threat model is plausible enough that some political action should be taken, like banning open-source/open-weight models and putting in basic Know Your Customer checks.
I'd say the big factor that makes AI controllable right now is that the compute necessary to build AI that can do very good AI research, automating R&D and then the economy, is locked behind TSMC, NVidia, and ASML, and their processes are both nearly irreplaceable and very expensive to reproduce, so it's far easier to intervene at the chokepoints AI development requires than on gain-of-function research.
Yeah, this theory definitely needs far better methodologies for testing it. I wouldn't be surprised if at least part of the answer to the Hard Problem, or problems, of Consciousness is that we have unnecessarily conflated, for political/moral reasons, various properties that occur in various humans under the word "consciousness", and since AIs don't automatically have all of those human properties, we should create new concepts for AIs. Even so, it's still methodologically bad.
But yes, this post at the very least relies on a theory that hasn't been tested, and while I suspect it's at least partially correct, the evidence in the conflationary alliances post is basically zero evidence for the proposition.
Nor do we have the ability to bend probabilities arbitrarily for arbitrary statements, which was a core power in the Gurren Lagann movies, if I recall correctly.
This part IMO is a crux, in that I don't truly believe an objective measure/magical reality fluid can exist in the multiverse, if we allow the concept to be sufficiently general, which ruins both probability theory and expected value/utility theory in the process.
Heck, in the most general cases, I don't believe any coherent measure exists at all, which wrecks probability theory and expected utility theory at the same time.
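As one toy illustration of the kind of obstruction I have in mind (my own example, not a full argument): if the multiverse contains countably infinitely many worlds that are all supposed to be equally real, no normalized uniform measure over them can exist:

$$\mu(\{w_n\}) = c \ \text{for all } n \in \mathbb{N} \;\Rightarrow\; \mu\left(\bigcup_{n} \{w_n\}\right) = \sum_{n=1}^{\infty} c \in \{0, \infty\} \neq 1.$$

So any coherent measure would have to privilege some worlds over others, and in sufficiently general multiverses it's unclear what could ground that choice.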
Maybe we have some deeper disagreement here. It feels plausible to me that there is a measure of “realness” in the Multiverse that is an objective fact about the world, and we might be able to figure it out.
The most important thing to realize about AI safety is that basically all versions of practically safe AI must make certain assumptions that no one does a specific action (mostly related to misuse, but for some specific plans this can also be related to misalignment).
Another way to say it is that I believe that, in practice, these two categories are the same category, such that basically all work that's useful in the field will require someone not to do something, so the costs of sharing are practically zero and the expected value of sharing insights is likely very large.
Specifically, I'm asserting that these two categories are actually one category for most purposes:
"Actually make AI safe" and the other, sadder but easier field of "Make AI safe as long as no one does the thing."
My most plausible explanation of your position is that you think this is aligned, i.e., you think providing plausible URLs is the best it can do when it doesn't know. I disagree: it can report that it doesn't know, rather than hallucinate answers. Actively encouraging itself to give plausible-but-inaccurate links is misalignment. It's not just missing a capability; it's also actively self-prompting to deceive the user in a case where it seems capable of knowing better (if it were deliberating with the user's best interest in mind).
I agree that this is actually a small sign of misalignment, and o1 should probably fail more visibly by admitting its ignorance rather than making stuff up, so at this point I've come to agree that o1's training probably induced at least a small amount of misalignment, which is bad news.
Also, thanks for passing my ITT here.
Capabilities issues and alignment issues are not mutually exclusive. It appears that a capabilities issue (not being able to remember all valid URLs) is causing an alignment issue in this case (actively planning to give deceptive URLs rather than admit its ignorance). This causes me to predict that even if this particular capability issue is resolved, the training will still have this overall tendency to turn remaining capability issues into alignment issues. How are you seeing this differently?
This is a somewhat plausible problem, and I suspect the general class of solutions will require something along the lines of making the AI fail visibly rather than invisibly.
The basic reason I wanted to distinguish them is that AI labs are strongly incentivized to solve capabilities issues, whereas the incentives to solve alignment issues, while real, can be less powerful than the general public wants.
What do you mean? I don’t get what you are saying is convincing.
I’m specifically referring to this answer, combined with a comment that convinced me that the o1 deception so far is plausibly just a capabilities issue:
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#L5WsfcTa59FHje5hu
https://www.lesswrong.com/posts/3Auq76LFtBA4Jp5M8/why-is-o1-so-deceptive#xzcKArvsCxfJY2Fyi
Is this damning in the sense of providing significant evidence that the technology behind o1 is dangerous? That is: does it provide reason to condemn scaling up the methodology behind o1? Does it give us significant reason to think that scaled-up o1 would create significant danger to public safety? This is trickier, but I say yes. The deceptive scheming could become much more capable as this technique is scaled up. I don’t think we have a strong enough understanding of why it was deceptive in the cases observed to rule out the development of more dangerous kinds of deception for similar reasons.
I think this is the crux.
To be clear, I am not saying that o1 rules out the ability of more capable models to deceive naturally, but I think one thing blunts the blow a lot here:
As I said above, the more likely explanation is that an asymmetry in capabilities is causing the results: just knowing which specific URL the user wants doesn't mean the model has the capability to retrieve a working URL, and this is probably at the heart of the behavior.
So for now, what I suspect is that o1's safety when scaled up mostly remains unknown and untested (but this is still a bit of bad news).
Is this damning in the sense that it shows OpenAI is dismissive of evidence of deception?
However, I don’t buy the distinction they draw in the o1 report about not finding instances of “purposefully trying to deceive the user for reasons other than satisfying the user request”. Providing fake URLs does not serve the purpose of satisfying the user request. We could argue all day about what it was “trying” to do, and whether it “understands” that fake URLs don’t satisfy the user. However, I maintain that it seems at least very plausible that o1 intelligently pursues a goal other than satisfying the user request; plausibly, “provide an answer that shallowly appears to satisfy the user request, even if you know the answer to be wrong” (probably due to the RL incentive).
I think the distinction is made to avoid confusing capability and alignment failures here.
I agree that it doesn’t satisfy the user’s request.
-
More importantly, OpenAI’s overall behavior does not show concern about this deceptive behavior. It seems like they are judging deception case-by-case, rather than treating it as something to steer hard against in aggregate. This seems bad.
Yeah, this is the biggest issue I have with OpenAI here, in that they aren't trying to steer very hard against deception.
The problem with that plan is that there are too many valid moral realities, so which one you get is once again a consequence of alignment efforts.
To be clear, I’m not stating that it’s hard to get the AI to value what we value, but it’s not so brain-dead easy that we can make the AI find moral reality and then all will be well.
Not always, but I’d say often.
I'd also say that at least some of the justification philosophers/humans give for changing their values is that they believe the new values are closer to the moral reality/truth, which is an instrumental incentive.
To be clear, I'm not going to state confidently that this will happen (maybe something like instruction following à la @Seth Herd is used instead, such that the pointer is to the human giving the instructions rather than to values), but this is at least reasonably plausible IMO.
Yes, I admittedly want to point to something along the lines of preserving your current values being a plausibly major drive of AIs.
In this case, it would mean convergence toward preserving your current values.
The answer to this is that we’d rely on instrumental convergence to help us out, combined with adding more data/creating error-correcting mechanisms to prevent value drift from being a problem.
That said, while the methodology isn't sound, I wouldn't be surprised if there was in fact a real conflationary alliance around the term, since the term is used in contexts where deciding whether someone (like an upload) is conscious or not has pretty big moral and political ramifications, so there are pressures for the word to be politicized and not truth-tracking.