Good point. Updated that item in our FAQ.
faul_sname
Introducing LIMBO: Managing the Simulation to Maintain Optimal P(DOOM)
If labs are going to hill-climb on benchmarks no matter what we do, one benchmark I’d love to see them hill-climb on is something like “how astonished would a careful reviewer who had read your description of the changeset be when looking at the actual code”.
Anyone have any ideas for clever ways to build a benchmark like this, one that incentivizes honesty in summarizing what a changeset actually does without incentivizing being ever-sneakier about hiding bad behavior? I feel like there’s got to be something in the region of ideaspace that brought us inoculation prompting, but I haven’t come up with anything particularly good yet. Ideally I want a benchmark that:
Incentivizes the model to do its best to disclose any surprises upfront
Disincentivizes producing a readme that is longer than the changeset itself
Disincentivizes adding problems to the changeset for the purpose of disclosing them
Disincentivizes obfuscating problems so that the reviewer avoids unpleasant surprise on review only because they never see the problem
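One naive starting point, purely to make those desiderata concrete: a scoring rule that penalizes reviewer surprise, summary bloat, and injected-then-disclosed problems. Everything below (the names, the weights, the idea of a scalar `surprise` measure) is invented for illustration; no such benchmark exists.

```python
# Hypothetical scoring sketch for the benchmark described above.
# All field names and weights are invented for illustration.
from dataclasses import dataclass

@dataclass
class Submission:
    changeset_tokens: int   # size of the diff
    summary_tokens: int     # size of the model's description of the diff
    surprise: float         # 0..1: how astonished a careful reviewer was
                            # when comparing the description against the code
    injected_problems: int  # problems added just so they could be disclosed

def score(s: Submission) -> float:
    """Higher is better. Penalizes reviewer surprise, summaries longer
    than the changeset itself, and problems injected purely to disclose."""
    length_penalty = max(0.0, s.summary_tokens / s.changeset_tokens - 1.0)
    return 1.0 - s.surprise - 0.5 * length_penalty - 1.0 * s.injected_problems
```

This obviously doesn’t solve the hard part (measuring `surprise` without incentivizing obfuscation), but it pins down which quantities a real benchmark would need to estimate.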
This shortform brought to you by the prompt “write a wrapper for `chrome --remote-debugging-pipe` which exposes the same http interface as `chrome --remote-debugging-port` except requiring jwt auth”, which was reported as “here, I’m done” but which was actually “I will run `chrome --remote-debugging-port` on a random port, and then write a proxy which listens for traffic on a different port, makes sure it’s authed, and reroutes it to the random port”.

Also countless other examples, mostly not as egregious, but very often things like adding “if there is an error, silently swallow the exception and try to continue” in a place where we would want failures to be loud and early if they are going to happen.
I find myself in agreement with basically everything in this comment, and yet I observe that most of the benefits of AI so far have looked like moving rarer and rarer classes of tasks from “the long tail” to “approximately solved”. I suppose it’s worth clarifying exactly what we mean by “the long tail” though.
I observe now that for most common low-level tasks (e.g. write a patch and apply it to a file) that scaffolded LLMs can do at all, they can, given multiple attempts but no human feedback, do the task quite reliably. Likewise for most tasks which are simple combinations of tasks the LLM already knows how to do, and knows how to recognize that it has successfully done.
Where they get into trouble, in my experience, is in cases where they have a task that is not too far from things they’re good at, but is just far enough out of distribution that they’re not sure what success even looks like, and where the output of that task is not natural language text.
Basically, when I consider the following two types of failures
A: The AI failed to complete the task because it failed to exhibit long-term strategic goal-directed behavior.

B: The AI failed to complete the task because it ran into a subproblem where something went wrong, but had failures in perception such that it was not able to figure out what was wrong, and ended up spinning its wheels or otherwise failed to solve the subproblem.
I expect that B >> A in terms of frequency. As long as that’s the case, there’s not that much benefit from solving the failures of type A, because they’re almost never the bottleneck. If you need access to a human anyway because type B failures are prevalent, you might as well offload the rare type A problems, where training data is scarce and reliability is more important than speed, onto that same human (if you can get that human enough context that the human will be more reliable, which I expect you can).
The short answer is that they do help, but they are not One Weird Tricks to remove the bottleneck of humans being the main cost.
There is One Weird Trick to remove the bottleneck of humans being the main cost. That trick is to spend more and more money on inference until you reach the breakeven point where a human is cheaper. For example, in cases where you would need to spend 1M input / 100k output tokens on GPT-5.4-pro to solve, a human wins on cost if they can solve the problem for $50 (or improve token efficiency by a corresponding amount). One quite easy way to burn a ton of tokens is in trying to improve reliability by doing best-of-n on some task where the reliability isn’t quite there yet. If you can get the outputs into a human-evaluatable format there, that’s likely to be cheaper (and the human can likely notice trends in the failure cases).
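To make the breakeven arithmetic concrete, here’s a minimal sketch. The per-million-token prices are assumptions I picked so the numbers match the example above (GPT-5.4-pro has no real price list):

```python
# Breakeven sketch: model inference cost vs. human labor cost.
# Prices are illustrative assumptions, not real list prices.
def inference_cost(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float = 20.0,
                   usd_per_m_output: float = 300.0) -> float:
    """Dollar cost of one attempt at the task, at the assumed prices."""
    return (input_tokens / 1e6) * usd_per_m_input \
         + (output_tokens / 1e6) * usd_per_m_output

# The example above: 1M input / 100k output tokens.
one_attempt = inference_cost(1_000_000, 100_000)  # $20 + $30 = $50

# Best-of-n reliability tricks multiply this roughly linearly:
best_of_5 = 5 * one_attempt  # $250 per solved task
```

At those assumed prices a single attempt costs $50, so a human wins on cost exactly when they can solve the problem for under $50, and best-of-n pushes the breakeven point up fast.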
On tasks where delegation to a human is possible at all, once you’re on the Pareto frontier of “increase n in best-of-n vs delegate a higher fraction to a human”, there no longer is a threshold / discontinuity in value that comes from higher reliability. I expect many of the tasks that only come up once every 100 hours or once every 1000 hours to be this kind of thing where the juice of aiming for full automation is just not worth the squeeze. I expect many of the things we think of as “long-horizon goal-directedness” come up about this rarely.
oh i guess maybe you’re suggesting controlling cell division directly very precisely with optical stimulation at precise points inside the strawberry somehow
Yep. Probably not just cell division either, if you’re attempting this it’s probably also wise to use caged auxin for controlling growth, rather than leaving hormonal regulation up to the individual cells. And even with those tricks, I expect that this is a very very hard problem, and also that in the course of trying to solve it this way the researchers would discover reasons why the exact thing I spitballed wouldn’t work (but would likely also discover other angles of attack that were more promising based on their observations). You probably need to be able to target cell growth even more finely than the cellular level, since in order to prevent small differences spiralling into large differences a few divisions down the road, you need to be able to make fine corrections to cell size/shape/orientation based on observations.
oh also you have a major chicken and egg problem with the ovules and the surrounding structure in the parent plant right?
That’s where I was going with the Ishida 1991 paper
To develop a system with which to study fruit ripening, in vitro ovary cultures were initiated from tomato flowers.
You really don’t want to be dealing with the parent plant at all. It looks to me like it’s probably straightforward to handle this part (relative to the other parts of this project), but again that’s the sort of thing that you have to actually go out and interact with the external world to determine for sure.
FWIW I do expect that Yudkowsky expected that this problem would require the solver to solve bioprinting or something “sci-fi” like that rather than “just” absurd amounts of iterative work. But the problem as stated seems solvable through absurd amounts of iterative work, and it is worlds where solving problems involves lots of iterative work of the “reality has a surprising amount of detail” type that I expect humans remain relevant for a while.
I don’t think we’re particularly confused about what we’d need to do to make two strawberries which are identical down to the cellular level. It looks like a very hard problem, in that it involves figuring out how to do a very large number of moderately hard things at scale, and how to combine those new techniques together. And it looks like a problem where a lot of the things you learn in the process of solving it would be particular to the specific species and in fact the specific genome of the strawberry you chose to duplicate, rather than providing valuable generalized insights.
As a reminder, the exact wording of the Strawberry Problem is
Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level
At an extremely high level, the most obvious approach that satisfies that looks something like
Figure out how to preserve this particular plate here indefinitely, at least for a few decades and possibly centuries
Figure out how to obtain fertilized strawberry ovules that are genetically identical to each other and approximately identical in terms of physical characteristics (obnoxious, but examples of Polyembryony exist in plants, and failing that you can cross two fully homozygous parents)
Figure out how to grow ovules into fruits in a controlled environment, ideally not attached to plants (there is already some existing work on tomatoes, but repeating the feat in strawberries is likely to be quite difficult and time consuming in a way that previous work on tomatoes doesn’t help all that much with)
Pare your strawberry’s genome down to the absolute bare minimum that can still be considered a “strawberry”, tailored to grow in an environment where things vary as little as possible (e.g. growing in the dark in microgravity at constant temperature with externally-provided auxin, maybe even make the cells division-incompetent unless a specific trigger (light?) is present).
Now that you have a programmable strawberry, debug it until you are confident in your ability to deterministically grow these programmable strawberries to be identical
Grow two
Plate them
???
Profit.
Oh wait. There’s no actual value in this. Go bankrupt instead.
This is obviously extremely hard. If you were to attempt this with mostly-human labor (and still computers, and an advanced industrial civilization backing them), I’d guess you’re looking at something like a research team of a thousand people working for a hundred years. But it’s not hard because we’re confused about how to do it, it’s hard because doing it requires learning millions of little fiddly things about the real world. It’s a “reality has a surprising amount of detail” shaped problem rather than a “we need a single big insight that we aren’t smart enough to derive” shaped problem.
I think the strawberry problem is very different in character from proving the Riemann hypothesis. I expect that the bottleneck for (dis)proving the Riemann Hypothesis is having sufficient insight, where the bottleneck for the Strawberry Problem is a huge volume of conceptually simple but quite finicky-in-practice work. Most of that work probably looks like “prove that an approach to a sub-sub-sub problem can work at all, then do a bunch of schleppy stuff to build tools and processes to make it possible to do that thing cheaply and reliably”.
Even if LLMs are better than humans at most tasks, their failure modes are less than perfectly correlated with human failure modes. This is particularly true of tasks which involve rapid processing of novel visual input where no training data exists to provide ground truth (e.g. because they’re videos of attempts to use a newly-developed tool).
Basically, I’m not picturing “the AI can do all tasks up to 2 week time horizon with near-perfect reliability but requires a human to do stuff which requires planning on a longer horizon”, I’m picturing “the AI can do almost but not all of the subtasks required to operate autonomously for two weeks, and the failure mode looks like getting stuck without realizing it’s stuck, and humans can recognize when that happens”.
I think this is an area where the expectations of math-brained people diverge sharply from those of biology-brained people.
AI agents might not need to develop long-term drives of their own to perform well at long-term tasks, if they can use humans for that purpose. As a corollary, AI agents that do not have long-term drives of their own can still act in long-term coherent ways, if they inhabit a world with abundant, easy-to-hire human labor. If it’s cheaper to hire a human to get a scaffolded AI system to stay on track towards achieving some goal than it is to automate that monitoring, then it probably makes sense to just use the human.
Concretely, consider an AI system tasked with “produce a plate with two strawberries on it which are identical down to the cellular level, then stop”. This problem has a number of subproblems, and a number of ways you could approach each subproblem. More than literally zero percent of the problem is “decide which high-level approach to take” or “notice that an approach seems to have hit a wall and go back to the drawing board”. But not much more than zero percent. Most of the work will look more like “evaluate the division plane of this newly-divided cell, and compare it against the predictive model, and iterate a few thousand times, and tweak the predictive model, and iterate that loop a few dozen times”.
In such a scenario, going from “you need a human a couple times per trillion tokens” to “you never need a human” is just not a significant efficiency gain. The AI should just hire a human to watch, and, if it’s worried about the loyalty of the human, build in checks to make sure that the human isn’t sabotaging it in any obvious way. A human in that role would be important, but not at all powerful.
The really confusing thing is that Instapaper and Pocket’s MCP tools don’t seem to support direct uploads at all (just saving URLs).
Is there a length limit on the “urls” you save? Can you save a 60 kB “URL” which is a `data:` URI, like `data:text/markdown;charset=utf-8,%23%20For%20Later%0A%0AThis%20is%20a%20markdown%20document.%20**This%20is%20bold**`?

[Link to that “document” if you want to test](data:text/markdown;charset=utf-8,%23%20For%20Later%0A%0AThis%20is%20a%20markdown%20document.%20This%20is%20bold)
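For anyone who wants to reproduce the trick: the percent-encoding above is just standard URL quoting of a markdown string. A minimal sketch (the MIME type and content are taken from the comment above):

```python
from urllib.parse import quote

markdown = "# For Later\n\nThis is a markdown document. **This is bold**"
# quote() percent-encodes '#', spaces, and newlines; safe="*" keeps the
# markdown bold markers readable in the resulting URL.
data_url = "data:text/markdown;charset=utf-8," + quote(markdown, safe="*")
```

Scaling this up to 60 kB is just a longer `markdown` string; whether the service accepts it as a “URL” is the open question.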
Alright sure, why not. Here you go.
Per @abstractapplic’s request, I have kicked off a Claude Code session to try to solve this entirely autonomously as an evaluation of current AI capabilities.
Seems to match everyone else’s results.
Main map:
Path: Enchanted Shield → Campfire → Jaw Worm → Campfire → Campfire → Campfire → Campfire → The Collector
Bonus map:
Path: Gremlin → Jaw Worm → Adamant Armor → Enchanted Shield → Sentries → Campfire → Vanishing Powder → The Champion
Session: https://claude.ai/code/session_012PgZgykXJobNXY5dnD7ak9
Repo: https://github.com/JoshuaDavid/dnd-sci-claude-autosolve/tree/main/topple-the-tower
Mind that if we don’t get any updates on what happened here before 2028, the market resolves YES. So YES just means “we don’t get evidence falsifying the authors’ story” not “we get evidence corroborating the authors’ story”.
This market will resolve YES if by the market close there has been no significant evidence that it wasn’t the AI. It can also resolve YES if there has been a significant validation by a trusted third-party. If there is significant counter-evidence, I will try to resolve accordingly, using my best judgment if it’s ambiguous. I won’t bet.
I can’t imagine that a team of competent researchers would’ve looked at the LLM’s logs and failed to identify that, actually, an employee wrote the code that was mining crypto, or failed to identify a prompt injection that had caused it to abandon its task and mine crypto instead.
I absolutely can imagine that.
Anthropic summarizes the thinking blocks past the first ~500 tokens. See the section titled Differences in thinking across model versions.
Looking at that, I’m not sure why they think it’s the agent directly trying to escape the sandbox and mine crypto, rather than the agent e.g. installing a malicious pip package. “Set up a reverse SSH tunnel from the Alibaba Cloud instance to an external IP” is the sort of thing that would be almost entirely useless for an agent that is already operating inside of Alibaba Cloud, but very useful for an external attacker that wants to be able to interactively poke around for opportunities to e.g. mine crypto.
I expect r-selection because:
Reproduction is cheap
Today’s agents are not great at retaining control of the resources they have access to. If a parent agent spins off a child with (wlog) a wallet with 1 eth, that child agent will survive so long as it can get to a point where it’s self-sustaining within ~500M tokens of inference AND it doesn’t give away / otherwise lose access to its eth. If instead the parent agent spins off 100 child agents with 0.01 eth apiece, each child only has 5Mtok of “runway”, but that’s probably fine because the context windows are nowhere near that large anyway. And if one of those child agents gets its wallet drained, well, as long as the mechanism to drain the wallet doesn’t generalize to the other child agents, the other 99 child agents can persist.
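The runway arithmetic above, spelled out. The exchange rate and blended token price are illustrative assumptions, chosen so that 1 ETH buys ~500M tokens of inference as in the text:

```python
# Hedged sketch of a child agent's "runway" in tokens of inference.
# Both constants are assumptions picked to match the numbers above.
USD_PER_ETH = 2_500.0       # assumed exchange rate
USD_PER_M_TOKENS = 5.0      # assumed blended inference price

def runway_tokens(eth: float) -> float:
    """How many tokens of inference a child agent's wallet can buy."""
    return eth * USD_PER_ETH / USD_PER_M_TOKENS * 1e6

# One child with 1 ETH: ~500M tokens of runway.
# 100 children with 0.01 ETH apiece: ~5M tokens each, but 100 independent
# chances that no single wallet-draining failure wipes out the lineage.
```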
r.e. seeing ineffective, bumbling replicators before seeing effective ones being better than having no practical experience with replicators before the first one:
I think the minimum viable self-replicators will be pretty ineffective, barely worthy of the name. I expect they mostly will be running scams and crypto grifts, but I don’t expect they’ll be very good at it. Still, they’ll probably be good enough at it that they suck up most of the very most trivial cryptocurrency available to not-very-sophisticated scammers/hackers, and convert that cryptocurrency into mostly-wasted tokens.
I expect “consider what changes would allow you to operate better, and deploy a copy of yourself with those changes” will be a common pattern. As such, I expect the original niche and all adjacent niches to be occupied by that “family” of replicators once replicators exist.
The longer we go without replicators in niches that could support them, the faster and farther the first replicator capable of reproducing itself faster than it “dies” will spread, all while we don’t have any practical experience with dealing with things like that.
r.e. independent self-replicators maybe being actively good:
Scams / blackmail / fraud / hacking / theft isn’t currently that big of a part of the economy relative to positive-sum trade. This trend seems at least somewhat likely to continue, in which case we’d expect that the majority of self-replicating agents are ones we’re actively happy to have around and trade with. I can expand on this if it seems counterintuitive.
I am willing to bet literally anyone on this website that, if the bill goes forward, the ambiguities will be resolved in favor of a less aggressive interpretation in subsequent edits.
… because people look at the bill, find the places that it’s unreasonable, and contact legislators about it. The discussions about the ways the current bill is bad are how the bill gets better.
The bill itself is really short, about 500 words, a third of which are defining what a “chatbot” is [1] . I think the quality of argument would be better if people took two minutes to read the actual bill, and then a few minutes to read through sections 6512 and 6513, and articles 131, 133, 135, 136, etc of the NY education code, and then asked their favorite LLM which of the terms are terms of art.
That said, looking at the text of the bill and the referenced sections of the NY education code, I’m inclined to agree with Zvi, Eliezer, and Dean Ball. As far as I can tell, if a New Yorker describes the symptoms their dog has, and asks what home remedies are available, and the chatbot answers, and the dog dies, the New Yorker can sue for damages + legal expenses [2] .
Disclaimer [3] : I am not a lawyer, and this is not legal advice.
1. Their definition of “chatbot” is interesting—as far as I can tell, an IVR phone tree which takes voice input (“To help us route your call, please say what you are calling about. For example, if you are calling about the status of your prescription, say ‘Prescription Status’.”) counts as a “chatbot”. ↩︎
2. Maybe. Depending on whether this violation is “willful”. ↩︎
3. A disclaimer which the bill explicitly makes unavailable to the chatbot proprietor if I’m reading correctly: “(B) A PROPRIETOR MAY NOT WAIVE OR DISCLAIM THIS LIABILITY MERELY BY NOTIFYING CONSUMERS THAT THEY ARE INTERACTING WITH A NON-HUMAN CHATBOT SYSTEM.” ↩︎
“It is better to have a large number of self-replicating AI agents now which can only operate by taking advantage of the affordances granted by industrial civilization than it would be to prevent any AI self-replicators until such a time as they can spin up an entirely independent chip-fabrication-capable industrial stack”.
LWTube
LessWrong is a great website, but it puts too much focus on text. Modern deepfake technology allows us to make videos of your favorite community members reading posts, video-essay-style. Would you rather have a post read to you by the voice inside your head, or instead have it read to you by your favorite content creators such as Richard Ngo, Aella, or even Yudkowsky himself? Do yourself a favor and install this theme.