They haven’t proven any theorems that anyone cares about. They haven’t written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3].
An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn’t get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so humdrum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea by learning a good-enough causal model that generalizes not-very-closely-related chemistry ideas in its training set.
Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.
I think non-reasoning models such as 4o and Claude are better understood as doing induction with a “circuit prior”, which is going to be significantly different from the Solomonoff prior (longer-running programs require larger circuits, which get penalized).
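A rough way to formalize the contrast (my notation; the specific form of the circuit prior is an illustrative assumption, not a standard definition):

$$M(x)\;=\sum_{p\,:\,U(p)\,=\,x*} 2^{-\ell(p)} \qquad\text{vs.}\qquad P_{\text{circ}}(x)\;\propto\sum_{C\,:\,C\ \text{outputs}\ x} 2^{-\,\mathrm{size}(C)},$$

where $U$ is a universal machine, $\ell(p)$ is the length of program $p$, and $\mathrm{size}(C)$ is the number of gates in circuit $C$. A program that runs for $T$ steps unrolls into a circuit of size roughly $O(T \log T)$, so a short but long-running program is cheap under $M$ and expensive under $P_{\text{circ}}$.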
Reasoning models such as o1 and r1 are in some sense Turing-complete, and so much more akin to Solomonoff induction. Of course, the RL used in such models does not train on the prediction task the way Solomonoff induction does.
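To make “not training on the prediction task” concrete (schematic objectives only, not a description of any lab’s actual recipe): Solomonoff induction predicts the next symbol from its posterior mixture over programs, while RL on reasoning traces maximizes an expected task reward over sampled chains of thought,

$$M(x_{n+1}\mid x_{1:n})\;=\;\frac{M(x_{1:n}x_{n+1})}{M(x_{1:n})} \qquad\text{vs.}\qquad J(\theta)\;=\;\mathbb{E}_{c\,\sim\,\pi_\theta(\cdot\mid q)}\big[R(q,c)\big],$$

where $q$ is a prompt, $c$ is a sampled chain of thought plus answer, and $R$ is a reward such as a verifier score. Nothing in $J$ forces its optimum to coincide with good sequence prediction, which is the sense in which the analogy to Solomonoff induction is loose.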