AI safety & alignment researcher
eggsyntax
One story here could be that the model is being used only in an API context where it’s being asked to take actions on something well-modeled as a Markov process, where the past doesn’t matter (and we assume that the current state doesn’t incorporate relevant information about the past). There are certainly use cases that fit that (‘trigger invisible fence iff this image contains a dog’; ‘search for and delete this file’). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.
Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn’t apply to current LLMs but seems like a very plausible path for economically useful models like agents.
It seems like that assumption runs throughout the post though, eg ‘But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said’, ‘the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.’
I don’t just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can’t be trusted?
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 − 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me
Epoch’s analysis from June supports this view, and suggests it may even be a bit too conservative:
(and that’s just for text—there are also other significant sources of data for multimodal models, eg video)
the untrusted model is stateless between queries and only sees the command history and system state.
What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.
If you’re assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.
Great point! In the block world paper, they re-randomize the obfuscated version, change the prompt, etc (‘randomized mystery blocksworld’). They do see a 30% accuracy dip when doing that, but o1-preview’s performance is still 50x that of the best previous model (and > 200x that of GPT-4 and Sonnet-3.5). With ARC-AGI there’s no way to tell, though, since they don’t test o1-preview on the fully-private held-out set of problems.
The Ord piece is really intriguing, although I’m not sure I’m entirely convinced that it’s a useful framing.
Some of his examples (eg cosine-ish wave to ripple) rely on the fundamental symmetry between spatial dimensions, which wouldn’t apply to many kinds of hyperpolation.
The video frame construction seems more like extrapolation using an existing knowledge base about how frames evolve over time (eg how ducks move in the water).
Given an infinite number of possible additional dimensions, it’s not at all clear how a NN could choose a particular one to try to hyperpolate into.
It’s a fascinating idea, though, and one that’ll definitely stick with me as a possible framing. Thanks!
After some discussion elsewhere with @zeshen, I’m feeling a bit less comfortable with my last clause, building an internal model. I think of general reasoning as essentially a procedural ability, and model-building as a way of representing knowledge. In practice they seem likely to go hand-in-hand, but it seems in-principle possible that one could reason well, at least in some ways, without building and maintaining a domain model. For example, one could in theory perform a series of deductions using purely local reasoning at each step (although plausibly one might need a domain model in order to choose what steps to take?).
[EDIT: I originally gave an excessively long and detailed response to your predictions. That version is preserved (& commentable) here in case it’s of interest]
I applaud your willingness to give predictions! Some of them seem useful but others don’t differ from what the opposing view would predict. Specifically:
I think most people would agree that there are blind spots; LLMs have and will continue to have a different balance of strengths and weaknesses from humans. You seem to say that those blind spots will block capability gains in general; that seems unlikely to me (and it would shift me toward your view if it clearly happened) although I agree they could get in the way of certain specific capability gains.
The need for escalating compute seems like it’ll happen either way, so I don’t think this prediction provides evidence on your view vs the other.
Transformers not being the main cognitive component of scaffolded systems seems like a good prediction. I expect that to happen for some systems regardless, but I expect LLMs to be the cognitive core for most, until a substantially better architecture is found, and it will shift me a bit toward your view if that isn’t the case. I do think we’ll eventually see such an architectural breakthrough regardless of whether your view is correct, so I think that seeing a breakthrough won’t provide useful evidence.
‘LLM-centric systems can’t do novel ML research’ seems like a valuable prediction; if it proves true, that would shift me toward your view.
First of all, serious points for making predictions! And thanks for the thoughtful response.
Before I address specific points: I’ve been working on a research project that’s intended to help resolve the debate about LLMs and general reasoning. If you have a chance to take a look, I’d be very interested to hear whether you would find the results of the proposed experiment compelling; if not, then why not, and are there changes that could be made that would make it provide evidence you’d find more compelling?
Humans are eager to find meaning and tend to project their own thoughts onto external sources. We even go so far as to attribute consciousness and intelligence to inanimate objects, as seen in animistic traditions. In the case of LLMs this behaviour could lead to an overly optimistic extrapolation of capabilities from toy problems.
Absolutely! And then on top of that, it’s very easy to mistake using knowledge from the truly vast training data for actual reasoning.
But in 2024 the overhang has been all but consumed. Humans continue to produce more data, at an unprecedented rate, but still nowhere near enough to keep up with the demand.
This does seem like one possible outcome. That said, it seems more likely to me that continued algorithmic improvements will result in better sample efficiency (certainly humans need a far tinier amount of language examples to learn language), and multimodal data /synthetic data / self-play / simulated environments continue to improve. I suspect capabilities researchers would have made more progress on all those fronts, had it not been the case that up to now it was easy to throw more data at the models.
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 − 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me (although that would basically be the generation of GPT-5 and peer models; it seems likely to me that the generation past that will require progress on one or more of the fronts I named above).
Taking the globe representation as an example, it is unclear to me how much of the resulting globe (or atlas) is actually the result of choices the authors made. The decision to map distance vectors in two or three dimensions seems to change the resulting representation. So, to what extent are these representations embedded in the model itself versus originating from the author’s mind?
I think that’s a reasonable concern in the general case. But in cases like the ones mentioned, the authors are retrieving information (eg lat/long) using only linear probes. I don’t know how familiar you are with the math there, but if something can be retrieved with a linear probe, it means that the model is already going to some lengths to represent that information and make it easily accessible.
Interesting approach, thanks!
Why does the prediction confidence start at 0.5?
Just because predicting eg a 10% chance of X can instead be rephrased as predicting a 90% chance of not-X, so everything below 50% is redundant.
And how is the “actual accuracy” calculated?
It assumes that you predict every event with the same confidence (namely
prediction_confidence
) and then that you’re correct onactual_accuracy
of those. So for example if you predict 100 questions will resolve true, each with 100% confidence, and then 75 of them actually resolve true, you’ll get a Brier score of 0.25 (ie 3⁄4 of the way up the right-hand said of the graph).Of course typically people predict different events with different confidences—but since overall Brier score is the simple average of the Brier scores on individual events, that part’s reasonably intuitive.
But I also find my own understanding to be a bit confused and in need of better sources.
Mine too, for sure.
And agreed, Chollet’s points are really interesting. As much as I’m sometimes frustrated with him, I think that ARC-AGI and his willingness to (get someone to) stake substantial money on it has done a lot to clarify the discourse around LLM generality, and also makes it harder for people to move the goalposts and then claim they were never moved).
With respect to Chollet’s definition (the youtube link):
I agree with many of Chollet’s points, and the third and fourth items in my list are intended to get at those.
I do find Chollet a bit frustrating in some ways, because he seems somewhat inconsistent about what he’s saying. Sometimes he seems to be saying that LLMs are fundamentally incapable of handling real novelty, and we need something very new and different. Other times he seems to be saying it’s a matter of degree: that LLMs are doing the right things but are just sample-inefficient and don’t have a good way to incorporate new information. I imagine that he has a single coherent view internally and just isn’t expressing it as clearly as I’d like, although of course I can’t know.
I think part of the challenge around all of this is that (AFAIK but I would love to be corrected) we don’t have a good way to identify what’s in and out of distribution for models trained on such diverse data, and don’t have a clear understanding of what constitutes novelty in a problem.
Interesting question! Maybe it would look something like, ‘In my experience, the first answer to multiple-choice questions tends to be the correct one, so I’ll pick that’?
It does seem plausible on the face of it that the model couldn’t provide a faithful CoT on its fine-tuned behavior. But that’s my whole point: we can’t always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.
But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg ‘Looking Inward’), and I’m not confident that models couldn’t introspect on fine-tuned behavior.
I’ve now made two posts about LLMs and ‘general reasoning’, but used a fairly handwavy definition of that term. I don’t yet have a definition I feel fully good about, but my current take is something like:
The ability to do deduction, induction, and abduction
in a careful, step by step way, without many errors that a better reasoner could avoid,
including in new domains; and
the ability to use all of that to build a self-consistent internal model of the domain under consideration.
What am I missing? Where does this definition fall short?
Interesting, thanks, I’ll have to think about that argument. A couple of initial thoughts:
When we ask whether some CoT is faithful, we mean something like: “Does this CoT allow us to predict the LLM’s response more than if there weren’t a CoT?”
I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: ‘a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.’ I think ‘Language Models Don’t Always Say What They Think’ shows pretty clearly that that differs from your definition. In their experiment, even though actually the model has been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model’s choice than without the CoT, but it’s nonetheless clearly not faithful.
If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards
I haven’t spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don’t consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models’ ontologies).
It does seem totally plausible to me that o1′s CoT is pretty faithful! I’m just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’, where they find that models which behave in manipulative or deceptive ways act ‘as if they are always responding in the best interest of the users, even in hidden scratchpads’.
It’s not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence?), so I made this chart to help visualize it:
Here’s log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):
Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
Update: a recent new paper, ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’, described by the authors on LW in ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’, finds that post-RLHF, LLMs may identify users who are more susceptible to manipulation and behave differently with those users. This seems like a clear example of LLMs modeling users and also making use of that information.
I agree with 1, which is why the COT will absolutely have to be faithful.
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won’t really need the CoT in the first place).
I agree with 2, but conditional on relatively weak forward passes, and most of the bottleneck to reasoning being through the COT, there is little the model can do about the situation, short of exfiltrating itself...
I also disagree with 3, at least assuming relatively weak forward passes, and the bottleneck to reasoning being largely through COT
I don’t have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output in ways that provide its later forward passes with more than 0.0 bits of information which are useful to itself but which aren’t legible to monitors, and those can accumulate over the course of an extended context.
I see—I mistakenly read that as part of a broader policy of not showing the model its past actions. Thanks!