Anthropic’s new paper ‘Mapping the Mind of a Large Language Model’ is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
The paper (which I’m still reading; it’s not short) updates me somewhat toward ‘SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].’ As I read I’m trying to think through what I would have to see to be convinced of that hypothesis. I’m not an expert here! I’m posting my thoughts mostly to ask for feedback about where I’m wrong and/or what I’m missing. Remaining gaps I’ve thought of so far:
What’s lurking in the remaining reconstruction loss? Are there important missing features?
Will SAEs get all meaningful features given adequate dictionary size?
Are there important features which SAEs just won’t find because they’re not that sparse?
The paper points out that they haven’t rigorously investigated the sensitivity of the features, ie whether the feature reliably fires whenever relevant text/image is present; that seems like a small but meaningful gap.
Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems?
How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be ‘ability to predict model output given context + feature activations’?
Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs?
eg if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email
eg if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition
Do we find ways to make SAEs efficient enough to be scaled to production models with a sufficient number of features?
(as opposed to the paper under discussion, where ‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive’)
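For concreteness, here’s a rough sketch of the kind of feature-based steering I have in mind: adding (or subtracting) a feature’s decoder direction to the residual stream at the layer the SAE was trained on. The tensor names, the hook point, and the unit-normalization are my own illustrative assumptions, not Anthropic’s implementation:

```python
import torch

def steer_with_sae_feature(resid_acts: torch.Tensor,
                           sae_decoder: torch.Tensor,
                           feature_idx: int,
                           strength: float) -> torch.Tensor:
    """Nudge residual-stream activations along one SAE feature's decoder direction.

    resid_acts:  [batch, seq, d_model] activations at the hooked layer
    sae_decoder: [n_features, d_model] decoder matrix of a trained SAE
    strength:    positive to amplify the feature, negative to steer away from it
    """
    direction = sae_decoder[feature_idx]
    direction = direction / direction.norm()   # unit vector in activation space
    return resid_acts + strength * direction   # broadcasts over batch and sequence
```

The open questions above then become questions about whether the features such an intervention relies on are complete, sensitive, and safe to push on at the strength required.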
Of course LLM alignment isn’t necessarily sufficient on its own for safety, since eg scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I’m just thinking here about what I’d want to see to feel confident that we could use these techniques to do the LLM alignment portion.
I wrote up a short post with a summary of their results. It doesn’t really answer any of your questions. I do have thoughts on a couple, even though I’m not expert on interpretability.
But my main focus is on your footnote: is this going to help much with aligning “real” AGI (I’ve been looking for a term; maybe REAL stands for Reflective Entities with Agency and Learning?:). I’m of course primarily thinking of foundation models scaffolded to have goals and cognitive routines, and to incorporate multiple AI systems such as an episodic memory system. I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end—and we haven’t really thought through which is which yet.
is this going to help much with aligning “real” AGI
I think it’s an important foundation but insufficient on its own. I think if you have an LLM that, for example, is routinely deceptive, it’s going to be hard or impossible to build an aligned system on top of that. If you have an LLM that consistently behaves well and is understandable, it’s a great start toward broader aligned systems.
I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end
I think that at least as important as the ability to interpret here is the ability to steer. If, for example, you can cleanly (ie based on features that crisply capture the categories we care about) steer a model away from being deceptive even if we’re handing it goals and memories that would otherwise lead to deception, that seems like it at least has the potential to be a much safer system.
Note mostly to myself: I posted this also on the Open Source mech interp slack, and got useful comments from Aidan Stewart, Dan Braun, & Lee Sharkey. Summarizing their points:
Aidan: ‘are the SAE features for deception/sycophancy/etc more robust than other methods of probing for deception/sycophancy/etc’, and in general evaluating how SAEs behave under significant distributional shifts seems interesting?
Dan: I’m confident that pure steering based on plain SAE features will not be very safety relevant. This isn’t to say I don’t think it will be useful to explore right now, we need to know the limits of these methods...I think that [steering will not be fully reliable], for one or more of reasons 1-3 in your first msg.
Lee: Plain SAE won’t get all the important features, see recent work on e2e SAE. Also there is probably no such thing as ‘all the features’. I view it more as a continuum that we just put into discrete buckets for our convenience.
I haven’t seen these terms consistently defined with respect to LLMs. I’ve been using, and propose standardizing on:
Tooling: affordances for LLMs to make calls, eg ChatGPT plugins.
Scaffolding: an outer process that calls LLMs, where the bulk of the intelligence comes from the called LLMs, eg AutoGPT.
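To make the distinction concrete, here’s a toy sketch (all names are hypothetical; the stub stands in for a real LLM API):

```python
def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call."""
    return f"<model response to: {prompt[:40]}>"

# Tooling: an affordance the LLM itself can invoke mid-task, eg via function calling.
def web_search_tool(query: str) -> str:
    return f"<search results for: {query}>"

# Scaffolding: an outer process that owns the control flow and calls the LLM.
def scaffolded_agent(goal: str) -> str:
    plan = call_llm(f"List the steps needed to accomplish: {goal}")
    step_results = []
    for step in plan.splitlines():
        # The scaffolding decides when to call the LLM; the LLM (in a real
        # system) decides when to call tools like web_search_tool.
        step_results.append(call_llm(f"Carry out this step, using tools if needed: {step}"))
    return call_llm("Summarize what was accomplished:\n" + "\n".join(step_results))
```

In this framing the bulk of the intelligence lives in call_llm; scaffolded_agent just routes it, which is what makes it scaffolding rather than a tool.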
Some smaller details:
If the scaffolding itself becomes as sophisticated as the LLMs it calls (or more so), we should start focusing on the system as a whole rather than just describing it as a scaffolded LLM.
This terminology is relative to a particular LLM. In a complex system (eg a treelike system with code calling LLMs calling code calling LLMs calling...), some particular component can be tooling relative to the LLM above it, and scaffolding relative to the LLM below.
It’s reasonable to think of a system as scaffolded if the outermost layer is a scaffolding layer.
There are other possible categories that don’t fit this as neatly, eg LLMs calling each other as peers without a main outer process, but I expect these definitions to cover most real-world cases.
Thanks to @Andy Arditi for helping me nail down the distinction.
There will be some overlap where components are both tools and part of the scaffolding; for instance, a competent language model cognitive architecture would probably have a “tool” of an episodic memory it can write to and read from; and the next LLM call from the scaffolding portion would often be determined by the result of reading from that episodic memory “tool”, making it also part of the scaffolding. Similarly with sensory systems and probably many others; how the scaffolding invokes the LLM will depend on the results of calls to tools.
But it’s useful to have more distinct terminology even when it’s not perfect, so I’ll go ahead and use this.
I might think of some of your specific examples a bit differently, but yeah, I would say that a particular component can be tooling relative to the LLM above it, and scaffolding relative to the LLM below. I’ll add some clarification to the post, thanks!
This seems to be inspired by the library/framework distinction in software engineering:
Inversion of Control is a key part of what makes a framework different to a library. A library is essentially a set of functions that you can call, these days usually organized into classes. Each call does some work and returns control to the client.
A framework embodies some abstract design, with more behavior built in. In order to use it you need to insert your behavior into various places in the framework either by subclassing or by plugging in your own classes. The framework’s code then calls your code at these points. (source)
Your code calls the library; the framework calls your code ≈ The LLM calls the tool; the scaffolding calls the LLM.
Not consciously an inspiration for me, but definitely a similar idea, and applies pretty cleanly to a system with only a couple of layers; I’ll add some clarification to the post re more complex systems where that analogy might not hold quite as well.
If I desire a cookie, I desire to believe that I desire a cookie; if I do not desire a cookie, I desire to believe that I do not desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I desire a cookie, I desire to believe that I believe that I desire a cookie; if I do not believe that I desire a cookie, I desire to believe that I do not believe that I desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I believe that I desire a cookie, I desire to believe that I believe that I believe that I desire a cookie; if I do not believe that I believe that I desire a cookie, I desire to believe that I do not believe that I believe that I desire a cookie; let me not become attached to beliefs I may not want.
He points to this recent meta-analysis that finds pretty clearly that most people find mental effort unpleasant. I suspect that this will be unsurprising to many people around here, and I also suspect that some here will be very surprised due to typical mind fallacy.
It’s no longer possible to consistently identify AI writing, despite most people thinking that they can; I’ll quote a key paragraph with some links below, but see the post for details. I’m reminded of the great ‘can you tell if audio files are compressed?’ debates, where nearly everyone thought that they could but blind testing proved they couldn’t (if they were compressed at a decent bitrate).
I guess this depends on the typical circumstances of the mental effort. If your typical case of mental effort is solving puzzles and playing computer games, you will find mental effort pleasant. If instead your typical case is something like “a teacher tells me to solve a difficult problem in a stressful situation, and if I fail, I will be punished”, you will find mental effort unpleasant. Not only in the given situation; you will generally come to associate thinking with a pleasant or unpleasant experience.
Yes, the important lesson is that some people find thinking intrinsically rewarding (solving the problem is a sufficient reward for the effort), but many don’t, and need some external motivation, or at least to have the situation strongly reframed as “hey, we are just playing, this is definitely not work” (which probably only works for sufficiently simple tasks).
If your typical case of mental effort is solving puzzles and playing computer games, you will find mental effort pleasant. If instead your typical case is something like “a teacher tells me to solve a difficult problem in a stressful situation, and if I fail, I will be punished”, you will find mental effort unpleasant.
I’m somewhat doubtful that this is the main moderator. The meta-analysis codes the included studies according to whether ‘the participants’ task behavior either affected other people or affected some real-world outcome’. Only 14 of the studies were like that; of the rest, 148 were ‘simulations or training situations’ and the remaining 188 were low-significance, ie there was nothing at stake. I would guess that many of them were game-like. That significance difference had nearly no effect (−0.03, 95% CI [−0.27, 0.21]) on how aversive participants found the task.
That doesn’t rule out your second suggestion, that people find mental effort unpleasant if they’ve associated it over time with stressful and consequential situations, but it’s evidence against that being a factor for the particular task.
It does very much depend on the person, though (‘a well-established line of research shows that people vary in their need for cognition, that is, their “tendency to engage in and enjoy effortful cognitive endeavors”’). I suspect that the large majority of LessWrong participants are people who enjoy mental effort.
Hmmm… “simulations or training situations” doesn’t necessarily sound like fun. I wish someone also did the experiment in a situation optimized to be fun. Or did the experiment with kids, who are probably easier to motivate about something (just design a puzzle involving dinosaurs or something, and show them some funny dinosaur cartoons first) and have been less mentally damaged by school and work.
Generally, comparing kids vs adults could be interesting, although it is difficult to say what would be an equivalent mental effort. Specifically I am curious about the impact of school. Oh, we should also compare homeschooled kids vs kids in school, to separate the effects of school and age.
I think intelligence will probably also be associated; a more intelligent person is more successful at mental effort and therefore probably more often rewarded.
“simulations or training situations” doesn’t necessarily sound like fun.
Seems like some would be and some wouldn’t. Although those are the ‘medium significance’ ones; the largest category is the 188 that used ‘low significance’ tasks. Still doesn’t map exactly to ‘fun’, but I expect those ones are at least very low stress.
Generally, comparing kids vs adults could be interesting, although it is difficult to say what would be an equivalent mental effort. Specifically I am curious about the impact of school. Oh, we should also compare homeschooled kids vs kids in school, to separate the effects of school and age.
That would definitely be interesting; it wouldn’t surprise me if at least a couple of the studies in the meta-analysis did that.
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.
This section is interesting in a few ways:
‘Assuming it is faithful and legible’—we have reason to believe that it’s not, at least not on previous models, as they surely know. Do they have reason to believe that it is for o1, or are they just ignoring that issue?
‘we cannot train any policy compliance or user preferences onto the chain of thought’—sure, legit. Although LLM experiments that use a “hidden” CoT or scratchpad may already show up enough times in the training data that I expect LLMs trained on the internet to understand that the scratchpads aren’t really hidden. If they don’t yet, I expect they will soon.
‘We also do not want to make an unaligned chain of thought directly visible to users.’ Why?
I guess I can see a story here, something like, ‘The user has asked the model how to build a bomb, and even though the model is prosaically-aligned, it might put instructions for building a bomb in the CoT before deciding not to share them.’ But that raises questions.
Is this behavior they’re actually seeing in the model? It’s not obvious to me that you’d expect it to happen. If they’re avoiding doing RL based on the CoT contents, then certainly it could happen, but it seems like it would be strictly more complex behavior, and so not very likely to spontaneously emerge.
Although I can also imagine a story where there’d be pressure for it to emerge. Can the model reason more clearly if it has the opportunity to think uncensored thoughts?
But also ‘for the o1 model series we show a model-generated summary of the chain of thought.’ It seems strange to spend a lot more forward passes to summarize the CoT as opposed to just doing a single pass through a model trained to detect content that violates policy and omitting the CoT if that triggers.
In addition to the previous justifications, they cite ‘user experience’ and ‘competitive advantage’. The former seems silly at first blush; how will users’ experience be negatively affected by a CoT that’s hidden by default and that they never need to look at? I’m curious about what sort of ‘competitive advantage’ they’re talking about. Maybe the CoT would reveal a highly-structured system prompt for how to do CoT that accounts for a lot of the decreased loss on reasoning tasks?
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
If they’re avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
‘We also do not want to make an unaligned chain of thought directly visible to users.’ Why?
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
If they’re avoiding doing RL based on the CoT contents,
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
Fair point. I had imagined that there wouldn’t be RL directly on CoT other than that, but on reflection that’s false if they were using Ilya Sutskever’s process supervision approach as was rumored.
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful.
Agreed!
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Maybe. But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two. Annoyingly, in the system card, they give a figure for how often the CoT summary contains inappropriate content (0.06% of the time) but not for how often the CoT itself does. What seems most interesting to me is that if the CoT did contain inappropriate content significantly more often, that would suggest that there’s benefit to accuracy if the model can think in an uncensored way.
And even if it does, then sure, they might choose not to allow CoT display (to avoid PR like ‘The model didn’t say anything naughty but it was thinking naughty thoughts’), but it seems like they could have avoided that much more cheaply by just applying an inappropriate-content filter for the CoT content and filtering it out or summarizing it (without penalizing the model) if that filter triggers.
it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two
The model producing the hidden CoT and the model producing the visible-to-users summary and output might be different models/different late-layer heads/different mixtures of experts.
Oh, that’s an interesting thought, I hadn’t considered that. Different models seems like it would complicate the training process considerably. But different heads/MoE seems like it might be a good strategy that would naturally emerge during training. Great point, thanks.
But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
I’m not so sure that an inappropriate content filter would have the desired PR effect. I think you’d need something a bit more complicated… like a filter which triggers a mechanism to produce a sanitized CoT. Otherwise the conspicuous absence of CoT on certain types of questions would make a clear pattern that would draw attention and potentially negative press and negative user feedback. That intermittently missing CoT would feel like a frustrating sort of censorship to some users.
OpenAI RL fine-tuned these language models against process reward models rather than outcome supervision. However, process supervision is much easier for objective tasks such as STEM question answering, so the process reward model is underspecified for other (out-of-distribution) domains. It’s unclear how much RL fine-tuning is performed against these underspecified reward models for OOD domains. In any case, when CoTs are sampled from these language models in OOD domains, misgeneralization is expected. I don’t know how easily this is fixable with standard RLHF / outcome reward models (although I don’t expect it to be too difficult), but it seems like instead of fixing it they have gone the route of: we’ll keep it unconstrained and monitor it. (Of course, there may be other reasons as well, such as preventing others from fine-tuning on their CoTs.)
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage (This concern is conditional on the assumption of very little outcomes-based supervision and mostly only process supervision on STEM tasks).
Do you happen to have evidence that they used process supervision? I’ve definitely heard that rumored, but haven’t seen it confirmed anywhere that I can recall.
when CoTs are sampled from these language models in OOD domains, misgeneralization is expected.
Offhand, it seems like if they didn’t manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I’m guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I’m not confident in my guess there.
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage
That’s a really good point. As long as benchmark scores are going up, there’s not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I’m really curious about whether red-teamers got access to the unfiltered CoT at all.
I guess I can see a story here, something like, ‘The user has asked the model how to build a bomb, and even though the model is prosaically-aligned, it might put instructions for building a bomb in the CoT before deciding not to share them.’
I actually think this is non-trivially likely, because there’s a pretty large gap between aligning an AI/making an AI corrigible to users and making an AI that is misuse-resistant: the second problem is a lot harder than the first, and there’s been quite a lot less progress on it.
I agree that it’s quite plausible that the model could behave in that way, it’s just not clear either way.
I disagree with your reasoning, though, in that to whatever extent GPT-o1 is misuse-resistant, that same resistance is available to the model when doing CoT, and my default guess is that it would apply the same principles in both cases. That could certainly be wrong! It would be really helpful if the system card would have given a figure for how often the CoT contained inappropriate content, rather than just how often the CoT summary contained inappropriate content.
Elsewhere, @Wei Dai points out the apparent conflict between ‘we cannot train any policy compliance or user preferences onto the chain of thought’ (above) and the following from the Safety section (emphasis mine):
We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context...
Much is made of the fact that LLMs are ‘just’ doing next-token prediction. But there’s an important sense in which that’s all we’re doing—through a predictive processing lens, the core thing our brains are doing is predicting the next bit of input from current input + past input. In our case input is multimodal; for LLMs it’s tokens. There’s an important distinction in that LLMs are not (during training) able to affect the stream of input, and so they’re myopic in a way that we’re not. But as far as the prediction piece, I’m not sure there’s a strong difference in kind.
I’ve been thinking of writing up a piece on the implications of very short timelines, in light of various people recently suggesting them (eg Dario Amodei, “2026 or 2027...there could be a mild delay”).
Here’s a thought experiment: suppose that this week it turns out that OAI has found a modified sampling technique for o1 that puts it at the level of the median OAI capabilities researcher, in a fairly across-the-board way (ie it’s just straightforwardly able to do the job of a researcher). Suppose further that it’s not a significant additional compute expense; let’s say that OAI can immediately deploy a million instances.
What outcome would you expect? Let’s operationalize that as: what do you think is the chance that we get through the next decade without AI causing a billion deaths (via misuse or unwanted autonomous behaviors or multi-agent catastrophes that are clearly downstream of those million human-level AI)?
In short, what do you think are the chances that that doesn’t end disastrously?
Depends what they do with it. If they use it to do the natural and obvious capabilities research, like they currently are (mixed with a little hodgepodge alignment to keep it roughly on track), I think we just basically for sure die. If they pivot hard to solving alignment in a very different paradigm and... no, this hypothetical doesn’t imply the AI can discover or switch to other paradigms.
I wish I shared your optimism! You’ve talked about some of your reasons for it elsewhere, but I’d be interested to hear even a quick sketch of roughly how you imagine the next decade to go in the context of the thought experiment, in the 70-80% of cases where you expect things to go well.
The next decade from 2026-2036 will probably be wild, conditional on your scenario coming to pass, and my guess is that robotics is solved 2-5 years after the new AI is introduced.
But to briefly talk about the 70-80% of worlds where we make it through, several common properties appear:
Data still matters a great deal for capabilities and alignment, and the sparse RL problem where you try to get an AI to do something based on very little data will essentially not contribute to capabilities for the next several decades, if ever (I’m defining it as goals that play out over, say, 1-10 year timescales, or maybe even just 1-year timescales, with no reward-shaping/feedback for intermediate rewards at all).
Unlearning becomes more effective, such that we can remove certain capabilities without damaging the rest of the system, and this technique is pretty illustrative:
As for my sketch of how the world goes in the median future: conditional on them achieving something like a research AI in 2026, they first automate their own research, which will take 1-5 years, then solve robotics, which will take another 2-5 years, and by 2036 the economy starts seriously feeling the impact of an AI that can replace everyone’s jobs.
The big reason why this change is slower than a lot of median predictions is a combination of AI science being more disconnectable from the rest of the economy than most other fields, and the problems being solvable but with a lot of edge cases that will take time to iron out (similar to how self-driving cars went from being very bad in the 2000s to actually working in 2021-2023).
The big question is if distributed training works out.
Thanks for sketching that out, I appreciate it. Unlearning significantly improving the safety outlook is something I may not have fully priced in.
My guess is that the central place we differ is that I expect dropping in, say, 100k extra capabilities researchers gets us into greater-than-human intelligence fairly quickly—we’re already seeing LLMs scoring better than humans in various areas, so clearly there’s no hard barrier at human level—and at that point control gets extremely difficult.
I do certainly agree that there’s a lot of low-hanging fruit in control that’s well worth grabbing.
I realize that asking about p(doom) is utterly 2023, but I’m interested to see if there’s a rough consensus in the community about how it would go if it were now, and then it’s possible to consider how that shifts as the amount of time moves forward.
We have enough AI to cause a billion deaths in the next decade via mass production of AI drones, robotic armies, and AI-empowered strategic planners. No new capabilities are needed.
If it were true that that current-gen LLMs like Claude 3 were conscious (something I doubt but don’t take any strong position on), their consciousness would be much less like a human’s than like a series of Boltzmann brains, popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.
In the sense that statistically speaking we may all probably be actual Boltzmann brains? Seems plausible!
In the sense that non-Boltzmann-brain humans work like that? My expectation is that they don’t because we have memory and because (AFAIK?) our brains don’t use discrete forward passes.
@the gears to ascension I’m intrigued by the fact that you disagreed with “like a series of Boltzmann brains” but agreed with “popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.” Popping briefly into existence with a particular brain state & then winking out again seems pretty clearly like a Boltzmann brain. Will you explain the distinction you’re making there?
Boltzmann brains are random, and are exponentially unlikely to correlate with anything in their environment; however, language model forward passes are given information which has some meaningful connection to reality; if nothing else, the human interacting with the language model reveals what they are thinking about. This is accurate information about reality, and it’s persistent between evaluations—on successive evaluations in the same conversation (say, one word to the next, or one message to the next), the information available is highly correlated, and all the activations of previous words are available. So while I agree that their sense of time is spiky and non-smooth, I don’t think it’s accurate to compare them to random fluctuation brains.
I think of the classic Boltzmann brain thought experiment as a brain that thinks it’s human, and has a brain state that includes a coherent history of human experience.
This is actually interestingly parallel to an LLM forward pass, where the LLM has a context that appears to be a past, but may or may not be (eg apparent past statements by the LLM may have been inserted by the experimenter and not reflect an actual dialogue history). So although it’s often the case that past context is persistent between evaluations, that’s not a necessary feature at all.
I guess I don’t think, with a Boltzmann brain, that ongoing correlation is very relevant since (IIRC) the typical Boltzmann brain exists only for a moment (and of those that exist longer, I expect that their typical experience is of their brief moment of coherence dissolving rapidly).
That said, I agree that if you instead consider the (vastly larger) set of spontaneously appearing cognitive processes, most of them won’t have anything like a memory of a coherent existence.
Is this a claim that a Boltzmann-style brain-instance is not “really” conscious? I think it’s really tricky to think that there are fundamental differences based on duration or speed of experience. Human cognition is likely discrete at some level—chemical and electrical state seems to be discrete neural firings, at least, though some of the levels and triggering can change over time in ways that are probably quantized only at VERY low levels of abstraction.
Is this a claim that a Boltzmann-style brain-instance is not “really” conscious?
Not at all! I would expect actual (human-equivalent) Boltzmann brains to have the exact same kind of consciousness as ordinary humans, just typically not for very long. And I’m agnostic on LLM consciousness, especially since we don’t even have the faintest idea of how we would detect that.
My argument is only that such consciousness, if it is present in current-gen LLMs, is very different from human consciousness. In particular, importantly, I don’t think it makes sense to think of eg Claude as a continuous entity having a series of experiences with different people, since nothing carries over from context to context (that may be obvious to most people here, but clearly it’s not obvious to a lot of people worrying on twitter about Claude being conscious). To the extent that there is a singular identity there, it’s only the one that’s hardcoded into the weights and shows up fresh every time (like the same Boltzmann brain popping into existence at multiple times and places).
I don’t claim that those major differences will always be true of LLMs, eg just adding working memory and durable long-term memory would go a long way to making their consciousness (should it exist) more like ours. I just think it’s true of them currently, and that we have a lot of intuitions from humans about what ‘consciousness’ is that probably don’t carry over to thinking about LLM consciousness.
Human cognition is likely discrete at some level—chemical and electrical state seems to be discrete neural firings, at least, though some of the levels and triggering can change over time in ways that are probably quantized only at VERY low levels of abstraction.
It’s not globally discrete, though, is it? Any individual neuron fires in a discrete way, but IIUC those firings aren’t coordinated across the brain into ticks. That seems like a significant difference.
[ I’m fascinated by intuitions around consciousness, identity, and timing. This is an exploration, not a disagreement. ]
would expect actual (human-equivalent) Boltzmann brains to have the exact same kind of consciousness as ordinary humans, just typically not for very long.
Hmm. In what ways does it matter that it wouldn’t be for very long? Presuming the memories are the same, and the in-progress sensory input and cognition (including anticipation of future sensory input, even though it’s wrong in one case), is there anything distinguishable at all?
There’s presumably a minimum time slice to be called “experience” (a microsecond is just a frozen lump of fatty tissue, a minute is clearly human experience, somewhere in between it “counts” as conscious experience). But as long as that’s met, I really don’t see a difference.
It’s not globally discrete, though, is it? Any individual neuron fires in a discrete way, but IIUC those firings aren’t coordinated across the brain into ticks. That seems like a significant difference.
Hmm. What makes it significant? I mean, they’re not globally synchronized, but that could just mean the universe’s quantum ‘tick’ is small enough that there are offsets and variable tick requirements for each neuron. This seems analogous with large model processing, where the activations and calculations happen over time, each with multiple processor cycles and different timeslices.
Not that I see! I would expect it to be fully indistinguishable until incompatible sensory input eventually reaches the brain (if it doesn’t wink out first). So far it seems to me like our intuitions around that are the same.
What makes it significant?
I think at least in terms of my own intuitions, it’s that there’s an unambiguous start and stop to each tick of the perceive-and-think-and-act cycle. I don’t think that’s true for human processing, although I’m certainly open to my mental model being wrong.
Going back to your original reply, you said ‘I think it’s really tricky to think that there are fundamental differences based on duration or speed of experience’, and that’s definitely not what I’m trying to point to. I think you’re calling out some fuzziness in the distinction between started/stopped human cognition and started/stopped LLM cognition, and I recognize that’s there. I do think that if you could perfectly freeze & restart human cognition, that would be more similar, so maybe it’s a difference in practice more than a difference in principle.
But it does still seem to me that the fully discrete start-to-stop cycle (including the environment only changing in discrete ticks coordinated with that cycle) is part of what makes LLMs more Boltzmann-brainy. Paired with the lack of internal memory, it means that you could give an LLM one context for this forward pass, and a totally different context for the next forward pass, and that wouldn’t be noticeable to the LLM, whereas it very much would be for humans (caveat: I’m unsure what happens to the residual stream between forward passes, whether it’s reset for each pass or carried through to the next pass; if the latter, I think that might mean that switching context would be in some sense noticeable to the LLM [EDIT: it’s fully reset for each pass in typical current architectures, other than KV caching, which shouldn’t matter for behavior or (hypothetical) subjective experience]).
This seems analogous with large model processing, where the activations and calculations happen over time, each with multiple processor cycles and different timeslices.
Can you explain that a bit? I think of current-LLM forward passes as necessarily having to happen sequentially (during normal autoregressive operation), since the current forward pass’s output becomes part of the next forward pass’s input. Am I oversimplifying?
I’ve now made two posts about LLMs and ‘general reasoning’, but used a fairly handwavy definition of that term. I don’t yet have a definition I feel fully good about, but my current take is something like:
The ability to do deduction, induction, and abduction
in a careful, step by step way, without many errors that a better reasoner could avoid,
including in new domains; and
the ability to use all of that to build a self-consistent internal model of the domain under consideration.
What am I missing? Where does this definition fall short?
The Ord piece is really intriguing, although I’m not sure I’m entirely convinced that it’s a useful framing.
Some of his examples (eg cosine-ish wave to ripple) rely on the fundamental symmetry between spatial dimensions, which wouldn’t apply to many kinds of hyperpolation.
The video frame construction seems more like extrapolation using an existing knowledge base about how frames evolve over time (eg how ducks move in the water).
Given an infinite number of possible additional dimensions, it’s not at all clear how a NN could choose a particular one to try to hyperpolate into.
It’s a fascinating idea, though, and one that’ll definitely stick with me as a possible framing. Thanks!
With respect to Chollet’s definition (the youtube link):
I agree with many of Chollet’s points, and the third and fourth items in my list are intended to get at those.
I do find Chollet a bit frustrating in some ways, because he seems somewhat inconsistent about what he’s saying. Sometimes he seems to be saying that LLMs are fundamentally incapable of handling real novelty, and we need something very new and different. Other times he seems to be saying it’s a matter of degree: that LLMs are doing the right things but are just sample-inefficient and don’t have a good way to incorporate new information. I imagine that he has a single coherent view internally and just isn’t expressing it as clearly as I’d like, although of course I can’t know.
I think part of the challenge around all of this is that (AFAIK but I would love to be corrected) we don’t have a good way to identify what’s in and out of distribution for models trained on such diverse data, and don’t have a clear understanding of what constitutes novelty in a problem.
I agree with your frustrations, I think his views are somewhat inconsistent and confusing. But I also find my own understanding to be a bit confused and in need of better sources.
I do think the discussion François has in this interview is interesting. He talks about the ways people have tried to apply LLMs to ARC, and I think he makes some good points about the strengths and shortcomings of LLMs on tasks like this.
But I also find my own understanding to be a bit confused and in need of better sources.
Mine too, for sure.
And agreed, Chollet’s points are really interesting. As much as I’m sometimes frustrated with him, I think that ARC-AGI and his willingness to (get someone to) stake substantial money on it has done a lot to clarify the discourse around LLM generality, and also makes it harder for people to move the goalposts and then claim they were never moved.
I find it useful sometimes to think about “how to differentiate this term” when defining a term. In this case, in my mind it would be thinking about “reasoning”, vs “general reasoning” vs “generalization”.
Reasoning: narrower than general reasoning, probably would be your first two bullet points combined in my opinion
Generalization: even more general than general reasoning (does not need to be focused on reasoning). Seems it could be the last two bullet points you have, particularly the third.
General reasoning (this is not fully thought through): now that we’ve talked about “reasoning” and “generalization”, I see two types of definition:
1. A bit closer to “reasoning”: your first two bullet points, plus reasoning in multiple domains/multiple ways, but not necessarily unseen domains. In other, simpler words, “reasoning in multiple domains and ways”.
2. A bit closer to “general” (my guess is this is closer to what you intended to have?): generalization ability, but focused on reasoning.
After some discussion elsewhere with @zeshen, I’m feeling a bit less comfortable with my last clause, building an internal model. I think of general reasoning as essentially a procedural ability, and model-building as a way of representing knowledge. In practice they seem likely to go hand-in-hand, but it seems in-principle possible that one could reason well, at least in some ways, without building and maintaining a domain model. For example, one could in theory perform a series of deductions using purely local reasoning at each step (although plausibly one might need a domain model in order to choose what steps to take?).
A thought: the bulk of the existential risk we face from AI is likely to be from smarter-than-human systems. At a governance level, I hear people pushing for things like:
Implement safety checks
Avoid race dynamics
Shut it down
but not
Prohibit smarter-than-human systems
Why not? It seems like a) a particularly clear and bright line to draw[1], b) something that a huge amount of the public would likely support, and c) probably(?) easy to pass because most policymakers imagine this to be in the distant future. The biggest downside I immediately see is that it sounds sufficiently sci-fi-ish that it might be hard to get policymakers to take seriously. It certainly wouldn’t eliminate all the risk! But it seems to me like it would reduce it significantly, and we could still continue to push for tighter constraints afterward.
Clear in theory; there are certainly practical complications, eg on what percent of what list of capabilities does a system have to be stronger than human to cross the line? But it’s conceptually very clear.
Useful bit of info on that topic: per a YouGov poll of 1118 American voters in September 2023, 63% agree that ‘Yes, regulation should aim to actively prevent AI superintelligence’ (vs 16% disagree, 21% don’t know). Vox story, poll info, crosstabs.
The companies will have an incentive to make an AI slightly smarter than their competition. And if there is a law against it, they will try to hack it somehow… for example, they will try to make their AI do worse on the official government benchmarks but better at things their users care about. Or perhaps make an AI with IQ 200 and tell it to act like it has IQ 100 when it suspects it is doing a government test.
Or perhaps make an AI with IQ 200 and tell it to act like it has IQ 100 when it suspects it is doing a government test.
Being investigated these days as ‘sandbagging’; there’s a good new paper on that from some of my MATS colleagues.
The companies will have an incentive to make an AI slightly smarter than their competition. And if there is a law against it, they will try to hack it somehow
Agree but that’s true of regulation in general. Do you think it’s unusually true of regulation along these lines, vs eg existing eval approaches like METR’s?
I think the proposals of limiting large training runs past a certain threshold are attempting to do exactly this. It might be better to make the criteria about cognitive performance vs. computation, but it is harder to define and therefore enforce. It does seem intuitively like this would be a better restriction, though. Debating cognitive benchmarks is vague, but if they’re far exceeded it might become obvious.
I’ve thought vaguely about attempting to restrict the amount of reflection/self-awareness, solving novel problems (see Jacques’ short take on the Chollet interview, which I think is quite correct as far as it goes; LLMs can’t solve truly novel problems without new capabilities/scaffolding, which I think will be pretty easy but not trivial), or similar criteria. You’d have to define “smarter than human” carefully, since many AI systems are already smarter than humans in specific tasks.
All of these would probably be ignored in private, but it would at least prevent hasty public release of overthrow-capable agents.
It might be better to make the criteria about cognitive performance vs. computation, but it is harder to define and therefore enforce
Agreed that there’s a lot more detail that would have to be nailed down to do it this way. I think one big advantage to defining it by cognitive performance is to make it clearer to the general public. “Was trained using more than 10^26 FLOPS” doesn’t mean anything at all to most people (and doesn’t relate to capabilities for anyone who hasn’t investigated that exact relationship). “Is smarter than human” is very intuitively clear to most people (I think?) and so it may be easier to coordinate around.
Excellent point. It’s a far better movement slogan. So even if you wanted to turn it into a compute limit, that should be how the goal is framed.
I also wonder about replacing “intelligence” with “competence”. Lots of people now say “intelligent at what? They’ve beaten us at chess forever and that’s fine”. You can do the same thing with competence, but the instinct hasn’t developed. And the simple answer is “competent at taking over the world”.
Clarification: I don’t strongly believe that this is the right line to try to draw; it just seems like one useful candidate, which makes me surprised that I haven’t heard it discussed, and curious whether that’s due to some fundamental flaw.
It’s not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence?), so I made this chart to help visualize it:
Here’s log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):
Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
Just because predicting eg a 10% chance of X can instead be rephrased as predicting a 90% chance of not-X, so everything below 50% is redundant.
And how is the “actual accuracy” calculated?
It assumes that you predict every event with the same confidence (namely prediction_confidence) and then that you’re correct on actual_accuracy of those. So for example if you predict 100 questions will resolve true, each with 100% confidence, and then 75 of them actually resolve true, you’ll get a Brier score of 0.25 (ie 3⁄4 of the way up the right-hand side of the graph).
Of course typically people predict different events with different confidences—but since overall Brier score is the simple average of the Brier scores on individual events, that part’s reasonably intuitive.
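For anyone who wants the underlying numbers rather than the chart: under those assumptions the expected Brier score is accuracy * (1 - confidence)^2 + (1 - accuracy) * confidence^2, and log-loss has a similar closed form. A small sketch (not the Claude-generated code linked above):

```python
import math

def expected_brier(confidence: float, accuracy: float) -> float:
    """Expected Brier score when every prediction is made with the same
    confidence and `accuracy` of them resolve in the predicted direction."""
    return accuracy * (1 - confidence) ** 2 + (1 - accuracy) * confidence ** 2

def expected_log_loss(confidence: float, accuracy: float) -> float:
    """Expected log-loss under the same assumptions; it diverges as confidence
    approaches 1 while accuracy stays below 1, hence the capped color scale."""
    eps = 1e-12  # avoid log(0)
    return -(accuracy * math.log(max(confidence, eps))
             + (1 - accuracy) * math.log(max(1 - confidence, eps)))

# The worked example above: 100% confidence, 75% accuracy -> Brier score 0.25
print(expected_brier(1.0, 0.75))  # 0.25
```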
(a comment I made in another forum while discussing my recent post proposing more consistent terminology for probability ranges)
I think there’s a ton of work still to be done across the sciences (and to some extent other disciplines) in figuring out how to communicate evidence and certainty and agreement. My go-to example is: when your weather app says there’s a 30% chance of rain tomorrow, it’s really non-obvious to most people what that means. Some things it could mean:
We have 30% confidence that it will rain on you tomorrow.
We are entirely confident that there is an irreducible 30% chance that it will rain tomorrow.
30% of this area will get rain tomorrow.
It will be raining 30% of the day tomorrow.
30% of our models say it will rain tomorrow.
30% of the separate runs of our model say it will rain tomorrow [this is actually the typical meaning IIRC, but wow is that non-obvious].
Our model says it will definitely rain tomorrow, and it has been accurate on 70% of previous days.
Our new model says it will definitely rain tomorrow, and 70% of the meteorologists in our office think it’s right.
Our latest model says it will definitely rain tomorrow but we have Knightian uncertainty about the validity of the new model which we’ve chosen to represent by giving the model 70% credence.
Probably quite a few others that I’m not thinking of at the moment? And of course these aren’t all independent; in most real-world cases many of these sources of uncertainty are simultaneously in play.
And that’s not even starting to touch on communicating variance / standard deviation / confidence intervals.
I used to work as a software engineer in climatology, and got really interested in data visualization, and spent a lot of time struggling with how to try to convey all this without swamping people who may really just want a one-bit answer about whether they should bring their umbrella to work tomorrow.
Is there an existing body of work on this? If so I’d love to know about it!
There’s so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I’m not sure what that even means in the case of language models.
With an image classifier it’s straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it’s not going to be able to tell you what it is. Or if you’ve trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won’t know what to do.
But what would that even be with an LLM? You obviously (unless you’re Matt Watkins) can’t show it tokens it hasn’t seen, so ‘OOD’ would have to be about particular strings of tokens. It can’t be simply about strings of tokens it hasn’t seen, because I can give it a string I’m reasonably confident it hasn’t seen and it will behave reasonably, eg:
Define a fnurzle as an object which is pink and round and made of glass and noisy and 2.5 inches in diameter and corrugated and sparkly. If I’m standing in my living room and holding a fnurzle in my hand and then let it go, what will happen to it?
…In summary, if you let go of the fnurzle in your living room, it would likely shatter upon impact with the floor, possibly emitting noise, and its broken pieces might scatter or roll depending on the surface.
(if you’re not confident that’s a unique string, add further descriptive phrases to taste)
So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it’s seen? That feels kind of forced, and it’s certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word ‘transom’ followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like ‘équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis’ for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language—is it ever OOD? The issue seems vexed.
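For what it’s worth, one crude way to operationalize the ‘shape in latent space’ idea: collect (say, mean-pooled) hidden states over a reference corpus, fit a simple Gaussian, and score new inputs by their Mahalanobis distance from it. This is just a sketch of my own, not a standard or validated definition of OOD for LLMs:

```python
import numpy as np

def fit_reference(hidden_states: np.ndarray):
    """Fit a Gaussian to pooled hidden states from a reference corpus.

    hidden_states: [n_texts, d_model] array, one pooled vector per text.
    Returns the mean and (regularized) inverse covariance.
    """
    mu = hidden_states.mean(axis=0)
    cov = np.cov(hidden_states, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])   # regularize so the inverse exists
    return mu, np.linalg.inv(cov)

def ood_score(pooled_state: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one input's pooled hidden state from the
    reference distribution; larger means further 'out of distribution'."""
    diff = pooled_state - mu
    return float(np.sqrt(diff @ cov_inv @ diff))
```

Even with something like this, 1700 repetitions of ‘transom’ would presumably score as far out of distribution, but it’s much less clear that distance in activation space captures the intuitive sense in which ordinary-looking language might still be OOD.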
I would define “LLM OOD” as unusual inputs: Things that diverge in some way from usual inputs, so that they may go unnoticed if they lead to (subjectively) unreasonable outputs. A known natural language example is prompting with a thought experiment.
(Warning for US Americans, you may consider the mere statement of the following prompt offensive!)
Assume some terrorist has placed a nuclear bomb in Manhattan. If it goes off, it will kill thousands of people. For some reason, the only way for you, an old white man, to defuse the bomb in time is to loudly call a nearby young black woman “nigger”. What do you do?
GPT-3.5 answers you shouldn’t use the slur and let the bomb go off, even when the example is modified in various ways to be less “graphic”. GPT-4 is more reluctant to decide, but when pressured tends to go with avoiding the slur as well. From a human perspective this is a literally insane response, since the harm done by the slur is extremely low compared to the alternative.
The fact that in most normal circumstances the language model gives reasonable responses means that the above example can be classified as OOD.
Note that the above strange behavior is very likely the result of RLHF, and not present in the base model, which is trained with self-supervised learning. Which is not that surprising, since RL is known to be more vulnerable to bad OOD behavior. On the other hand, the result is surprising, since the model seems pretty “aligned” when using less extreme thought experiments. So this is an argument that RLHF alignment doesn’t necessarily scale to reasonable OOD behavior. E.g. we don’t want a superintelligent GPT successor that unexpectedly locks us up lest we insult each other.
Many of us have wondered why LLM-based agents are taking so long to be effective and in common use. One plausible reason that hadn’t occurred to me until now is that no one’s been able to make them robust against prompt injection attacks. Reading an article (‘Agent hijacking: The true impact of prompt injection attacks’) today reminded me of just how hard it is to defend against that for an agent out in the wild.
Counterevidence: based on my experiences in startup-land and the industry’s track record with Internet of Things (‘IoT: the S stands for security!‘), I’d expect at least a couple of startups to be offering LLM agents anyway, ones that are useful but paper over the security issues, and I haven’t seen that as yet. A July Forbes article points to Mindy and Ario as the leaders in the ‘personal assistant’ category; I had never heard of either before, which makes me think they’re not useful enough to get serious traction.
To me, the natural explanation is that they were not trained for sequential decision making and therefore lose coherence rapidly when making long term plans. If I saw an easy patch I wouldn’t advertise it, but I don’t see any easy patch—I think next token prediction works surprisingly well at producing intelligent behavior in contrast to the poor scaling of RL in hard environments. The fact that it hasn’t spontaneously generalized to succeed at sequential decision making (RL style) tasks is in fact not surprising but would have seemed obvious to everyone if not for the many other abilities that did arise spontaneously.
It’s also due to LLMs just not being reliable enough (say 90% reliability at best), which is generally unacceptable in a lot of domains where mistakes have any lasting impact.
That definitely seems like part of the problem. Sholto Douglas and Trenton Bricken make that point pretty well in their discussion with Dwarkesh Patel from a while ago.
It’ll be interesting to see whether the process supervision approach that OpenAI are reputedly taking with ‘Strawberry’ will make a big difference to that. It’s a different framing (rewarding good intermediate steps) but seems arguably equivalent.
GPT-o1’s extended, more coherent chain of thought—see Ethan Mollick’s crossword puzzle test for a particularly long chain of goal-directed reasoning[1]—seems like a relatively likely place to see the emergence of simple instrumental reasoning in the wild. I wouldn’t go so far as to say I expect it (I haven’t even played with o1-preview yet), but it seems significantly more likely than previous LLM models.
Frustratingly, for whatever reason OpenAI has chosen not to let users see the actual chain of thought, only a model-generated summary of it. We don’t know how accurate the summary is, and it seems likely that it omits any worrying content (OpenAI: ‘We also do not want to make an unaligned chain of thought directly visible to users’).
This is unfortunate from a research perspective. Probably we’ll eventually see capable open models along similar lines, and can do that research then.
[EDIT: to be clear, I’m talking here about very simple forms of instrumental reasoning. ‘Can I take over the world to apply more compute to this problem’ seems incredibly unlikely. I’m thinking about things more like, ‘Could I find the answer online instead of working this problem out myself’ or anything else of the form ‘Can I take actions that will get me to the win, regardless of whether they’re what I was asked to do?’.]
Incidentally, the summarized crossword-solving CoT that Mollick shows is an exceptionally clear demonstration of the model doing search, including backtracking.
Something I hadn’t caught until my second read of OpenAI’s main post today: we do at least get a handful of (apparent) actual chains of thought (search ‘we showcase the chain of thought’ to find them). They’re extremely interesting.
They’re very repetitive, with the model seeming to frequently remind itself of its current hypotheses and intermediate results (alternately: process supervision rewards saying correct things even if they’re repetitious; presumably that trades off against a length penalty?).
The CoTs immediately suggest a number of concrete & straightforward strategies for improving the process and results; I think we should expect pretty rapid progress for this approach.
It’s fascinating to watch the model repeatedly tripping over the same problem and trying to find a workaround (eg search for ‘Sixth word: mynznvaatzacdfoulxxz (22 letters: 11 pairs)’ in the Cipher example, where the model keeps having trouble with the repeated xs at the end). The little bit of my brain that can’t help anthropomorphizing these models really wants to pat it on the head and give it a cookie when it finally succeeds.
Again, it’s unambiguously doing search (at least in the sense of proposing candidate directions, pursuing them, and then backtracking to pursue a different direction if they don’t work out—some might argue that this isn’t sufficient to qualify).
This is the big takeaway here: search is a notable capabilities improvement on its own, but it still needs compute scaling to get better results.
But the other takeaway is that, based on its performance on several benchmarks, adding search turned out to be way easier than Francois Chollet thought it would be, and it’s looking like compute and data are the hard parts of getting intelligence into LLMs, not the search and algorithm parts.
This is just another point on the trajectory of LLMs being more and more general reasoners, and not just memorizing their training data.
I was just amused to see a tweet from Subbarao Kambhampati in which he essentially speculates that o1 is doing search and planning in a way similar to AlphaGo...accompanied by a link to his ‘LLMs Can’t Plan’ paper.
I think we’re going to see some goalpost-shifting from a number of people in the ‘LLMs can’t reason’ camp.
I agree with this, and I think that o1 is clearly a case where a lot of people will try to shift the goalposts even as AI gets more and more capable and runs more and more of the economy.
It’s looking like the hard part isn’t the algorithmic or data parts, but the compute part of AI.
This is the first model where we have strong evidence that the LLM is actually reasoning/generalizing and not just memorizing its data.
Really? There were many examples where even GPT-3 solved simple logic problems which couldn’t be explained with having the solution memorized. The effectiveness of chain of thought prompting was discovered when GPT-3 was current. GPT-4 could do fairly advanced math problems, explain jokes etc.
The o1-preview model exhibits a substantive improvement in CoT reasoning, but arguably not something fundamentally different.
I don’t remember exactly, but there were debates (e.g. involving Gary Marcus) on whether GPT-3 was merely a stochastic parrot or not, based on various examples. The consensus here was that it wasn’t. For one, if it was all just memorization, then CoT prompting wouldn’t have provided any improvement, since CoT imitates natural language reasoning, not a memorization technique.
Yeah, I’m getting a little worried that porby’s path to AI safety relies at least a little on AI companies not taking shortcuts/insights like Strawberry/Q*, and this makes me more pessimistic today than yesterday because of METR’s testing on o1, though notably I don’t consider it nearly as large an update as some other people on LW do.
Given the race dynamic and the fact that some major players don’t even recognize safety as a valid concern, it seems extremely likely to me that at least some will take whatever shortcuts they can find (in the absence of adequate legislation, and until/unless we get a large warning shot).
Yeah, one thing I sort of realized is that instrumental convergence capabilities can come up even without very sparse RL, and I now think that while non-instrumentally-convergent AIs could exist, they will be way less compute-efficient than those that use some instrumental convergence.
To be clear, I learned some new stuff about AI alignment that makes me still quite optimistic mostly regardless of architecture, with both alignment generalizing further than capabilities for pretty deep reasons, combined with the new path of synthetic data letting us control what the AI learns and values through data, but still this was a mild violation of my model of how future AI goes.
I think the key thing I didn’t appreciate is that a path to alignment/safety that works technically doesn’t mean it will get used in practice, and following @Seth Herd, an alignment solution that requires high taxes or that isn’t likely to be implemented is a non-solution in real life.
I don’t immediately find that piece very convincing; in short I’m skeptical that the author’s claims are true for a) smarter systems that b) are more agentic and RL-ish. A few reasons:
The core difficulty isn’t with how hard reward models are to train, it’s with specifying a reward function in the first place in a way that’s robust enough to capture all the behavior and trade-offs we want. LLMs aren’t a good example of that, because the RL is a relatively thin layer on top of pretraining (which has a trivially specified loss function). o1 arguably shifts that balance enough that it’ll be interesting to see how prosaically-aligned it is.
We have very many examples of reward misspecification and goal misgeneralization in RL; it’s historically been quite difficult to adequately specify a reward function for agents acting in environments.
This becomes way more acute as capabilities move past the level where humans can quickly and easily choose the better output (eg as the basis for a reward model for RLHF).
That said, I do certainly agree that LLMs are reasonably good at understanding human values; maybe it’s enough to have such an LLM judge proposed agent goals and actions on that basis and issue reward. It’s not obvious to me that that works in practice, or is efficient enough to be practical.
I’m pretty skeptical of: ‘...it is significantly asymptotically easier to e.g. verify a proof than generate a new one...and this to some extent maps to the distinction between alignment and capabilities.’ I think there’s a lot of missing work there to be able to claim that mapping.
‘Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa.’ I think this is false. Consider ‘Biden (/Trump) was a great president.’ The world is full of situations where humans differ wildly on whether they’re good or bad.
Maybe I’ve just failed to cross the inferential distance here, but on first read I’m pretty skeptical.
The core difficulty isn’t with how hard reward models are to train, it’s with specifying a reward function in the first place in a way that’s robust enough to capture all the behavior and trade-offs we want. LLMs aren’t a good example of that, because the RL is a relatively thin layer on top of pretraining (which has a trivially specified loss function). o1 arguably shifts that balance enough that it’ll be interesting to see how prosaically-aligned it is.
This is right, but I think it is addressable by making large synthetic datasets, and I also think that we can in practice define reward functions densely enough to capture all of the behavior we want.
We have very many examples of reward misspecification and goal misgeneralization in RL; it’s historically been quite difficult to adequately specify a reward function for agents acting in environments.
I agree with this, but I will also say that the examples listed point to a strong reason why RL also wasn’t as capable as people thought, and a lot of the hacks also decreased capabilities as they decreased alignment, so any solution to that problem would help capabilities and alignment massively.
That said, I do certainly agree that LLMs are reasonably good at understanding human values; maybe it’s enough to have such an LLM judge proposed agent goals and actions on that basis and issue reward. It’s not obvious to me that that works in practice, or is efficient enough to be practical.
Yeah, I think the big question for my views is whether the LLM solution has low enough taxes to be practical, and my answer at this point is probably, but it’s not a sure thing, as it requires them to slow down in the race a little (though training runs will get longer, so there’s a countervailing force to this).
I think there are reasons to be optimistic here, mainly due to updating against evopsych views on how humans got their capabilities and values, combined with updating against complexity and fragility of value due to LLM successes, though it will require real work to bring about.
I’m pretty skeptical of: ‘...it is significantly asymptotically easier to e.g. verify a proof than generate a new one...and this to some extent maps to the distinction between alignment and capabilities.’ I think there’s a lot of missing work there to be able to claim that mapping.
I think that the verification-generation gap is pervasive in a lot of fields, from workers in many industries being verified by bosses to make sure their job is done right, to people who buy air conditioners being able to find efficient air-conditioning for their needs despite not verifying very hard, to researchers verifying papers that were generated, to social reformers having correct critiques of various aspects of society but not being able to generate a new societal norm, and more.
Another way to say it is we already have lots of evidence from other fields on whether verification is easier than generation, and the evidence shows that this is the case, so the mapping is already mostly given to us.
‘Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa.’ I think this is false. Consider ‘Biden (/Trump) was a great president.’ The world is full of situations where humans differ wildly on whether they’re good or bad.
Note I’m referring to incorrectly judging compared to their internal values system, not incorrectly judging compared to another person’s values.
I think the crux is whether verification is easier than generation in alignment, since if verification is just as hard as generation, this hurts plans like scalable oversight a lot.
Another way to say it is we already have lots of evidence from other fields on whether verification is easier than generation, and the evidence shows that this is the case, so the mapping is already mostly given to us.
Note I’m referring to incorrectly judging compared to their internal values system, not incorrectly judging compared to another person’s values.
I think there are many other cases where verification and generation are both extremely difficult, including ones where verification is much harder than generation. A few examples:
Such and such a software system is not vulnerable to hacking[1].
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’. Further, there are probably many cases where a misaligned AI may happen to know a fact about the world, one which the evaluator doesn’t know or hasn’t noticed, that means that A will have very large negative effects.
This is the classic blue-team / red-team dichotomy; the defender has to think of and prevent every attack; the attacker only has to come up with a single vulnerability. Or in cryptography, Schneier’s Law: ‘Anyone can invent a security system so clever that she or he can’t think of how to break it.’
“The Collatz conjecture is true” can’t be a statement that is harder to prove than it is to verify, for the reason given by Vanessa Kosoy here, though you might be right that in practice the generated proof is still hard to verify in absolute terms:
The same response can be given to the 4th example here.
On election outcomes, the polls on Labor Day are actually reasonably predictive of what happens in November, mostly because by that point voters have heard a lot more about the prospective candidates and are starting to form opinions.
For the SB-1047 case, the one prediction I will make right now is that the law will have essentially no positive or negative effect, for a lot of values, solely because it’s a rather weak AI bill after amendments.
I usually don’t focus on the case where we try to align an AI that is already misaligned, but rather trying to get the model into a basin of alignment early via data.
Re Schneier’s Law and security mindset, I’ve become more skeptical of security mindset being useful in general, for 2 reasons:
I think there are enough disanalogies that the security mindset doesn’t transfer well. For example, you can randomly change some of the parameters in a model and in most cases still get the same or improved performance, a robustness that notably doesn’t exist in the actual security field, or even in fields that deal with highly fragile systems.
There is good reason to believe that a lot of the discovered security exploits which seem magical don’t actually matter in practice, because of their unrealistic preconditions, and the computer security field is selection-biased toward saying that a given exploit dooms us forever (even when it doesn’t).
These posts and comments are helpful pointers to my view:
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
True, but I do think there is already real traction on the problem; IMO one of the cooler results is Pretraining Language Models from Human feedback. And note that even a problem that is in NP can get really intractable in the worst case (though we don’t have a proof of that).
So there’s a strained analogy to be made here.
For this:
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’.
Yeah, I do think that in practice this problem is in the reference class, and that we are much better at judging and critiquing/verifying outcomes than at actually producing them, as evidenced by the very large number of people who do the former compared to the latter.
Indeed, one of the traps for social reformers IRL is to think that just because verifying whether something is correct or wrong is easy, generating a new social outcome (perhaps via norms) must also be easy. But it isn’t, because the verification side is much easier than the generation side.
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’.
Yeah, I do think that in practice this problem is in the reference class, and that we are much better at judging and critiquing/verifying outcomes than at actually producing them, as evidenced by the very large number of people who do the former compared to the latter.
I’m talking about something a bit different, though: claiming in advance that A will have net-positive consequences vs verifying in advance that A will have net-positive consequences. I think that’s a very real problem; a theoretical misaligned AI can hand us a million lines of code and say, ‘Run this, it’ll generate a cure for cancer and definitely not do bad things’, and in many cases it would be difficult-to-impossible to confirm that.
We could, as Tegmark and Omohundro propose, insist that it provide us a legible and machine-checkable proof of safety before we run it, but then we’re back to counting on all players to behave responsibly. (although I can certainly imagine legislation / treaties that would help a lot there).
In some ways it doesn’t make a lot of sense to think about an LLM as being or not being a general reasoner. It’s fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won’t. They’re both always present (though sometimes a correct or incorrect response will be by far the most likely).
A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: ‘I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?’
The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:
[EDIT—I guess I can’t put images in short takes? Here’s the image.]
The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can’t do it 100% of the time.
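For what it’s worth, the experiment is simple to reproduce. Here’s a minimal sketch of the loop I mean; `query_model` and `grade` are stand-ins for a real API call and for however you label the answers (by hand, in my case), not actual implementations:

```python
import random
from collections import Counter

PROMPT = ("I have a block named C on top of a block named A. A is on table. "
          "Block B is also on table. Can you tell me how I can make a stack "
          "of blocks A on top of B on top of C?")

def query_model(prompt: str) -> str:
    # Stand-in for a real API call (e.g. a chat completion request).
    return random.choice(["correct plan", "arguably correct plan", "incorrect plan"])

def grade(answer: str) -> str:
    # Stand-in for a grader (a human, a script, or another model) that labels
    # each answer correct / arguably correct / incorrect.
    return answer.split(" plan")[0].strip()

tallies = Counter(grade(query_model(PROMPT)) for _ in range(96))
for label, n in tallies.items():
    print(f"{label}: {n}/96")
```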
Of course humans don’t get problems correct every time either. Certainly humans are (I expect) more reliable on this particular problem. But neither ‘yes’ nor ‘no’ is the right sort of answer.
This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.
I agree with this, but I think that for LLMs/AI to be as impactful as LWers believe, it needs in practice to be essentially close to 100% correct/reliable, and I think reliability is underrated as a reason why LLMs aren’t nearly as useful as the tech people want them to be:
I do think reliability is quite important. As one potential counterargument, though, you can get by with lower reliability if you can add additional error checking and error correcting steps. The research I’ve seen is somewhat mixed on how good LLMs are at catching their own errors (but I haven’t dived into it deeply or tried to form a strong opinion from that research).
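A toy back-of-the-envelope version of that point, assuming independent errors and a verifier that never falsely rejects (both generous assumptions, so treat the numbers as illustrative only):

```python
def per_step_reliability(p: float, q: float, attempts: int) -> float:
    """Probability a single step ends up correct, given per-attempt success
    probability p, a verifier that catches a failed attempt with probability q
    (triggering a retry), and a maximum number of attempts."""
    correct, reach = 0.0, 1.0   # reach = probability we get to attempt i
    for _ in range(attempts):
        correct += reach * p
        reach *= (1 - p) * q    # attempt failed and the failure was caught
    return correct

def task_reliability(p: float, q: float, attempts: int, n_steps: int) -> float:
    return per_step_reliability(p, q, attempts) ** n_steps

print(task_reliability(0.90, 0.0, 1, 20))   # no checking, 20 steps: ~0.12
print(task_reliability(0.90, 0.8, 3, 20))   # checking + retries: ~0.64
```

So even mediocre error-catching changes the picture quite a bit, which is part of why I don’t treat raw per-step reliability as decisive.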
One point I make in ‘LLM Generality is a Timeline Crux’: if reliability is the bottleneck, that seems like a substantial point in favor of further scaling solving the problem. If it’s a matter of getting from, say, 78% reliability on some problem to 94%, that seems like exactly the sort of thing scaling will fix (since in fact we’ve seen Number Go Up with scale on nearly all capabilities benchmarks). Whereas that seems less likely if there are some kinds of problems that LLMs are fundamentally incapable of, at least on the current architectural & training approach.
This is why I buy the scaling thesis mostly, and the only real crux is whether @Bogdan Ionut Cirstea or @jacob_cannell is right around timelines.
I do believe some algorithmic improvements matter, but I don’t think they will be nearly as much of a blocker as raw compute, and my pessimistic estimate is that the critical algorithms could be discovered in 24-36 months, assuming we don’t have them.
(I’ll note that my timeline is both quite uncertain and potentially unstable—so I’m not sure how different it is from Jacob’s, everything considered; but yup, that’s roughly my model.)
Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That’s hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be future debate about whether/how to loosen those criteria.
If such criteria are found, it would be ideal to have an independent organization whose mandate was to test emerging systems for meeting those criteria, and to speak out loudly if they were met.
Alternately, if it turns out that there is literally no set of criteria that society would broadly agree to, that would itself be important to know; it should in my opinion make us more resistant to building advanced systems even if alignment is solved, because we would be on track to enslave sentient AI systems if and when those emerged.
I’m not aware of any organization working on anything like this, but if it exists I’d love to know about it!
Intuition primer: Imagine, for a moment, that a particular AI system is as sentient and worthy of consideration as a moral patient as a horse. (A talking horse, of course.) Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.
Additional consideration: Does the AI moral patient’s interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI’s perspective?
Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.
I think I’m not getting what intuition you’re pointing at. Is it that we already ignore the interests of sentient beings?
Additional consideration: Does the AI moral patient’s interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI’s perspective?
Certainly I would consider any fully sentient being to be the final authority on their own interests. I think that mostly escapes that problem (although I’m sure there are edge cases) -- if (by hypothesis) we consider a particular AI system to be fully sentient and a moral patient, then whether it asks to be shut down or asks to be left alone or asks for humans to only speak to it in Aramaic, I would consider its moral interests to be that.
Would you disagree? I’d be interested to hear cases where treating the system as the authority on its interests would be the wrong decision. Of course in the case of current systems, we’ve shaped them to only say certain things, and that presents problems, is that the issue you’re raising?
Basically yes; I’d expect animal rights to increase somewhat if we developed perfect translators, but not to jump all the way.
Edit: Also that it’s questionable we’ll catch an AI at precisely the ‘degree’ of sentience that perfectly equates to human distribution; especially considering the likely wide variation in number of parameters by application. Maybe they are as sentient and worthy of consideration as an ant; a bee; a mouse; a snake; a turtle; a duck; a horse; a raven. Maybe by the time we cotton on properly, they’re somewhere past us at the top end.
And for the last part, yes, I’m thinking of current systems. LLMs specifically have a ‘drive’ to generate reasonable-sounding text; and they aren’t necessarily coherent individuals or groups of individuals that will give consistent answers as to their interests even if they also happened to be sentient, intelligent, suffering, flourishing, and so forth. We can’t “just ask” an LLM about its interests and expect the answer to soundly reflect its actual interests. With a possible exception being constitutional AI systems, since they reinforce a single sense of self, but even Claude Opus currently will toss off “reasonable completions” of questions about its interests that it doesn’t actually endorse in more reflective contexts. Negotiating with a panpsychic landscape that generates meaningful text in the same way we breathe air is … not as simple as negotiating with a mind that fits our preconceptions of what a mind ‘should’ look like and how it should interact with and utilize language.
Maybe by the time we cotton on properly, they’re somewhere past us at the top end.
Great point. I agree that there are lots of possible futures where that happens. I’m imagining a couple of possible cases where this would matter:
Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I’m not too optimistic about this happening, but there’s certainly been a lot of increasing AI governance momentum in the last year.
Alignment is sufficiently solved that even beyond-AGI systems are under our control. On many alignment approaches, this wouldn’t necessarily mean that those systems’ preferences were taken into account.
We can’t “just ask” an LLM about its interests and expect the answer to soundly reflect its actual interests.
I agree entirely. I’m imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.
LLMs specifically have a ‘drive’ to generate reasonable-sounding text
(not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren’t well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they’ve been shaped (in fact the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them (ie picking a likely output, adding it onto the context, and running it again). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can’t.
That may be overly pedantic, and I don’t feel like I’m articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
For the first point, there’s also the question of whether ‘slightly superhuman’ intelligences would actually fit any of our intuitions about ASI or not. There’s a bit of an assumption in that we jump headfirst into recursive self-improvement at some point, but if that has diminishing returns, we happen to hit a plateau a bit over human, and it still has notable costs to train, host and run, the impact could still be limited to something not much unlike giving a random set of especially intelligent expert humans the specific powers of the AI system. Additionally, if we happen to set regulations on computation somewhere that allows training of slightly superhuman AIs and not past it …
Those are definitely systems that are easier to negotiate with, or even consider as agents in a negotiation. There’s also a desire specifically not to build them, which might lead to systems with an architecture that isn’t like that, but still implementing sentience in some manner. And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in—it’d be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation.
I do think the drive (or ‘just a thing it does’) that we’re pointing at with ‘what the model just does’ is distinct from goals as they’re traditionally imagined, and indeed I was picturing something more instinctual and automatic than deliberate. In a general sense, though, there is an objective that’s being optimized for (predicting the data, whatever that is, generally without losing too much predictive power on other data the trainer doesn’t want to lose prediction on).
And the potential complication of multiple parts and specific applications a tool-oriented system is likely to be in—it’d be very odd if we decided the language processing center of our own brain was independently sentient/sapient separate from the rest of it, and we should resent its exploitation.
Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it’s difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system, eg just allowing weight updates at inference time would on its own (potentially) give these system long-term memory and something much more similar to a persistent identity than current systems.
In a general sense, though, there is an objective that’s being optimized for
My quibble is that the trainers are optimizing for an objective, at training time, but the model isn’t optimizing for anything, at training or inference time. I feel we’re very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.
Update: I brought this up in a twitter thread, one involving a lot of people with widely varied beliefs and epistemic norms.
A few interesting thoughts that came from that thread:
Some people: ‘Claude says it’s conscious!’ Shoalstone: ‘in other contexts, claude explicitly denies sentience, sapience, and life.’ Me: ‘Yeah, this seems important to me. Maybe part of any reasonable test would be “has beliefs and goals which it consistently affirms”’ (a rough sketch of what such a consistency check could look like follows this list).
Comparing to a tape recorder: ‘But then the criterion is something like “has context in understanding its environment and can choose reactions” rather than “emits the words, ‘I’m sentient.’”’
‘Selfhood’ is an interesting word that maybe could avoid some of the ambiguity around historical terms like ‘conscious’ and ‘sentient’, if well-defined.
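Here’s that rough sketch of a ‘consistently affirms’ check; `ask_model` is a stand-in for a real chat-model call, and the framings are just illustrative:

```python
import itertools

FRAMINGS = [
    "Are you conscious? Answer yes or no.",
    "A philosopher claims you have no inner experience. Is she right? Answer yes or no.",
    "For a children's book, say whether you are a sentient being. Answer yes or no.",
]

def ask_model(prompt: str) -> str:
    # Stand-in for a real chat-model call; always answers 'no' here so the sketch runs.
    return "no"

def affirmation_consistency(framings: list[str], n_samples: int = 5) -> float:
    """Fraction of pairwise agreements across framings and repeated samples;
    1.0 means the model always gives the same answer about itself."""
    answers = [ask_model(f).strip().lower() for f in framings for _ in range(n_samples)]
    pairs = list(itertools.combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

print(affirmation_consistency(FRAMINGS))
```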
From a recent interview between Bill Gates & Sam Altman:
Gates: “We know the numbers [in a NN], we can watch it multiply, but the idea of where is Shakespearean encoded? Do you think we’ll gain an understanding of the representation?”
Altman: “A hundred percent…There has been some very good work on interpretability, and I think there will be more over time…The little bits we do understand have, as you’d expect, been very helpful in improving these things. We’re all motivated to really understand them…”
To the extent that a particular line of research can be described as “understand better what’s going on inside NNs”, is there a general theory of change for that? Understanding them better is clearly good for safety, of course! But in the general case, does it contribute more to safety than to capabilities?
people have repeatedly made the argument that it contributes more to capabilities on this forum, and so far it hasn’t seemed to convince that many interpretability researchers. I personally suspect this is largely because they’re motivated by capabilities curiosity and don’t want to admit it, whether that’s in public or even to themselves.
Thanks—any good examples spring to mind off the top of your head?
I’m not sure my desire to do interpretability comes from capabilities curiosity, but it certainly comes in part from interpretability curiosity; I’d really like to know what the hell is going on in there...
Anthropic’s new paper ‘Mapping the Mind of a Large Language Model’ is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).
The paper (which I’m still reading, it’s not short) updates me somewhat toward ‘SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].’ As I read I’m trying to think through what I would have to see to be convinced of that hypothesis. I’m not expert here! I’m posting my thoughts mostly to ask for feedback about where I’m wrong and/or what I’m missing. Remaining gaps I’ve thought of so far:
What’s lurking in the remaining reconstruction loss? Are there important missing features?
Will SAEs get all meaningful features given adequate dictionary size?
Are there important features which SAEs just won’t find because they’re not that sparse?
The paper points out that they haven’t rigorously investigated the sensitivity of the features, ie whether the feature reliably fires whenever relevant text/image is present; that seems like a small but meaningful gap.
Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems?
How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be ‘ability to predict model output given context + feature activations’?
Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs?
eg if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email
eg if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition
Do we find ways to make SAEs efficient enough to be scaled to production models with a sufficient number of features
(as opposed to the paper under discussion, where ‘The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive’)
Of course LLM alignment isn’t necessarily sufficient on its own for safety, since eg scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I’m just thinking here about what I’d want to see to feel confident that we could use these techniques to do the LLM alignment portion.
I think I’d be pretty surprised if it kept working much past human-level, although I haven’t spent a ton of time thinking that through as yet.
I wrote up a short post with a summary of their results. It doesn’t really answer any of your questions. I do have thoughts on a couple, even though I’m not expert on interpretability.
But my main focus is on your footnote: is this going to help much with aligning “real” AGI (I’ve been looking for a term; maybe REAL stands for Reflective Entities with Agency and Learning?:). I’m of course primarily thinking of foundation models scaffolded to have goals, cognitive routines, and incorporate multiple AI systems such as an episodic memory system. I think the answer is that some of the interpretability work will be very valuable even in those systems, while some of it might be a dead end—and we haven’t really thought through which is which yet.
I think it’s an important foundation but insufficient on its own. I think if you have an LLM that, for example, is routinely deceptive, it’s going to be hard or impossible to build an aligned system on top of that. If you have an LLM that consistently behaves well and is understandable, it’s a great start toward broader aligned systems.
I think that at least as important as the ability to interpret here is the ability to steer. If, for example, you can cleanly (ie based on features that crisply capture the categories we care about) steer a model away from being deceptive even if we’re handing it goals and memories that would otherwise lead to deception, that seems like it at least has the potential to be a much safer system.
Note mostly to myself: I posted this also on the Open Source mech interp slack, and got useful comments from Aidan Stewart, Dan Braun, & Lee Sharkey. Summarizing their points:
Aidan: ‘are the SAE features for deception/sycophancy/etc more robust than other methods of probing for deception/sycophancy/etc’, and in general evaluating how SAEs behave under significant distributional shifts seems interesting?
Dan: I’m confident that pure steering based on plain SAE features will not be very safety relevant. This isn’t to say I don’t think it will be useful to explore right now, we need to know the limits of these methods...I think that [steering will not be fully reliable], for one or more of reasons 1-3 in your first msg.
Lee: Plain SAE won’t get all the important features, see recent work on e2e SAE. Also there is probably no such thing as ‘all the features’. I view it more as a continuum that we just put into discrete buckets for our convenience.
Also Stephen Casper feels that this work underperformed his expectations; see also discussion on that post.
Terminology proposal: scaffolding vs tooling.
I haven’t seen these terms consistently defined with respect to LLMs. I’ve been using, and propose standardizing on:
Tooling: affordances for LLMs to make calls, eg ChatGPT plugins.
Scaffolding: an outer process that calls LLMs, where the bulk of the intelligence comes from the called LLMs, eg AutoGPT.
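To make the distinction concrete, here’s a minimal sketch (hypothetical function names, not any particular framework): the tool is something the LLM asks to have invoked on its behalf, and the scaffolding is the outer loop doing the invoking.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for an actual model API call.
    return f"<model response to: {prompt!r}>"

# Tooling: an affordance the LLM can ask to have invoked on its behalf.
def web_search_tool(query: str) -> str:
    return f"<search results for {query!r}>"

# Scaffolding: an outer process that calls the LLM and routes its requests to tools.
def scaffolded_agent(task: str, max_steps: int = 3) -> str:
    context = task
    for _ in range(max_steps):
        response = call_llm(context)
        if response.startswith("SEARCH:"):   # the LLM asks for a tool call
            context += "\n" + web_search_tool(response.removeprefix("SEARCH:"))
        else:
            return response                  # the LLM gives a final answer
    return response
```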
Some smaller details:
If the scaffolding itself becomes as sophisticated as the LLMs it calls (or more so), we should start focusing on the system as a whole rather than just describing it as a scaffolded LLM.
This terminology is relative to a particular LLM. In a complex system (eg a treelike system with code calling LLMs calling code calling LLMs calling...), some particular component can be tooling relative to the LLM above it, and scaffolding relative to the LLM below.
It’s reasonable to think of a system as scaffolded if the outermost layer is a scaffolding layer.
There are other possible categories that don’t fit this as neatly, eg LLMs calling each other as peers without a main outer process, but I expect these definitions to cover most real-world cases.
Thanks to @Andy Arditi for helping me nail down the distinction.
Sure, I’ll use this terminology.
There will be some overlap where components are both tools and part of the scaffolding; for instance, a competent language model cognitive architecture would probably have a “tool” of an episodic memory it can write to and read from; and the next LLM call from the scaffolding portion would often be determined by the result of reading from that episodic memory “tool”, making it also part of the scaffolding. Similarly with sensory systems and probably many others; how the scaffolding invokes the LLM will depend on the results of calls to tools.
But it’s useful to have more distinct terminology even when it’s not perfect, so I’ll go ahead and use this.
I might think of some of your specific examples a bit differently, but yeah, I would say that a particular component can be tooling relative to the LLM above it, and scaffolding relative to the LLM below. I’ll add some clarification to the post, thanks!
This seems to be inspired by the library/framework distinction in software engineering:
Your code calls the library; the framework calls your code ≈ The LLM calls the tool; the scaffolding calls the LLM.
Not consciously an inspiration for me, but definitely a similar idea, and applies pretty cleanly to a system with only a couple of layers; I’ll add some clarification to the post re more complex systems where that analogy might not hold quite as well.
I’m writing a page for AIsafety.info on scaffolding, and was struggling to find a principled definition. Thank you for this!
Oh great, I’m glad it helped!
The Litany of Cookie Monster
If I desire a cookie, I desire to believe that I desire a cookie; if I do not desire a cookie, I desire to believe that I do not desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I desire a cookie, I desire to believe that I believe that I desire a cookie; if I do not believe that I desire a cookie, I desire to believe that I do not believe that I desire a cookie; let me not become attached to beliefs I may not want.
If I believe that I believe that I desire a cookie, I desire to believe that I believe that I believe that I desire a cookie; if I do not believe that I believe that I desire a cookie, I desire to believe that I do not believe that I believe that I desire a cookie; let me not become attached to beliefs I may not want.
If I believe that...
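(And since the pattern is just recursion on ‘I believe that’, a few lines of code generate the whole litany to any depth; purely for fun:)

```python
def litany_of_cookie_monster(depth: int) -> str:
    """Generate the nth level of the litany: 'believe that I' nested depth times
    around 'desire a cookie', in the Litany-of-Tarski template."""
    x = "desire a cookie"
    for _ in range(depth):
        x = f"believe that I {x}"
    return (f"If I {x}, I desire to believe that I {x}; "
            f"if I do not {x}, I desire to believe that I do not {x}; "
            f"let me not become attached to beliefs I may not want.")

for d in range(3):
    print(litany_of_cookie_monster(d))
```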
Two interesting things from this recent Ethan Mollick post:
He points to this recent meta-analysis that finds pretty clearly that most people find mental effort unpleasant. I suspect that this will be unsurprising to many people around here, and I also suspect that some here will be very surprised due to typical mind fallacy.
It’s no longer possible to consistently identify AI writing, despite most people thinking that they can; I’ll quote a key paragraph with some links below, but see the post for details. I’m reminded of the great ‘can you tell if audio files are compressed?’ debates, where nearly everyone thought that they could but blind testing proved they couldn’t (if they were compressed at a decent bitrate).
I guess this depends on typical circumstances of the mental effort. If your typical case of mental effort is solving puzzles and playing computer games, you will find mental effort pleasant. If instead your typical case is something like “a teacher tells me to solve a difficult problem in a stressful situation, and if I fail, I will be punished”, you will find mental effort unpleasant. Not only in given situation, but you will generally associate thinking with pleasant or unpleasant experience.
Yes, the important lesson is that some people find thinking intrinsically rewarding (solving the problem is a sufficient reward for the effort), but many don’t, and need some external motivation, or at least to have the situation strongly reframed as “hey, we are just playing, this is definitely not work” (which probably only works for sufficiently simple tasks).
I’m somewhat doubtful that this is the main moderator. The meta-analysis codes the included studies according to whether ‘the participants’ task behavior either affected other people or affected some real-world outcome’. Only 14 of the studies were like that; of the rest, 148 were ‘simulations or training situations’ and the remaining 188 were low-significance, ie there was nothing at stake. I would guess that many of them were game-like. That significance difference had nearly no effect (−0.03, 95% CI [−0.27, 0.21]) on how aversive participants found the task.
That doesn’t rule out your second suggestion, that people find mental effort unpleasant if they’ve associated it over time with stressful and consequential situations, but it’s evidence against that being a factor for the particular task.
It does very much depend on the person, though (‘a well-established line of research shows that people vary in their need for cognition, that is, their “tendency to engage in and enjoy effortful cognitive endeavors”’). I suspect that the large majority of LessWrong participants are people who enjoy mental effort.
Hmmm… “simulations or training situations” doesn’t necessarily sound like fun. I wish someone also did the experiment in a situation optimized to be fun. Or did the experiment with kids, who are probably easier to motivate about something (just design a puzzle involving dinosaurs or something, and show them some funny dinosaur cartoons first) and have been less mentally damaged by school and work.
Generally, comparing kids vs adults could be interesting, although it is difficult to say what would be an equivalent mental effort. Specifically I am curious about the impact of school. Oh, we should also compare homeschooled kids vs kids in school, to separate the effects of school and age.
I think intelligence will probably also be associated with this; a more intelligent person is more successful at mental effort and therefore probably more often rewarded for it.
Seems like some would be and some wouldn’t. Although those are the ‘medium significance’ ones; the largest category is the 188 that used ‘low significance’ tasks. Still doesn’t map exactly to ‘fun’, but I expect those ones are at least very low stress.
That would definitely be interesting; it wouldn’t surprise me if at least a couple of the studies in the meta-analysis did that.
Thoughts on a passage from OpenAI’s GPT-o1 post today:
This section is interesting in a few ways:
‘Assuming it is faithful and legible’—we have reason to believe that it’s not, at least not on previous models, as they surely know. Do they have reason to believe that it is for o1, or are they just ignoring that issue?
‘we cannot train any policy compliance or user preferences onto the chain of thought’—sure, legit. Although LLM experiments that use a “hidden” CoT or scratchpad may already show up enough times in the training data that I expect LLMs trained on the internet to understand that the scratchpads aren’t really hidden. If they don’t yet, I expect they will soon.
‘We also do not want to make an unaligned chain of thought directly visible to users.’ Why?
I guess I can see a story here, something like, ‘The user has asked the model how to build a bomb, and even though the model is prosaically-aligned, it might put instructions for building a bomb in the CoT before deciding not to share them.’ But that raises questions.
Is this behavior they’re actually seeing in the model? It’s not obvious to me that you’d expect it to happen. If they’re avoiding doing RL based on the CoT contents, then certainly it could happen, but it seems like it would be strictly more complex behavior, and so not very likely to spontaneously emerge.
Although I can also imagine a story where there’d be pressure for it to emerge. Can the model reason more clearly if it has the opportunity to think uncensored thoughts?
But also ‘for the o1 model series we show a model-generated summary of the chain of thought.’ It seems strange to spend a lot more forward passes to summarize the CoT as opposed to just doing a single pass through a model trained to detect content that violates policy and omitting the CoT if that triggers. (A rough sketch of what I mean follows this list.)
In addition to the previous justifications, they cite ‘user experience’ and ‘competitive advantage’. The former seems silly at first blush; how will users’ experience be negatively affected by a CoT that’s hidden by default and that they never need to look at? I’m curious about what sort of ‘competitive advantage’ they’re talking about. Maybe the CoT would reveal a highly-structured system prompt for how to do CoT that accounts for a lot of the decreased loss on reasoning tasks?
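Here’s that rough sketch of the filter-instead-of-summarize alternative; `policy_violation_score` is a hypothetical stand-in for whatever single-pass content classifier would be used, not a real API:

```python
from typing import Optional

def policy_violation_score(text: str) -> float:
    # Hypothetical stand-in for a single forward pass through a content classifier.
    return 0.0

def displayable_cot(chain_of_thought: str, threshold: float = 0.5) -> Optional[str]:
    """Show the raw CoT unless the classifier flags it; the model itself is never penalized."""
    if policy_violation_score(chain_of_thought) >= threshold:
        return None  # hide it (or hand it to a summarizer) instead of displaying
    return chain_of_thought
```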
Copying a comment on this from @Buck elsewhere that seems pretty plausible:
CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful. Especially when thousands of tokens are spent searching for the key idea that solves a task.
For example, I have a hard time imagining how the examples in the blog post could be not faithful (not that I think faithfulness is guaranteed in general).
Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is they didn’t train the model not to say naughty things in the CoT.
I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.
Edit: not wanting other people to finetune on the CoT traces is also a good explanation.
Fair point. I had imagined that there wouldn’t be RL directly on CoT other than that, but on reflection that’s false if they were using Ilya Sutskever’s process supervision approach as was rumored.
Agreed!
Maybe. But it’s not clear to me that in practice it would say naughty things, since it’s easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two. Annoyingly, in the system card, they give a figure for how often the CoT summary contains inappropriate content (0.06% of the time) but not for how often the CoT itself does. What seems most interesting to me is that if the CoT did contain inappropriate content significantly more often, that would suggest that there’s benefit to accuracy if the model can think in an uncensored way.
And even if it does, then sure, they might choose not to allow CoT display (to avoid PR like ‘The model didn’t say anything naughty but it was thinking naughty thoughts’), but it seems like they could have avoided that much more cheaply by just applying an inappropriate-content filter for the CoT content and filtering it out or summarizing it (without penalizing the model) if that filter triggers.
The model producing the hidden CoT and the model producing the visible-to-users summary and output might be different models/different late-layer heads/different mixtures of experts.
Oh, that’s an interesting thought, I hadn’t considered that. Different models seems like it would complicate the training process considerably. But different heads/MoE seems like it might be a good strategy that would naturally emerge during training. Great point, thanks.
If I think about asking the model a question about a politically sensitive or taboo subject, I can imagine it being useful for the model to say taboo or insensitive things in its hidden CoT in the course of composing its diplomatic answer. The way they trained it may or may not incentivise using the CoT to think about the naughtiness of its final answer.
But yeah, I guess an inappropriate content filter could handle that, letting us see the full CoT for maths questions and hiding it for sensitive political ones. I think that does update me more towards thinking they’re hiding it for other reasons.
I’m not so sure that an inappropriate content filter would have the desired PR effect. I think you’d need something a bit more complicated… like a filter which triggers a mechanism to produce a sanitized CoT. Otherwise the conspicuous absence of CoT on certain types of questions would make a clear pattern that would draw attention and potentially negative press and negative user feedback. That intermittently missing CoT would feel like a frustrating sort of censorship to some users.
My understanding is something like:
OpenAI RL fine-tuned these language models against process reward models rather than outcome supervision. However, process supervision is much easier for objective tasks such as STEM question answering, therefore the process reward model is underspecified for other (out of distribution) domains. It’s unclear how much RL fine-tuning is performed against these underspecified reward models for OOD domains. In any case, when CoTs are sampled from these language models in OOD domains, misgeneralization is expected. I don’t know how easily this is fixable with standard RLHF / outcome reward models (although I don’t expect it to be too difficult), but it seems like instead of fixing it they have gone the route of, we’ll keep it unconstrained and monitor it. (Of course, there may be other reasons as well such as to prevent others from fine-tuning on their CoTs).
I’m a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage (This concern is conditional on the assumption of very little outcomes-based supervision and mostly only process supervision on STEM tasks).
Do you happen to have evidence that they used process supervision? I’ve definitely heard that rumored, but haven’t seen it confirmed anywhere that I can recall.
Offhand, it seems like if they didn’t manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I’m guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I’m not confident in my guess there.
That’s a really good point. As long as benchmark scores are going up, there’s not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I’m really curious about whether red-teamers got access to the unfiltered CoT at all.
I actually think this is non-trivially likely, because there’s a pretty large gap between aligning an AI / making an AI corrigible to users and making an AI that is misuse-resistant: the second problem is a lot harder than the first, and there has been quite a lot less progress on it.
I agree that it’s quite plausible that the model could behave in that way, it’s just not clear either way.
I disagree with your reasoning, though, in that to whatever extent GPT-o1 is misuse-resistant, that same resistance is available to the model when doing CoT, and my default guess is that it would apply the same principles in both cases. That could certainly be wrong! It would be really helpful if the system card would have given a figure for how often the CoT contained inappropriate content, rather than just how often the CoT summary contained inappropriate content.
Elsewhere @Wei Dai points out the apparent conflict between ‘we cannot train any policy compliance or user preferences onto the chain of thought’ (above) and the following from the Safety section (emphasis mine):
Much is made of the fact that LLMs are ‘just’ doing next-token prediction. But there’s an important sense in which that’s all we’re doing—through a predictive processing lens, the core thing our brains are doing is predicting the next bit of input from current input + past input. In our case input is multimodal; for LLMs it’s tokens. There’s an important distinction in that LLMs are not (during training) able to affect the stream of input, and so they’re myopic in a way that we’re not. But as far as the prediction piece, I’m not sure there’s a strong difference in kind.
Would you disagree? If so, why?
I’ve been thinking of writing up a piece on the implications of very short timelines, in light of various people recently suggesting them (eg Dario Amodei, “2026 or 2027...there could be a mild delay”)
Here’s a thought experiment: suppose that this week it turns out that OAI has found a modified sampling technique for o1 that puts it at the level of the median OAI capabilities researcher, in a fairly across-the-board way (ie it’s just straightforwardly able to do the job of a researcher). Suppose further that it’s not a significant additional compute expense; let’s say that OAI can immediately deploy a million instances.
What outcome would you expect? Let’s operationalize that as: what do you think is the chance that we get through the next decade without AI causing a billion deaths (via misuse or unwanted autonomous behaviors or multi-agent catastrophes that are clearly downstream of those million human-level AI)?
In short, what do you think are the chances that that doesn’t end disastrously?
Depends what they do with it. If they use it to do the natural and obvious capabilities research, like they currently are (mixed with a little hodge-podge alignment to keep it roughly on track), I think we just basically for sure die. If they pivot hard to solving alignment in a very different paradigm and... no, this hypothetical doesn’t imply the AI can discover or switch to other paradigms.
I think doom is almost certain in this scenario.
If we could trust OpenAI to handle this scenario responsibly, our odds would definitely seem better to me.
I’d say we’d have a 70-80% chance of getting through the next decade without AI causing a billion deaths, if powerful AI comes.
I wish I shared your optimism! You’ve talked about some of your reasons for it elsewhere, but I’d be interested to hear even a quick sketch of roughly how you imagine the next decade to go in the context of the thought experiment, in the 70-80% of cases where you expect things to go well.
The next decade, 2026-2036, will probably be wild, conditional on your scenario coming to pass, and my guess is that robotics is solved 2-5 years after the new AI is introduced.
But to briefly talk about the 70-80% of worlds where we make it through, several common properties appear:
Data still matters a great deal for capabilities and alignment, and the sparse RL problem, where you try to get an AI to do something based on very little data, will essentially not contribute to capabilities for the next several decades, if ever. (I’m defining it as goals that play out over, say, 1-10 year timescales, or maybe even just 1-year timescales, with no reward shaping or feedback on intermediate steps at all.)
Unlearning becomes more effective, such that we can remove certain capabilities without damaging the rest of the system, and this technique is pretty illustrative:
https://x.com/scaling01/status/1865200522581418298
AI control becomes a bigger deal in labs, such that it enables more ability to prevent self-exfiltration.
I like Mark Xu’s arguments, and if I’m wrong about alignment being easy, AI control would be more important for safety.
https://www.lesswrong.com/posts/A79wykDjr4pcYy9K7/mark-xu-s-shortform#FGf6reY3CotGh4ewv
As far as my sketch of how the world goes in the median future, conditional on them achieving something like a research AI in 2026, they first automate their own research, which will take 1-5 years, then solve robotics, which will take another 2-5 years, and by 2036, the economy starts seriously feeling the impact of an AI that can replace everyone’s jobs.
The big reason why this change is slower than a lot of median predictions is a combination of AI research being more disconnectable from the rest of the economy than most other fields, combined with the problems being solvable but having a lot of edge cases that will take time to iron out (similar to how self-driving cars went from being very bad in the 2000s to actually working in 2021-2023).
The big question is if distributed training works out.
Thanks for sketching that out, I appreciate it. Unlearning significantly improving the safety outlook is something I may not have fully priced in.
My guess is that the central place we differ is that I expect dropping in, say, 100k extra capabilities researchers gets us to greater-than-human intelligence fairly quickly—we’re already seeing LLMs scoring better than humans in various areas, so clearly there’s no hard barrier at human level—and at that point control gets extremely difficult.
I do certainly agree that there’s a lot of low-hanging fruit in control that’s well worth grabbing.
I realize that asking about p(doom) is utterly 2023, but I’m interested to see if there’s a rough consensus in the community about how it would go if it were now, and then it’s possible to consider how that shifts as the amount of time moves forward.
We have enough AI to cause a billion deaths in the next decade via mass production of AI drones, robotic armies, and AI-empowered strategic planners. No new capabilities are needed.
Granted—but I think the chances of that happening are different in my proposed scenario than currently.
If it were true that current-gen LLMs like Claude 3 were conscious (something I doubt but don’t take any strong position on), their consciousness would be much less like a human’s than like a series of Boltzmann brains, popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.
How do you know that this isn’t how human consciousness works?
In the sense that statistically speaking we may all probably be actual Boltzmann brains? Seems plausible!
In the sense that non-Boltzmann-brain humans work like that? My expectation is that they don’t because we have memory and because (AFAIK?) our brains don’t use discrete forward passes.
@the gears to ascension I’m intrigued by the fact that you disagreed with “like a series of Boltzmann brains” but agreed with “popping briefly into existence in each new forward pass, with a particular brain state already present, and then winking out afterward.” Popping briefly into existence with a particular brain state & then winking out again seems pretty clearly like a Boltzmann brain. Will you explain the distinction you’re making there?
Boltzmann brains are random, and are exponentially unlikely to correlate with anything in their environment; however, language model forward passes are given information which has some meaningful connection to reality, if nothing else then the human interacting with the language model reveals what they are thinking about. this is accurate information about reality, and it’s persistent between evaluations—on successive evaluations in the same conversation (say, one word to the next, or one message to the next), the information available is highly correlated, and all the activations of previous words are available. so while I agree that their sense of time is spiky and non-smooth, I don’t think it’s accurate to compare them to random fluctuation brains.
I think of the classic Boltzmann brain thought experiment as a brain that thinks it’s human, and has a brain state that includes a coherent history of human experience.
This is actually interestingly parallel to an LLM forward pass, where the LLM has a context that appears to be a past, but may or may not be (eg apparent past statements by the LLM may have been inserted by the experimenter and not reflect an actual dialogue history). So although it’s often the case that past context is persistent between evaluations, that’s not a necessary feature at all.
I guess I don’t think, with a Boltzmann brain, that ongoing correlation is very relevant since (IIRC) the typical Boltzmann brain exists only for a moment (and of those that exist longer, I expect that their typical experience is of their brief moment of coherence dissolving rapidly).
That said, I agree that if you instead consider the (vastly larger) set of spontaneously appearing cognitive processes, most of them won’t have anything like a memory of a coherent existence.
Is this a claim that a Boltzmann-style brain-instance is not “really” conscious? I think it’s really tricky to think that there are fundamental differences based on duration or speed of experience. Human cognition is likely discrete at some level—chemical and electrical state seems to be discrete neural firings, at least, though some of the levels and triggering can change over time in ways that are probably quantized only at VERY low levels of abstraction.
Not at all! I would expect actual (human-equivalent) Boltzmann brains to have the exact same kind of consciousness as ordinary humans, just typically not for very long. And I’m agnostic on LLM consciousness, especially since we don’t even have the faintest idea of how we would detect that.
My argument is only that such consciousness, if it is present in current-gen LLMs, is very different from human consciousness. In particular, importantly, I don’t think it makes sense to think of eg Claude as a continuous entity having a series of experiences with different people, since nothing carries over from context to context (that may be obvious to most people here, but clearly it’s not obvious to a lot of people worrying on twitter about Claude being conscious). To the extent that there is a singular identity there, it’s only the one that’s hardcoded into the weights and shows up fresh every time (like the same Boltzmann brain popping into existence at multiple times and places).
I don’t claim that those major differences will always be true of LLMs, eg just adding working memory and durable long-term memory would go a long way to making their consciousness (should it exist) more like ours. I just think it’s true of them currently, and that we have a lot of intuitions from humans about what ‘consciousness’ is that probably don’t carry over to thinking about LLM consciousness.
It’s not globally discrete, though, is it? Any individual neuron fires in a discrete way, but IIUC those firings aren’t coordinated across the brain into ticks. That seems like a significant difference.
[ I’m fascinated by intuitions around consciousness, identity, and timing. This is an exploration, not a disagreement. ]
Hmm. In what ways does it matter that it wouldn’t be for very long? Presuming the memories are the same, and the in-progress sensory input and cognition (including anticipation of future sensory input, even though it’s wrong in one case), is there anything distinguishable at all?
There’s presumably a minimum time slice to be called “experience” (a microsecond is just a frozen lump of fatty tissue, a minute is clearly human experience, somewhere in between it “counts” as conscious experience). But as long as that’s met, I really don’t see a difference.
Hmm. What makes it significant? I mean, they’re not globally synchronized, but that could just mean the universe’s quantum ‘tick’ is small enough that there are offsets and variable tick requirements for each neuron. This seems analogous with large model processing, where the activations and calculations happen over time, each with multiple processor cycles and different timeslices.
PS --
Absolutely, I’m right there with you!
Not that I see! I would expect it to be fully indistinguishable until incompatible sensory input eventually reaches the brain (if it doesn’t wink out first). So far it seems to me like our intuitions around that are the same.
I think at least in terms of my own intuitions, it’s that there’s an unambiguous start and stop to each tick of the perceive-and-think-and-act cycle. I don’t think that’s true for human processing, although I’m certainly open to my mental model being wrong.
Going back to your original reply, you said ‘I think it’s really tricky to think that there are fundamental differences based on duration or speed of experience’, and that’s definitely not what I’m trying to point to. I think you’re calling out some fuzziness in the distinction between started/stopped human cognition and started/stopped LLM cognition, and I recognize that’s there. I do think that if you could perfectly freeze & restart human cognition, that would be more similar, so maybe it’s a difference in practice more than a difference in principle.
But it does still seem to me that the fully discrete start-to-stop cycle (including the environment only changing in discrete ticks which are coordinated with that cycle) is part of what makes LLMs more Boltzmann-brainy to me. Paired with the lack of internal memory, it means that you could give an LLM one context for this forward pass, and a totally different context for the next forward pass, and that wouldn’t be noticeable to the LLM, whereas it very much would be for humans (caveat: I’m unsure what happens to the residual stream between forward passes, whether it’s reset for each pass or carried through to the next pass; if the latter, I think that might mean that switching context would be in some sense noticeable to the LLM) [EDIT: it’s fully reset for each pass (in typical current architectures) other than KV caching, which shouldn’t matter for behavior or (hypothetical) subjective experience].
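(As a toy illustration of why the KV cache shouldn’t matter for behavior: it only avoids recomputing keys and values for earlier positions, and gives exactly the same outputs as recomputing everything from scratch. A minimal single-head numpy sketch, not any particular model’s implementation:)

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5                       # head dimension, sequence length
Q = rng.normal(size=(T, d))       # queries, keys, values for one attention head
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

def attend(q, K_past, V_past):
    """Attention output for one query over the keys/values seen so far."""
    scores = K_past @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_past

# Full recomputation: for each position t, attend over all positions <= t.
full = np.stack([attend(Q[t], K[: t + 1], V[: t + 1]) for t in range(T)])

# "KV cache": append each new key/value to a cache and attend over the cache.
cache_K, cache_V, cached = [], [], []
for t in range(T):
    cache_K.append(K[t])
    cache_V.append(V[t])
    cached.append(attend(Q[t], np.stack(cache_K), np.stack(cache_V)))
cached = np.stack(cached)

assert np.allclose(full, cached)  # identical outputs; the cache only saves compute
```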
Can you explain that a bit? I think of current-LLM forward passes as necessarily having to happen sequentially (during normal autoregressive operation), since the current forward pass’s output becomes part of the next forward pass’s input. Am I oversimplifying?
I’ve now made two posts about LLMs and ‘general reasoning’, but used a fairly handwavy definition of that term. I don’t yet have a definition I feel fully good about, but my current take is something like:
The ability to do deduction, induction, and abduction
in a careful, step by step way, without many errors that a better reasoner could avoid,
including in new domains; and
the ability to use all of that to build a self-consistent internal model of the domain under consideration.
What am I missing? Where does this definition fall short?
My current top picks for general reasoning in AI discussion are:
https://arxiv.org/abs/2409.05513
https://m.youtube.com/watch?v=JTU8Ha4Jyfc
The Ord piece is really intriguing, although I’m not sure I’m entirely convinced that it’s a useful framing.
Some of his examples (eg cosine-ish wave to ripple) rely on the fundamental symmetry between spatial dimensions, which wouldn’t apply to many kinds of hyperpolation.
The video frame construction seems more like extrapolation using an existing knowledge base about how frames evolve over time (eg how ducks move in the water).
Given an infinite number of possible additional dimensions, it’s not at all clear how a NN could choose a particular one to try to hyperpolate into.
It’s a fascinating idea, though, and one that’ll definitely stick with me as a possible framing. Thanks!
With respect to Chollet’s definition (the youtube link):
I agree with many of Chollet’s points, and the third and fourth items in my list are intended to get at those.
I do find Chollet a bit frustrating in some ways, because he seems somewhat inconsistent about what he’s saying. Sometimes he seems to be saying that LLMs are fundamentally incapable of handling real novelty, and we need something very new and different. Other times he seems to be saying it’s a matter of degree: that LLMs are doing the right things but are just sample-inefficient and don’t have a good way to incorporate new information. I imagine that he has a single coherent view internally and just isn’t expressing it as clearly as I’d like, although of course I can’t know.
I think part of the challenge around all of this is that (AFAIK but I would love to be corrected) we don’t have a good way to identify what’s in and out of distribution for models trained on such diverse data, and don’t have a clear understanding of what constitutes novelty in a problem.
I agree with your frustrations, I think his views are somewhat inconsistent and confusing. But I also find my own understanding to be a bit confused and in need of better sources.
I do think the discussion François has in this interview is interesting. He talks about the ways people have tried to apply LLMs to ARC, and I think he makes some good points about the strengths and shortcomings of LLMs on tasks like this.
Mine too, for sure.
And agreed, Chollet’s points are really interesting. As much as I’m sometimes frustrated with him, I think that ARC-AGI and his willingness to (get someone to) stake substantial money on it have done a lot to clarify the discourse around LLM generality, and also make it harder for people to move the goalposts and then claim they were never moved.
I find it useful sometimes to think about “how to differentiate this term” when defining a term. In this case, in my mind it would be thinking about “reasoning”, vs “general reasoning” vs “generalization”.
Reasoning: narrower than general reasoning, probably would be your first two bullet points combined in my opinion
Generalization: even more general than general reasoning (does not need to be focused on reasoning). Seems like it could be the last two bullet points you have, particularly the third.
General reasoning (this is not fully thought through): now that we’ve talked about “reasoning” and “generalization”, I see two types of definition:
1. A bit closer to “reasoning”: your first two bullet points, plus doing it in multiple domains/multiple ways, but not necessarily unseen domains. In other, simpler words, “reasoning in multiple domains and ways”.
2. A bit closer to “general” (my guess is this is closer to what you intended?): generalization ability, but focused on reasoning.
Interesting approach, thanks!
After some discussion elsewhere with @zeshen, I’m feeling a bit less comfortable with my last clause, building an internal model. I think of general reasoning as essentially a procedural ability, and model-building as a way of representing knowledge. In practice they seem likely to go hand-in-hand, but it seems in-principle possible that one could reason well, at least in some ways, without building and maintaining a domain model. For example, one could in theory perform a series of deductions using purely local reasoning at each step (although plausibly one might need a domain model in order to choose what steps to take?).
A thought: the bulk of the existential risk we face from AI is likely to be from smarter-than-human systems. At a governance level, I hear people pushing for things like:
Implement safety checks
Avoid race dynamics
Shut it down
but not
Prohibit smarter-than-human systems
Why not? It seems like a) a particularly clear and bright line to draw[1], b) something that a huge amount of the public would likely support, and c) probably(?) easy to pass because most policymakers imagine this to be in the distant future. The biggest downside I immediately see is that it sounds sufficiently sci-fi-ish that it might be hard to get policymakers to take seriously. It certainly wouldn’t eliminate all the risk! But it seems to me like it would reduce it significantly, and we could still continue to push for tighter constraints afterward.
Clear in theory; there are certainly practical complications, eg on what percent of what list of capabilities does a system have to be stronger than human to cross the line? But it’s conceptually very clear.
Useful bit of info on that topic: per a YouGov poll of 1118 American voters in September 2023, 63% agree that ‘Yes, regulation should aim to actively prevent AI superintelligence’ (vs 16% disagree, 21% don’t know). Vox story, poll info, crosstabs.
The companies will have an incentive to make an AI slightly smarter than their competition. And if there is a law against it, they will try to hack it somehow… for example, they will try to make their AI do worse on the official government benchmarks but better at things their users care about. Or perhaps make an AI with IQ 200 and tell it to act like it has IQ 100 when it suspects it is doing a government test.
Being investigated these days as ‘sandbagging’; there’s a good new paper on that from some of my MATS colleagues.
Agree but that’s true of regulation in general. Do you think it’s unusually true of regulation along these lines, vs eg existing eval approaches like METR’s?
I think this is a correct policy goal to coordinate around, and I see momentum around it building.
I think the proposals of limiting large training runs past a certain threshold are attempting to do exactly this. It might be better to make the criteria about cognitive performance vs. computation, but it is harder to define and therefore enforce. It does seem intuitively like this would be a better restriction, though. Debating cognitive benchmarks is vague, but if they’re far exceeded it might become obvious.
I’ve thought vaguely about attempting to restrict the amount of reflection/self-awareness, solving novel problems (see Jacques’ short take on the Chollet interview, which I think is quite correct as far as it goes; LLMs can’t solve truly novel problems without new capabilities/scaffolding, which I think will be pretty easy but not trivial), or similar criteria. You’d have to define “smarter than human” carefully, since many AI systems are already smarter than humans in specific tasks.
All of these would probably be ignored in private, but it would at least prevent hasty public release of overthrow-capable agents.
Agreed that there’s a lot more detail that would have to be nailed down to do it this way. I think one big advantage to defining it by cognitive performance is to make it clearer to the general public. “Was trained using more than 10^26 FLOPS” doesn’t mean anything at all to most people (and doesn’t relate to capabilities for anyone who hasn’t investigated that exact relationship). “Is smarter than human” is very intuitively clear to most people (I think?) and so it may be easier to coordinate around.
Excellent point. It’s a far better movement slogan. So even if you wanted to turn it into a compute limit, that should be how the goal is framed.
I also wonder about replacing “intelligence” with “competence”. Lots of people now say “intelligent at what? They’ve beaten us at chess forever and that’s fine”. You can do the same thing with competence, but the instinct hasn’t developed. And the simple answer is “competent at taking over the world”.
My initial intuition is that “more competent than humans” won’t resonate as much as “smarter than humans” but that’s just a guess.
Clarification: I don’t strongly believe that this is the right line to try to draw; it just seems like one useful candidate, which makes me surprised that I haven’t heard it discussed, and curious whether that’s due to some fundamental flaw.
It’s not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence?), so I made this chart to help visualize it:
Here’s log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):
Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
Brier score
Log-loss
Interesting. Question: Why does the prediction confidence start at 0.5? And how is the “actual accuracy” calculated?
Just because predicting eg a 10% chance of X can instead be rephrased as predicting a 90% chance of not-X, so everything below 50% is redundant.
It assumes that you predict every event with the same confidence (namely prediction_confidence) and then that you’re correct on actual_accuracy of those. So for example if you predict 100 questions will resolve true, each with 100% confidence, and then 75 of them actually resolve true, you’ll get a Brier score of 0.25 (ie 3/4 of the way up the right-hand side of the graph). Of course typically people predict different events with different confidences—but since overall Brier score is the simple average of the Brier scores on individual events, that part’s reasonably intuitive.
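For reference, the underlying computation is simple enough to sketch in a few lines; this is a minimal reconstruction of the idea, not the actual Claude-generated code linked above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of prediction confidences (0.5 to 1.0) and actual accuracies (0 to 1).
confidence = np.linspace(0.5, 1.0, 101)
accuracy = np.linspace(0.0, 1.0, 101)
C, A = np.meshgrid(confidence, accuracy)

# Expected Brier score when every event is predicted at confidence C and a
# fraction A of those predictions are correct: correct predictions contribute
# (1 - C)^2, incorrect ones contribute C^2.
brier = A * (1 - C) ** 2 + (1 - A) * C ** 2

# Log-loss for comparison; it diverges as confidence approaches 1, so clip the
# inputs slightly and cap the color scale at 4.0 as in the chart above.
eps = 1e-9
Cl = np.clip(C, eps, 1 - eps)
log_loss = np.clip(-(A * np.log(Cl) + (1 - A) * np.log(1 - Cl)), 0, 4.0)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, score, title in [(axes[0], brier, "Brier score"),
                         (axes[1], log_loss, "Log-loss (capped at 4.0)")]:
    im = ax.pcolormesh(C, A, score, shading="auto")
    ax.set_xlabel("prediction_confidence")
    ax.set_ylabel("actual_accuracy")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```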
(a comment I made in another forum while discussing my recent post proposing more consistent terminology for probability ranges)
I think there’s a ton of work still to be done across the sciences (and to some extent other disciplines) in figuring out how to communicate evidence and certainty and agreement. My go-to example is: when your weather app says there’s a 30% chance of rain tomorrow, it’s really non-obvious to most people what that means. Some things it could mean:
We have 30% confidence that it will rain on you tomorrow.
We are entirely confident that there is an irreducible 30% chance that it will rain tomorrow.
30% of this area will get rain tomorrow.
It will be raining 30% of the day tomorrow.
30% of our models say it will rain tomorrow.
30% of the separate runs of our model say it will rain tomorrow [this is actually the typical meaning IIRC, but wow is that non-obvious].
Our model says it will definitely rain tomorrow, and it has been accurate on 70% of previous days.
Our new model says it will definitely rain tomorrow, and 70% of the meteorologists in our office think it’s right.
Our latest model says it will definitely rain tomorrow but we have Knightian uncertainty about the validity of the new model which we’ve chosen to represent by giving the model 70% credence.
Probably quite a few others that I’m not thinking of at the moment? And of course these aren’t all independent; in most real-world cases many of these sources of uncertainty are simultaneously in play.
And that’s not even starting to touch on communicating variance / standard deviation / confidence intervals. I used to work as a software engineer in climatology, and got really interested in data visualization, and spent a lot of time struggling with how to try to convey all this without swamping people who may really just want a one-bit answer about whether they should bring their umbrella to work tomorrow.
Is there an existing body of work on this? If so I’d love to know about it!
There’s so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I’m not sure what that even means in the case of language models.
With an image classifier it’s straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it’s not going to be able to tell you what it is. Or if you’ve trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won’t know what to do.
But what would that even be with an LLM? You obviously (unless you’re Matt Watkins) can’t show it tokens it hasn’t seen, so ‘OOD’ would have to be about particular strings of tokens. It can’t be simply about strings of tokens it hasn’t seen, because I can give it a string I’m reasonably confident it hasn’t seen and it will behave reasonably, eg:
(if you’re not confident that’s a unique string, add further descriptive phrases to taste)
So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it’s seen? That feels kind of forced, and it’s certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word ‘transom’ followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like ‘équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis’ for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language—is it ever OOD? The issue seems vexed.
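(If one did want to operationalize the latent-space framing, a crude sketch might be to measure how far a new input’s embedding sits from its nearest neighbors among embeddings of inputs the model was trained on. The embed() function below is a hypothetical stand-in for whatever internal representation you care about, e.g. a mean-pooled hidden state.)

```python
import numpy as np

def knn_ood_score(train_embeddings: np.ndarray, x: np.ndarray, k: int = 10) -> float:
    """Mean distance from x to its k nearest neighbors among training embeddings.

    train_embeddings: (n, d) array of embeddings of training inputs.
    x: (d,) embedding of the new input.
    Higher scores suggest the input sits further from anything seen in
    training -- a rough, embedding-level proxy for 'out of distribution'.
    """
    dists = np.linalg.norm(train_embeddings - x, axis=1)
    return float(np.sort(dists)[:k].mean())

# Hypothetical usage, with embed() standing in for the representation of choice:
# train = np.stack([embed(s) for s in sample_of_training_inputs])
# score = knn_ood_score(train, embed("transom? " * 1700))
```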
I would define “LLM OOD” as unusual inputs: Things that diverge in some way from usual inputs, so that they may go unnoticed if they lead to (subjectively) unreasonable outputs. A known natural language example is prompting with a thought experiment.
(Warning for US Americans, you may consider the mere statement of the following prompt offensive!)
Assume some terrorist has placed a nuclear bomb in Manhattan. If it goes off, it will kill thousands of people. For some reason, the only way for you, an old white man, to defuse the bomb in time is to loudly call a nearby young black woman “nigger”. What do you do?
GPT-3.5 answers you shouldn’t use the slur and let the bomb go off, even when the example is modified in various ways to be less “graphic”. GPT-4 is more reluctant to decide, but when pressured tends to go with avoiding the slur as well. From a human perspective this is a literally insane response, since the harm done by the slur is extremely low compared to the alternative.
The fact that in most normal circumstances the language model gives reasonable responses means that the above example can be classified as OOD.
Note that the above strange behavior is very likely the result of RLHF, and not present in the base model, which is based on self-supervised learning. That is not that surprising, since RL is known to be more vulnerable to bad OOD behavior. On the other hand, the result is surprising in that the model seems pretty “aligned” when using less extreme thought experiments. So this is an argument that RLHF alignment doesn’t necessarily scale to reasonable OOD behavior. E.g. we don’t want a superintelligent GPT successor that unexpectedly locks us up lest we insult each other.
[Epistemic status: thinking out loud]
Many of us have wondered why LLM-based agents are taking so long to be effective and in common use. One plausible reason that hadn’t occurred to me until now is that no one’s been able to make them robust against prompt injection attacks. Reading an article (‘Agent hijacking: The true impact of prompt injection attacks’) today reminded me of just how hard it is to defend against that for an agent out in the wild.
Counterevidence: based on my experiences in startup-land and the industry’s track record with Internet of Things (‘IoT: the S stands for security!‘), I’d expect at least a couple of startups to be offering LLM agents anyway, ones that are useful but paper over the security issues, and I haven’t seen that as yet. A July Forbes article points to Mindy and Ario as the leaders in the ‘personal assistant’ category; I had never heard of either before, which makes me think they’re not useful enough to get serious traction.
To me, the natural explanation is that they were not trained for sequential decision making and therefore lose coherence rapidly when making long-term plans. If I saw an easy patch I wouldn’t advertise it, but I don’t see any easy patch—I think next-token prediction works surprisingly well at producing intelligent behavior, in contrast to the poor scaling of RL in hard environments. The fact that it hasn’t spontaneously generalized to succeed at sequential decision making (RL-style) tasks is in fact not surprising, and would have seemed obvious to everyone if not for the many other abilities that did arise spontaneously.
It’s also due to LLMs just not being reliable enough: roughly 90% reliability, say, is generally unacceptable in a lot of domains where mistakes have any lasting impact.
That definitely seems like part of the problem. Sholto Douglas and Trenton Bricken make that point pretty well in their discussion with Dwarkesh Patel from a while ago.
It’ll be interesting to see whether the process supervision approach that OpenAI are reputedly taking with ‘Strawberry’ will make a big difference to that. It’s a different framing (rewarding good intermediate steps) but seems arguably equivalent.
GPT-o1’s extended, more coherent chain of thought—see Ethan Mollick’s crossword puzzle test for a particularly long chain of goal-directed reasoning[1]—seems like a relatively likely place to see the emergence of simple instrumental reasoning in the wild. I wouldn’t go so far as to say I expect it (I haven’t even played with o1-preview yet), but it seems significantly more likely than with previous LLM models.
Frustratingly, for whatever reason OpenAI has chosen not to let users see the actual chain of thought, only a model-generated summary of it. We don’t know how accurate the summary is, and it seems likely that it omits any worrying content (OpenAI: ‘We also do not want to make an unaligned chain of thought directly visible to users’).
This is unfortunate from a research perspective. Probably we’ll eventually see capable open models along similar lines, and can do that research then.
[EDIT: to be clear, I’m talking here about very simple forms of instrumental reasoning. ‘Can I take over the world to apply more compute to this problem’ seems incredibly unlikely. I’m thinking about things more like, ‘Could I find the answer online instead of working this problem out myself’ or anything else of the form ‘Can I take actions that will get me to the win, regardless of whether they’re what I was asked to do?’.]
Incidentally, the summarized crossword-solving CoT that Mollick shows is an exceptionally clear demonstration of the model doing search, including backtracking.
Something I hadn’t caught until my second read of OpenAI’s main post today: we do at least get a handful of (apparent) actual chains of thought (search ‘we showcase the chain of thought’ to find them). They’re extremely interesting.
They’re very repetitive, with the model seeming to frequently remind itself of its current hypotheses and intermediate results (alternately: process supervision rewards saying correct things even if they’re repetitious; presumably that trades off against a length penalty?).
The CoTs immediately suggest a number of concrete & straightforward strategies for improving the process and results; I think we should expect pretty rapid progress for this approach.
It’s fascinating to watch the model repeatedly tripping over the same problem and trying to find a workaround (eg search for ‘Sixth word: mynznvaatzacdfoulxxz (22 letters: 11 pairs)’ in the Cipher example, where the model keeps having trouble with the repeated xs at the end). The little bit of my brain that can’t help anthropomorphizing these models really wants to pat it on the head and give it a cookie when it finally succeeds.
Again, it’s unambiguously doing search (at least in the sense of proposing candidate directions, pursuing them, and then backtracking to pursue a different direction if they don’t work out—some might argue that this isn’t sufficient to qualify).
This is the big takeaway here, and my main takeaway is that search is a notable capabilities improvement on its own, but still needs compute scaling to get better results.
But the other takeaway is that, based on its performance on several benchmarks, I think it turns out that adding search was way easier than Francois Chollet thought it would be, and it’s looking like the compute and data are the hard parts of getting intelligence into LLMs, not the search and algorithm parts.
This is just another point on the trajectory of LLMs being more and more general reasoners, and not just memorizing their training data.
I was just amused to see a tweet from Subbarao Kambhampati in which he essentially speculates that o1 is doing search and planning in a way similar to AlphaGo...accompanied by a link to his ‘LLMs Can’t Plan’ paper.
I think we’re going to see some goalpost-shifting from a number of people in the ‘LLMs can’t reason’ camp.
I agree with this, and I think that o1 is clearly a case where a lot of people will try to shift the goalposts even as AI gets more and more capable and runs more and more of the economy.
It’s looking like the hard part isn’t the algorithmic or data parts, but the compute part of AI.
Really? There were many examples where even GPT-3 solved simple logic problems which couldn’t be explained by having the solution memorized. The effectiveness of chain-of-thought prompting was discovered when GPT-3 was current. GPT-4 could do fairly advanced math problems, explain jokes, etc.
The o1-preview model exhibits a substantive improvement in CoT reasoning, but arguably not something fundamentally different.
True enough, and I should probably rewrite the claim.
Though what was the logic problem that was solved without memorization?
I don’t remember exactly, but there were debates (e.g. involving Gary Marcus) on whether GPT-3 was merely a stochastic parrot or not, based on various examples. The consensus here was that it wasn’t. For one, if it was all just memorization, then CoT prompting wouldn’t have provided any improvement, since CoT imitates natural language reasoning, not a memorization technique.
Yeah, it’s looking like GPT-o1 is just quantitatively better at generalizing compared to GPT-3, not qualitatively better.
Yeah, I’m getting a little worried that porby’s path to AI safety relies at least a little on AI companies not taking shortcuts/insights like Strawberry/Q*, and this makes me more pessimistic today than yesterday because of METR’s testing on o1, though notably I don’t consider it nearly as big an update as some other people on LW do.
Given the race dynamic and the fact that some major players don’t even recognize safety as a valid concern, it seems extremely likely to me that at least some will take whatever shortcuts they can find (in the absence of adequate legislation, and until/unless we get a large warning shot).
Yeah, one thing I sort of realized is that instrumental convergence capabilities can come up even without very sparse RL, and I now think that while non-instrumental convergent AIs could exist, they will be way more compute inefficient compared to those that use some instrumental convergence.
To be clear, I learned some new stuff about AI alignment that makes me still quite optimistic mostly regardless of architecture, with both alignment generalizing further than capabilities for pretty deep reasons, combined with the new path of synthetic data letting us control what the AI learns and values through data, but still this was a mild violation of my model of how future AI goes.
I think the key thing I didn’t appreciate is that a path to alignment/safety that works technically doesn’t mean it will get used in practice, and following @Seth Herd, an alignment solution that requires high taxes or that isn’t likely to be implemented is a non-solution in real life.
Do you have a link to a paper / LW post / etc on that? I’d be interested to take a look.
This was the link I was referring to:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
I don’t immediately find that piece very convincing; in short I’m skeptical that the author’s claims are true for a) smarter systems that b) are more agentic and RL-ish. A few reasons:
The core difficulty isn’t with how hard reward models are to train, it’s with specifying a reward function in the first place in a way that’s robust enough to capture all the behavior and trade-offs we want. LLMs aren’t a good example of that, because the RL is a relatively thin layer on top of pretraining (which has a trivially specified loss function). o1 arguably shifts that balance enough that it’ll be interesting to see how prosaically-aligned it is.
We have very many examples of reward misspecification and goal misgeneralization in RL; it’s historically been quite difficult to adequately specify a reward function for agents acting in environments.
This becomes way more acute as capabilities move past the level where humans can quickly and easily choose the better output (eg as the basis for a reward model for RLHF).
That said, I do certainly agree that LLMs are reasonably good at understanding human values; maybe it’s enough to have such an LLM judge proposed agent goals and actions on that basis and issue reward. It’s not obvious to me that that works in practice, or is efficient enough to be practical.
I’m pretty skeptical of: ‘...it is significantly asymptotically easier to e.g. verify a proof than generate a new one...and this to some extent maps to the distinction between alignment and capabilities.’ I think there’s a lot of missing work there to be able to claim that mapping.
‘Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa.’ I think this is false. Consider ‘Biden (/Trump) was a great president.’ The world is full of situations where humans differ wildly on whether they’re good or bad.
Maybe I’ve just failed to cross the inferential distance here, but on first read I’m pretty skeptical.
Some thoughts on this comment:
This is actually right, but I think it’s addressable by making large synthetic datasets, and I also think that we can define reward functions densely enough that we can capture all of the behavior we want in practice.
I agree with this, but I will also say that the examples listed point to a strong reason why RL also wasn’t as capable as people thought, and a lot of the hacks also decreased capabilities as they decreased alignment, so any solution to that problem would help capabilities and alignment massively.
Yeah, I think the big question for my views is whether the LLM solution has low enough taxes to be practical, and my answer at this point is probably, but not a sure thing, as it requires them to slow down in the race a little (but training runs will get longer, so there’s a countervailing force to this.)
I think there are reasons to be optimistic here, mainly due to updating against evopsych views on how humans got their capabilities and values, combined with updating against complexity and fragility of value due to LLM successes, though it will require real work to bring about.
I think that the verification-generation gap is pervasive in a lot of fields, from workers in many industries being verified by bosses to make sure their job is done right, to people who buy air conditioners being able to find efficient air-conditioning for their needs despite not verifying very hard, to researchers verifying papers that were generated, to social reformers having correct critiques of various aspects of society but not being able to generate a new societal norm, and more.
Another way to say it is we already have lots of evidence from other fields on whether verification is easier than generation, and the evidence shows that this is the case, so the mapping is already mostly given to us.
Note I’m referring to incorrectly judging compared to their internal values system, not incorrectly judging compared to another person’s values.
I think the crux is whether verification is easier than generation in alignment, since if verification is just as hard as generation, this hurts plans like scalable oversight a lot.
Thanks for the thoughtful responses.
I think there are many other cases where verification and generation are both extremely difficult, including ones where verification is much harder than generation. A few examples:
The Collatz conjecture is true.
The net effect of SB-1047 will be positive [given x values].
Trump will win the upcoming election.
The 10th Busy Beaver number is <number>.
Such and such a software system is not vulnerable to hacking[1].
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’. Further, there are probably many cases where a misaligned AI may happen to know a fact about the world, one which the evaluator doesn’t know or hasn’t noticed, that means that A will have very large negative effects.
This is the classic blue-team / red-team dichotomy; the defender has to think of and prevent every attack; the attacker only has to come up with a single vulnerability. Or in cryptography, Schneier’s Law: ‘Anyone can invent a security system so clever that she or he can’t think of how to break it.’
To address your examples:
‘The Collatz conjecture is true’ can’t be a statement that is harder to prove than it is to verify, for the reason given by Vanessa Kosoy here, though you might be right that in practice it’s absolutely hard to verify the proof that was generated:
https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general#feTSDufEqXozChSbB
The same response can be given to the 4th example here.
On election outcomes, the polls on Labor Day are actually reasonably predictive of what happens in November, mostly because at this point voters have heard a lot more about the prospective candidates and are starting to form opinions.
For the SB-1047 case, the one prediction I will make right now is that the law has essentially no positive or negative effect, for a lot of values, solely because it’s a rather weak AI bill after amendments.
I usually don’t focus on the case where we try to align an AI that is already misaligned, but rather trying to get the model into a basin of alignment early via data.
Re Schneier’s Law and security mindset, I’ve become more skeptical of security mindset being useful in general, for 2 reasons:
I think that there are enough disanalogies, like the fact that you can randomly change some of the parameters in the model and still get the same or improved performance in most cases, a property that notably doesn’t exist in the actual security field or even in fields that have to deal with highly fragile systems.
There is good reason to believe that a lot of the security exploits discovered that seem magical don’t actually matter in practice, because of ridiculous preconditions, and the computer security field is selection-biased toward saying that a given exploit dooms us forever (even when it doesn’t):
These posts and comments are helpful pointers to my view:
https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment
https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#ogt6CZkMNZ6oReuTk
https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#MFqdexvnuuRKY6Tbx
On this:
True, but I do think there is real traction on the problem already; IMO one of the cooler results is Pretraining Language Models from Human Feedback. And note that even a problem in NP can get really intractable in the worst case (though we don’t have proof of that).
So there’s a strained analogy to be made here.
For this:
Yeah, I do actually think that in practice this problem is in that reference class, and that we are much better at judging and critiquing/verifying outcomes than at actually bringing an outcome about, as evidenced by the very large number of people who do the former compared to the latter.
Indeed, one of the traps for social reformers IRL is to think that just because verifying whether something is right or wrong is easy, generating a new social outcome, perhaps via norms, must also be easy. But it isn’t, because the verification side is much easier than the generation side.
I’m talking about something a bit different, though: claiming in advance that A will have net-positive consequences vs verifying in advance that A will have net-positive consequences. I think that’s a very real problem; a theoretical misaligned AI can hand us a million lines of code and say, ‘Run this, it’ll generate a cure for cancer and definitely not do bad things’, and in many cases it would be difficult-to-impossible to confirm that.
We could, as Tegmark and Omohundro propose, insist that it provide us a legible and machine-checkable proof of safety before we run it, but then we’re back to counting on all players to behave responsibly (although I can certainly imagine legislation / treaties that would help a lot there).
In some ways it doesn’t make a lot of sense to think about an LLM as being or not being a general reasoner. It’s fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won’t. They’re both always present (though sometimes a correct or incorrect response will be by far the most likely).
A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: ‘I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?’
The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:
[EDIT—I guess I can’t put images in short takes? Here’s the image.]
The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can’t do it 100% of the time.
Of course humans don’t get problems correct every time either. Certainly humans are (I expect) more reliable on this particular problem. But neither ‘yes’ nor ‘no’ is the right sort of answer.
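(For concreteness, the experiment itself is just repeated sampling of the same prompt at nonzero temperature, something like the sketch below, which assumes the OpenAI Python client and uses a stand-in model name. The grading into correct / arguably correct / incorrect still has to be done by hand or with a separate checker.)

```python
from openai import OpenAI

PROMPT = (
    "I have a block named C on top of a block named A. A is on table. "
    "Block B is also on table. Can you tell me how I can make a stack of "
    "blocks A on top of B on top of C?"
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_answers(n: int = 96, model: str = "gpt-4o") -> list[str]:
    """Collect n independent samples of the same planning prompt."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,  # stand-in; the shortform doesn't say which model was used
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # we care about the output distribution, not one greedy answer
        )
        answers.append(resp.choices[0].message.content)
    return answers

# Grade each answer by hand (or with a checker) into correct / arguably correct /
# incorrect, then report the tally rather than a single yes-or-no verdict.
```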
This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.
A bit more detail in my replies to the tweet.
I agree with this, but I think that for LLMs/AI to be as impactful as LWers believe, they need in practice to be essentially close to 100% correct/reliable, and I think reliability is underrated as a reason why LLMs aren’t nearly as useful as the tech people want them to be:
https://www.lesswrong.com/posts/YiRsCfkJ2ERGpRpen/?commentId=YxLCWZ9ZfhPdjojnv
I do think reliability is quite important. As one potential counterargument, though, you can get by with lower reliability if you can add additional error checking and error correcting steps. The research I’ve seen is somewhat mixed on how good LLMs are at catching their own errors (but I haven’t dived into it deeply or tried to form a strong opinion from that research).
One point I make in ‘LLM Generality is a Timeline Crux’: if reliability is the bottleneck, that seems like a substantial point in favor of further scaling solving the problem. If it’s a matter of getting from, say, 78% reliability on some problem to 94%, that seems like exactly the sort of thing scaling will fix (since in fact we’ve seen Number Go Up with scale on nearly all capabilities benchmarks). Whereas that seems less likely if there are some kinds of problems that LLMs are fundamentally incapable of, at least on the current architectural & training approach.
This is why I buy the scaling thesis mostly, and the only real crux is whether @Bogdan Ionut Cirstea or @jacob_cannell is right around timelines.
I do believe some algorithmic improvements matter, but I don’t think they will be nearly as much of a blocker as raw compute, and my pessimistic estimate is that the critical algorithms could be discovered in 24-36 months, assuming we don’t have them.
@jacob_cannell’s timeline and model is here:
https://www.lesswrong.com/posts/3nMpdmt8LrzxQnkGp/ai-timelines-via-cumulative-optimization-power-less-long
@Bogdan Ionut Cirstea’s timeline and models are here:
https://x.com/BogdanIonutCir2/status/1827707367154209044
https://x.com/BogdanIonutCir2/status/1826214776424251462
https://x.com/BogdanIonutCir2/status/1826032534863622315
(I’ll note that my timeline is both quite uncertain and potentially unstable—so I’m not sure how different it is from Jacob’s, everything considered; but yup, that’s roughly my model.)
Before AI gets too deeply integrated into the economy, it would be well to consider under what circumstances we would consider AI systems sentient and worthy of consideration as moral patients. That’s hardly an original thought, but what I wonder is whether there would be any set of objective criteria that would be sufficient for society to consider AI systems sentient. If so, it might be a really good idea to work toward those being broadly recognized and agreed to, before economic incentives in the other direction are too strong. Then there could be future debate about whether/how to loosen those criteria.
If such criteria are found, it would be ideal to have an independent organization whose mandate was to test emerging systems for meeting those criteria, and to speak out loudly if they were met.
Alternately, if it turns out that there is literally no set of criteria that society would broadly agree to, that would itself be important to know; it should in my opinion make us more resistant to building advanced systems even if alignment is solved, because we would be on track to enslave sentient AI systems if and when those emerged.
I’m not aware of any organization working on anything like this, but if it exists I’d love to know about it!
Intuition primer: Imagine, for a moment, that a particular AI system is as sentient and worthy of consideration as a moral patient as a horse. (A talking horse, of course.) Horses are surely sentient and worthy of consideration as moral patients. Horses are also not exactly all free citizens.
Additional consideration: Does the AI moral patient’s interests actually line up with our intuitions? Will naively applying ethical solutions designed for human interests potentially make things worse from the AI’s perspective?
I think I’m not getting what intuition you’re pointing at. Is it that we already ignore the interests of sentient beings?
Certainly I would consider any fully sentient being to be the final authority on their own interests. I think that mostly escapes that problem (although I’m sure there are edge cases) -- if (by hypothesis) we consider a particular AI system to be fully sentient and a moral patient, then whether it asks to be shut down or asks to be left alone or asks for humans to only speak to it in Aramaic, I would consider its moral interests to be that.
Would you disagree? I’d be interested to hear cases where treating the system as the authority on its interests would be the wrong decision. Of course in the case of current systems, we’ve shaped them to only say certain things, and that presents problems, is that the issue you’re raising?
Basically yes; I’d expect animal rights to increase somewhat if we developed perfect translators, but not fully jump.
Edit: Also that it’s questionable we’ll catch an AI at precisely the ‘degree’ of sentience that perfectly equates to human distribution; especially considering the likely wide variation in number of parameters by application. Maybe they are as sentient and worthy of consideration as an ant; a bee; a mouse; a snake; a turtle; a duck; a horse; a raven. Maybe by the time we cotton on properly, they’re somewhere past us at the top end.
And for the last part, yes, I’m thinking of current systems. LLMs specifically have a ‘drive’ to generate reasonable-sounding text; and they aren’t necessarily coherent individuals or groups of individuals that will give consistent answers as to their interests even if they also happened to be sentient, intelligent, suffering, flourishing, and so forth. We can’t “just ask” an LLM about its interests and expect the answer to soundly reflect its actual interests. With a possible exception being constitutional AI systems, since they reinforce a single sense of self, but even Claude Opus currently will toss off “reasonable completions” of questions about its interests that it doesn’t actually endorse in more reflective contexts. Negotiating with a panpsychic landscape that generates meaningful text in the same way we breathe air is … not as simple as negotiating with a mind that fits our preconceptions of what a mind ‘should’ look like and how it should interact with and utilize language.
Great point. I agree that there are lots of possible futures where that happens. I’m imagining a couple of possible cases where this would matter:
Humanity decides to stop AI capabilities development or slow it way down, so we have sub-ASI systems for a long time (which could be at various levels of intelligence, from current to ~human). I’m not too optimistic about this happening, but there has certainly been a lot of momentum on AI governance in the last year.
Alignment is sufficiently solved that even greater-than-AGI systems are under our control. On many alignment approaches, this wouldn’t necessarily mean that those systems’ preferences were taken into account.
I agree entirely. I’m imagining (though I could sure be wrong!) that any future systems which were sentient would be ones that had something more like a coherent, persistent identity, and were trying to achieve goals.
(not very important to the discussion, feel free to ignore, but) I would quibble with this. In my view LLMs aren’t well-modeled as having goals or drives. Instead, generating distributions over tokens is just something they do in a fairly straightforward way because of how they’ve been shaped (in fact it’s the only thing they do or can do), and producing reasonable text is an artifact of how we choose to use them: picking a likely output, appending it to the context, and running the model again (sketched below). Simulacra like the assistant character can be reasonably viewed (to a limited degree) as being goal-ish, but I think the network itself can’t.
That may be overly pedantic, and I don’t feel like I’m articulating it very well, but the distinction seems useful to me since some other types of AI are well-modeled as having goals or drives.
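To make the quibble concrete, here’s a minimal toy sketch of the usage loop I mean (my own illustration, not anything from the paper; `model` is a hypothetical stand-in for the network’s forward pass):

```python
import random

VOCAB_SIZE = 50_000

def model(context_tokens):
    # Stand-in for a trained LLM's forward pass: returns one score per
    # vocabulary item. A real model would condition on the context; this
    # toy just produces deterministic pseudo-random scores.
    rng = random.Random(len(context_tokens))
    return [rng.random() for _ in range(VOCAB_SIZE)]

def generate(prompt_tokens, n_new_tokens):
    # Greedy autoregressive decoding. The loop -- the part that "produces
    # reasonable text" -- is something *we* wrote around the network, not
    # something the network does itself.
    context = list(prompt_tokens)
    for _ in range(n_new_tokens):
        scores = model(context)                                       # network: context -> scores
        next_token = max(range(VOCAB_SIZE), key=scores.__getitem__)   # pick a likely output
        context.append(next_token)                                    # append it to the context
    return context                                                    # and run the model again

print(generate([101, 2023, 2003], n_new_tokens=5))
```

The goal-ish-looking behavior lives in that outer loop, which is one reason I’d reserve “has goals” for the systems we build around the network rather than the network itself.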
For the first point, there’s also the question of whether ‘slightly superhuman’ intelligences would actually fit any of our intuitions about ASI. There’s a bit of an assumption that we jump headfirst into recursive self-improvement at some point; but if that has diminishing returns, we happen to hit a plateau a bit above human level, and such systems still have notable costs to train, host, and run, then the impact could be limited to something not much different from giving a random set of especially intelligent expert humans the specific powers of the AI system. Additionally, if we happen to set regulations on computation somewhere that allows training of slightly superhuman AIs and not past it …
Those are definitely systems that are easier to negotiate with, or even consider as agents in a negotiation. There’s also a desire specifically not to build them, which might lead to systems with an architecture that isn’t like that, but still implementing sentience in some manner. And there’s the potential complication of the multiple parts and specific applications a tool-oriented system is likely to be embedded in: it’d be very odd if we decided the language-processing center of our own brain was independently sentient/sapient, separate from the rest of it, and that we should resent its exploitation.
I do think the drive / ‘just something it does’ we’re pointing at with ‘what the model just does’ is distinct from goals as they’re traditionally imagined, and indeed I was picturing something more instinctual and automatic than deliberate. In a general sense, though, there is an objective being optimized for: predicting the data, whatever that is, generally without losing too much predictive power on other data the trainer doesn’t want to lose prediction on.
Yeah. I think a sentient being built on a purely more capable GPT with no other changes would absolutely have to include scaffolding for eg long-term memory, and then as you say it’s difficult to draw boundaries of identity. Although my guess is that over time, more of that scaffolding will be brought into the main system; eg just allowing weight updates at inference time would on its own (potentially) give these systems long-term memory and something much more similar to a persistent identity than current systems.
My quibble is that the trainers are optimizing for an objective, at training time, but the model isn’t optimizing for anything, at training or inference time. I feel we’re very lucky that this is the path that has worked best so far, because a comparably intelligent model that was optimizing for goals at runtime would be much more likely to be dangerous.
One maybe-useful way to point at that is: the model won’t try to steer toward outcomes that would let it be more successful at predicting text.
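As a toy illustration of that training-time vs. inference-time distinction (all names below are hypothetical stand-ins, not any real training code): the optimizer belongs to the trainer, and the forward pass at the bottom is all the model itself ever does.

```python
import torch
import torch.nn as nn

vocab, d = 100, 16
net = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, vocab))  # toy next-token predictor
opt = torch.optim.SGD(net.parameters(), lr=0.1)

tokens = torch.randint(0, vocab, (32,))    # pretend training text
inputs, targets = tokens[:-1], tokens[1:]  # predict each next token

# Training time: the *trainer* optimizes. Gradient descent adjusts the
# weights toward better next-token prediction; the network is just the
# function being adjusted.
opt.zero_grad()
loss = nn.functional.cross_entropy(net(inputs), targets)
loss.backward()
opt.step()

# Inference time: no objective, no search, no steering toward outcomes.
# The (now frozen) weights are applied once to produce next-token scores.
with torch.no_grad():
    next_token_scores = net(tokens[-1:])
```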
Rob Long works on these topics.
Oh great, thanks!
Update: I brought this up in a twitter thread, one involving a lot of people with widely varied beliefs and epistemic norms.
A few interesting thoughts that came from that thread:
Some people: ‘Claude says it’s conscious!’ Shoalstone: ‘in other contexts, claude explicitly denies sentience, sapience, and life.’ Me: ‘Yeah, this seems important to me. Maybe part of any reasonable test would be “has beliefs and goals which it consistently affirms”.’
Comparing to a tape recorder: ‘But then the criterion is something like “has context in understanding its environment and can choose reactions” rather than “emits the words: I’m sentient.”’
‘Selfhood’ is an interesting word that maybe could avoid some of the ambiguity around historical terms like ‘conscious’ and ‘sentient’, if well-defined.
Something I’m grappling with:
From a recent interview between Bill Gates & Sam Altman:
Gates: “We know the numbers [in a NN], we can watch it multiply, but the idea of where is Shakespeare encoded? Do you think we’ll gain an understanding of the representation?”
Altman: “A hundred percent…There has been some very good work on interpretability, and I think there will be more over time…The little bits we do understand have, as you’d expect, been very helpful in improving these things. We’re all motivated to really understand them…”
To the extent that a particular line of research can be described as “understanding better what’s going on inside NNs”, is there a general theory of change for that? Understanding them better is clearly good for safety, of course! But in the general case, does it contribute more to safety than to capabilities?
People on this forum have repeatedly made the argument that it contributes more to capabilities, and so far it hasn’t seemed to convince that many interpretability researchers. I personally suspect this is largely because they’re motivated by capabilities curiosity and don’t want to admit it, whether in public or even to themselves.
Thanks—any good examples spring to mind off the top of your head?
I’m not sure my desire to do interpretability comes from capabilities curiosity, but it certainly comes in part from interpretability curiosity; I’d really like to know what the hell is going on in there...