I am new to this website. I am also not a native English speaker, so pardon me in advance.
I am very sorry if it is considered rude on this forum not to start with an introduction post.
I am here because I am curious about the AI safety thing, and I do have a (light) ML background (though more from my studies than from my job). I have been reading this forum and adjacent ones for some weeks now, but despite all the posts I have read, I have so far failed to form a strong opinion on p(doom).
It is quite frustrating, to be honest, and I would like to have one.
I just cannot resist reacting to this post, because my prior (a very strong prior, 99%) is that ChatGPT 3, 4, or even 100 is not, cannot be, and will not be agentic or worrying, because in the end it is just an LLM predicting the most probable next words.
My impression is that the author of this post does not understand what an LLM is, but I give a 5% probability that, on the contrary, he understands something that I do not get at all.
For me, no matter how 'smart' the result looks, anthropomorphizing the LLM and worrying about it is a mistake.
I would really appreciate it if someone could send me a link to help me understand why I may be wrong.
In addition to what the other comments are saying:
If you get strongly superhuman LLMs, you can trivially accelerate scientific progress on agentic forms of AI like Reinforcement Learning by asking it to predict continuations of the most cited AI articles of 2024, 2025, etc. (have the year of publication, citation number and journal of publication as part of the prompt). Hence, at the very least, superhuman LLMs enable the quick construction of strong agentic AIs.
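To make the prompting idea concrete, here is a minimal sketch; everything in it is hypothetical (the metadata fields, the placeholder title, the stub model call), not a real model or API:

```python
# Purely illustrative sketch of conditioning a text predictor on metadata that
# suggests a highly cited future paper. hypothetical_llm_complete() is a stub.

def future_paper_prompt(journal: str, year: int, citations: int) -> str:
    """Build a prompt whose metadata implies a highly cited future RL paper."""
    return (
        f"Journal: {journal}\n"
        f"Year of publication: {year}\n"
        f"Citation count: {citations}\n"
        "Title: <title of a highly cited reinforcement-learning paper>\n\n"
        "Abstract:\n"
    )

def hypothetical_llm_complete(prompt: str) -> str:
    """Stub standing in for a (hypothetical) strongly superhuman text predictor."""
    return "<predicted abstract and paper body would go here>"

print(hypothetical_llm_complete(future_paper_prompt("<top ML venue>", 2026, 5000)))
```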
Second, the people who are building Bing Chat are really looking for ways to make it as agentic as possible, it’s already searching the internet, it’s gonna be integrated inside the Edge browser soon, and I’d bet that a significant research effort is going into making it interact with the various APIs available over the internet. All economic and research interests are pushing towards making it as agentic as possible.
Agree, and I would add, even if the oracle doesn’t accidentally spawn a demon that tries to escape on its own, someone could pretty easily turn it into an agent just by driving it with an external event loop.
I.e., ask it what a hypothetical agent would do (with, say, a text interface to the Internet) and then forward its queries and return the results to the oracle, repeat.
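A minimal sketch of that loop, with everything stubbed out (the oracle call and the action executor below are placeholders, not real APIs); the point is only how little glue code the conversion would take:

```python
# Sketch of wrapping a passive question-answering oracle in an external event
# loop so that its proposed actions actually get executed. All stubs.

def oracle_complete(prompt: str) -> str:
    """Stub standing in for a call to a passive text-prediction oracle."""
    return "search('example query')"

def run_action(action: str) -> str:
    """Stub standing in for executing the proposed action (web search, API call, ...)."""
    return "<observation from the outside world>"

def run_agent_loop(goal: str, max_steps: int = 5) -> str:
    history = f"Imagine a capable agent with a text interface to the Internet whose goal is: {goal}\n"
    for _ in range(max_steps):
        action = oracle_complete(history + "Next action:")   # ask the oracle
        observation = run_action(action)                      # act in the world
        history += f"Next action: {action}\nObservation: {observation}\n"  # feed results back
    return history

print(run_agent_loop("book a restaurant table"))
```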
With public access, someone will eventually try this. The conversion barrier is just not that high. Just asking an otherwise passive oracle to imagine what an agent might do just about instantiates one. If said imagined agent is sufficiently intelligent, it might not take very many exchanges to do real harm or even FOOM, and if the loop is automated (say, a shell script) rather than a human driving each step manually, it could run a great many exchanges on a very short time scale, potentially making even a somewhat less intelligent agent powerful enough to be dangerous.
I highly doubt the current Bing AI is yet smart enough to create an agent smart enough to be very dangerous (much less FOOM), but it is an oracle, with all that implies. It could be turned into an agent, and such an agent will almost certainly not be aligned. It would only be relatively harmless because it is relatively weak/stupid.
Update: See ChaosGPT and Auto-GPT.
Thank you for your answers.
Unfortunately, I have to say that they have not helped me so far to form a stronger view on AI safety.
(I feel very sympathetic with this post, for example: https://forum.effectivealtruism.org/posts/ST3JjsLdTBnaK46BD/how-i-failed-to-form-views-on-ai-safety-3 )
To rephrase, my prior is that LLMs just predict next words (it is their only capability). I would be worried if an LLM did something else (though I think it cannot happen); that is what I would call "misalignment".
In the meantime, a lot of what I read from people worrying about ChatGPT/Bing sounds like anthropomorphizing the AI, with the prior that it can be sentient or have "intents", and to me that is just not right.
I am not sure I understand how having the ability to search the internet dramatically changes that.
If an LLM, when p(next words) is too low, can "decide" to search the internet to get better inputs, I do not feel that it changes what I said above.
I do not want to get into a too long, fruitless discussion; I think I do indeed have to keep reading material on AI safety to better understand your models. But at this stage, to be honest, I cannot help thinking that some comments or posts are made by people who lack a basic understanding of what an LLM is, which may result in anthropomorphizing AI more than it should be. When you do not know what an LLM is, it is very easy to wonder, for example, "ChatGPT answered that, but he seems to say that so as not to hurt me; I wonder what ChatGPT really thinks?", and I think that sentence makes no sense at all, because of what an LLM is.
The predicting next token thing is the output channel. Strictly logically speaking, this is independent of agenty-ness of the neural network. You can have anything, from a single rule-based table looking only at the previous token to a superintelligent agent, predicting the next token.
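As a toy illustration of that point, here is a minimal sketch (made-up classes, nothing to do with any real model) of two very different systems exposing exactly the same next-token interface:

```python
# The "distribution over the next token" interface says nothing about what
# sits behind it: a lookup table and an arbitrarily complex system fit equally well.

from collections import Counter
from typing import Dict, List

class BigramTable:
    """A dumb rule-based predictor that looks only at the previous token."""
    def __init__(self, corpus: List[str]) -> None:
        self.counts: Dict[str, Counter] = {}
        for prev, nxt in zip(corpus, corpus[1:]):
            self.counts.setdefault(prev, Counter())[nxt] += 1

    def next_token_distribution(self, context: List[str]) -> Dict[str, float]:
        counts = self.counts.get(context[-1], Counter({"<unk>": 1}))
        total = sum(counts.values())
        return {token: n / total for token, n in counts.items()}

class ArbitrarilyComplexPredictor:
    """Placeholder for anything else behind the same interface; in principle the
    internal computation could be as elaborate as a planning agent."""
    def next_token_distribution(self, context: List[str]) -> Dict[str, float]:
        raise NotImplementedError("arbitrary internal computation goes here")

corpus = "the cat sat on the mat and the cat slept".split()
print(BigramTable(corpus).next_token_distribution(["on", "the"]))  # roughly {'cat': 0.67, 'mat': 0.33}
```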
I’m not saying ChatGPT has thoughts or is sentient, but I’m saying that it trying to predict the next token doesn’t logically preclude either. If you lock me into a room and give me only a single output channel in which I can give probability distributions over the next token, and only a single input channel in which I can read text, then I will be an agent trying to predict the next token, and I will be sentient and have thoughts.
Plus, the comment you’re responding to gave an example of how you can use token prediction specifically to build other AIs. (You responded to the third paragraph, but not the second.)
Also, welcome to the forum!
Thank you,
I agree with your reasoning strictly logically speaking, but it seems to me that an LLM cannot be sentient or have thoughts, even theoretically, and the burden of proof seems to be strongly on the side of someone who would make the opposite claim.
And for someone who does not know what an LLM is, it is of course easy to anthropomorphize the LLM for obvious reasons (it can be designed to sound sentient or to express ‘thoughts’), and my feeling is that this post was a little bit about that.
Overall, I find the arguments I received after my first comment more convincing at making me feel what the problem could be than the original post itself.
As for the possibility of an LLM accelerating scientific progress towards agentic AI, I am skeptical, but I may be lacking imagination.
And again, nothing in the examples presented in the original post relates to this risk. It seems that the people who are worried are mostly trying to find examples where the “character” of the AI is strange (which in my opinion are mistaken worries due to anthropomorphization of the AI), rather than examples where the AI is particularly “capable” in terms of generating powerful reasoning or impressive “new ideas” (maybe also because at this stage the best LLMs are far from being there).
I agree with your reasoning strictly logically speaking, but it seems to me that an LLM cannot be sentient or have thoughts, even theoretically,
This seems not-obvious—ChatGPT is a neural network, and most philosophers and AI people do think that neural networks can be conscious if they run the right algorithm. (The fact that it’s a language model doesn’t seem very relevant here for the same reason as before; it’s just a statement about its final layer.)
(maybe also because at this stage the best LLMs are far from being there).
I think the most important question is about where on a reasoning-capability scale you would put
1. GPT-2
2. ChatGPT/Bing
3. human-level intelligence
Opinions on this vary widely even between well informed people. E.g., if you think (1) is a 10, (2) an 11, and (3) a 100, you wouldn’t be worried. But if it’s 10 → 20 → 50, that’s a different story. I think it’s easy to underestimate how different other people’s intuitions are from yours. But depending on your intuitions, you could consider the dog thing as an example that Bing is capable of “powerful reasoning”.
I think that the “most” in the sentence “most philosophers and AI people do think that neural networks can be conscious if they run the right algorithm” is an overstatement, though I do not know to what extent.
I have no strong view on that, primarily because I think I lack some deep ML knowledge (I would weigh the view of ML experts far more than the view of philosophers on this topic).
Anyway, even accepting that neural networks can be conscious with the right algorithm, I think I disagree about “the fact that it’s a language model doesn’t seem relevant”. In an LLM, language is not only the final layer; there is also the fact that the aim of the algorithm is p(next words), so it is a specific kind of algorithm. My feeling is that a p(next words) algorithm cannot be sentient, and I think that most ML researchers would agree with that, though I am not sure.
I am also not sure about the “reasoning-capability” scale: even if an LLM is very close to human for most parts of a conversation, or better than human at some specific tasks (e.g. writing summaries), that would not mean that it is close to making a scientific breakthrough (on that I basically agree with the comments of AcurB a few posts above).
I think that the “most” in the sentence “most philosophers and AI people do think that neural networks can be conscious if they run the right algorithm” is an overstatement, though I do not know to what extent.
It is probably an overstatement. At least among philosophers in the 2020 Philpapers survey, most of the relevant questions would put that at a large but sub-majority position: 52% embrace physicalism (which is probably an upper bound); 54% say uploading = death; and 39% “Accept or lean towards: future AI systems [can be conscious]”. So, it would be very hard to say that ‘most philosophers’ in this survey would endorse an artificial neural network with an appropriate scale/algorithm being conscious.
I know I said the intelligence scale is the crux, but now I think the real crux is what you said here:
In an LLM, language is not only the final layer; there is also the fact that the aim of the algorithm is p(next words), so it is a specific kind of algorithm. My feeling is that a p(next words) algorithm cannot be sentient, and I think that most ML researchers would agree with that, though I am not sure.
Can you explain why you believe this? How does the output/training signal restrict the kind of algorithm that generates it? I feel like if you have novel thoughts, people here would be very interested in those, because most of them think we just don’t understand what happens inside the network at all, and that it could totally be an agent. (A mesa optimizer to use the technical term; an optimizer that appears as a result of gradient descent tweaking the model.)
The consciousness thing in particular is perhaps less relevant than functional restrictions.
There is a hypothetical example of simulating a ridiculous number of humans typing text and checking, among those simulated people whose typing so far matches the current text, what fraction type each possible next token. In the limit, this approaches the best possible text predictor. This would simulate a lot of consciousness.
>If you get strongly superhuman LLMs, you can trivially accelerate scientific progress on agentic forms of AI like Reinforcement Learning by asking it to predict continuations of the most cited AI articles of 2024, 2025, etc.
A question that might be at the heart of the issue is what is needed for an AI to produce genuinely new insights. As a layman, I can see how an LM might become even better at generating human-like text, might become super-duper good at remixing and rephrasing things it has “read” before, but hit a wall when it comes to reaching AGI. Maybe to get genuine intelligence we need more than a “predict-next-token kind of algorithm + obscene amounts of compute and human data”, and to mimic more closely how actual people think instead?
Perhaps local AI alarmists (it’s not a pejorative, I hope? OP does declare alarm, though) would like to try to persuade me otherwise, be it in their own words or by doing their best to hide condescension and pointing me to the numerous places where this idea was discussed before?
Maybe to get genuine intelligence we need more than a “predict-next-token kind of algorithm + obscene amounts of compute and human data”, and to mimic more closely how actual people think instead?
That would be quite fortunate, and I really, really hope that this is the case, but scientific articles are part of the human-like text that the model can be trained to predict. You can ask Bing AI to write you a poem, you can ask its opinion on new questions that it has never seen before, and you will get back coherent answers that were not in its dataset. The bitter lesson of generative image models and LLMs in the past few years is that creativity requires less special sauce than we might think. I don’t see a strong fundamental barrier to extending the sort of creativity ChatGPT exhibits right now to writing math & ML papers.
Does this analogy work, though?
It makes sense that you can get brand new sentences or brand new images that can even serve some purpose using ML, but is it creativity? That raises the question of what creativity is in the first place, and that’s a whole new can of worms. You give me an example of how Bing can write poems that were not in the dataset, but poem writing is a task that can be quite straightforwardly formalized, like a collection of lines which end on alternating syllables or something, and “write me a poem about sunshine and butterflies” is clearly a vastly easier prompt than “give me a theory of everything”. The resulting poem might be called creative if interpreted generously, but actual, novel scientific knowledge is a whole other level of creative, so much so that we should likely put these things in different conceptual boxes.
Maybe that’s just a failure of imagination on my part? I do admit that I, likewise, just really want it to be true, so there’s that.
It’s hard to tell exactly where our models differ just from that, sorry. https://www.youtube.com/@RobertMilesAI has some nice short introductions to a lot of the relevant concepts, but even that would take a while to get through, so I’m going to throw out some relevant concepts that have a chance of being the crux.
Do you know what a mesa optimizer is?
Are you familiar with the Orthogonality Thesis?
Instrumental convergence?
Oracle AIs?
Narrow vs General intelligence (generality)?
And how general do you think a large language model is?
Not the author of the grandparent comment, but I figured I’d take a shot at some of these, since it seems mildly interesting, and could lead to a potential exchange of gears. (You can feel free not to reply to me if you’re not interested, of course.)
Yes. Informally, it’s when a demon spawns inside of the system and possesses it to do great evil. (Note that I’m using intentionally magical terminology here to make it clear that these are models of phenomena that aren’t well-understood—akin to labeling uncharted sections of a map with “Here Be Dragons”.)
My current model of LLMs doesn’t prohibit demon possession, but only in the sense that my model of (sufficiently large) feedforward NNs, CNNs, RNNs, etc. also doesn’t prohibit demon possession. These are all Turing-complete architectures, meaning they’re capable in principle of hosting demons (since demons are, after all, computable), so it’d be unreasonable of me to suggest that a sufficiently large such model, trained on a sufficiently complex reward function and using a sufficiently complex training corpus, could not give rise to demons.
In practice, however, I have my doubts that the current way in which Transformers are structured and trained is powerful enough to spawn a demon. If this is a relevant crux, we may wish to go into further specifics of demon summoning and how it might be attempted.
Yes. In essence, this says that demons don’t have your best interests at heart by default, because they’re demons. (Angels—which do have your best interests at heart—do exist, but they’re rare, and none of the ways we know of to summon supernatural entities—hypothetical or actual—come equipped with any kind of in-built mechanism for distinguishing between the two—meaning that anything we end up summoning will, with high probability, be a demon.)
I don’t consider this a likely crux for us, though obviously it may well be for the grandparent author.
Yes. Informally, this says that demons are smart, not stupid, and smart demons trying to advance their own interests generally end up killing you, because why wouldn’t they.
This is where you take your intended vessel for demon summoning, chop off their arms and legs so that they can’t move, blind them so that they can’t see, and lobotomize them so that they can’t think. (The idea is that, by leaving their ears and mouth untouched, you can still converse with them, without allowing them to do much of anything else.) Then, you summon a demon into them, and hope that the stuff you did to them beforehand prevents the demon from taking them over.
(Also, even though the arms, legs, and eyes were straightforward, you have to admit—if only to yourself—that you don’t really understand how the lobotomy part worked.)
This could be a crux between us, depending on what exactly your position is. My current take is that it seems really absurd to think we can keep out sufficiently powerful demons through means like these, especially when we don’t really understand how the possession happens, and how our “precautions” interface with this process.
(I mean, technically there’s a really easy foolproof way to keep out sufficiently powerful demons, which is to not go through with the summoning. But demons make you lots of money, or so I hear—so that’s off the table, obviously.)
This one’s interesting. There are a lot of definitional disputes surrounding this particular topic, and on the whole, I’d say it’s actually less clear these days what a given person is talking about when they say “narrow versus general” than it was a ~decade ago. I’m not fully convinced that the attempted conceptual divide here is at all useful, but if I were to try and make it work:
Demons are general intelligences, full stop. That’s part of what it means to be a demon. (You might feel the need to talk about “fully-fledged” versus “nascent” demons here, to add some nuance—but I actually think the concept is mostly useful if we limit ourselves to talking about the strongest possible version of it, since that’s the version we’re actually worried about getting killed by.)
Anyway, in my conception of demons: they’re fully intelligent and general agents, with goals of their own—hence why the smart ones usually try to kill you. That makes them general intelligences. Narrow intelligences, on the other hand, don’t qualify as demons; they’re something else entirely, a different species, more akin to plants or insects than to more complicated agents. They might be highly capable in a specific domain or set of domains, but they achieve this through specialization rather than through “intelligence” in any real sense, much in the same way that a fruit fly is specialized for being very good at avoiding your fly-swatter.
This makes the term “narrow intelligence” somewhat of a misnomer, since it suggests some level of similarity or even continuity between “narrow” and general intelligences. In my model, this is not the case: demons and non-demons are different species—full stop. If you think this is a relevant crux, we can try and double-click on some of these concepts to expand them.
and how general do you think a large language model is?
I think this question is basically answered above, in the first section about demons? To reiterate: I think the Transformer architecture is definitely capable of playing host to a demon—but this isn’t actually a concession in any strong sense, in my view: any Turing-complete architecture can host a demon, and the relevant question is how easy it is to summon a demon into that architecture. And again, currently I don’t see the training and structure of Transformer-based LLMs as conducive to demon-summoning.
OK, mesa optimizers and generality seem to be points of disagreement then.
Demons are general intelligences, full stop. That’s part of what it means to be a demon.
I think your concept of “demons” is pointing to something useful, but I also think that definition is more specific than the meaning of “mesa optimizer”. A chess engine is an optimizer, but it’s not general. Optimizers need not be general; therefore, they need not be demons, and I think we have examples of such mesa optimizers already (they’re not hypothetical), even if no-one has managed to summon a demon yet.
I see mesa optimization as a generalization of Goodhart’s Law. Any time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
This makes the term “narrow intelligence” somewhat of a misnomer, since it suggests some level of similarity or even continuity between “narrow” and general intelligences. In my model, this is not the case: demons and non-demons are different species—full stop.
[...] I’d say it’s actually less clear these days what a given person is talking about when they say “narrow versus general” than it was a ~decade ago.
I think there are degrees of generality, rather than absolute categories. They’re fuzzy sets. Deep blue can only play chess. It can’t even do an “easier” task like run your thermostat. It’s very narrow. AlphaZero can learn to play chess or go or shougi. More domains, more general. GPT-3 can also play chess, you just have to frame it as a text-completion task, using e.g. Portable Game Notation. The human language domain is general enough to play chess, even if it’s not as good at chess as AlphaZero is. More domains, or broader domains containing more subdomains, means more generality.
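For instance, the PGN framing might look something like this; llm_complete() below is just a stub standing in for any text-completion model, not a real API:

```python
# Framing chess as text completion: give a PGN prefix, ask for the continuation.

def llm_complete(prompt: str) -> str:
    """Stub standing in for a text-completion model."""
    return "Ba4"  # a standard reply at this point in the Ruy Lopez opening

pgn_prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. "
print(pgn_prefix + llm_complete(pgn_prefix))  # -> 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4
```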
Thanks for responding! I think there are some key conceptual differences between us that need to be worked out/clarified—so here goes nothing:
(content warning: long)
I think your concept of “demons” is pointing to something useful, but I also think that definition is more specific than the meaning of “mesa optimizer”. A chess engine is an optimizer, but it’s not general. Optimizers need not be general; therefore, they need not be demons, and I think we have examples of such mesa optimizers already (they’re not hypothetical), even if no-one has managed to summon a demon yet.
The main thing I’m concerned about from mesa-optimizers (and hence the reason I think the attendant concept is useful) is that their presence is likely to lead to a treacherous turn—what I inside the demon metaphor referred to as “possession”, because from the outside it really does look like your nice little system that’s chugging along, helpfully trying to optimize the thing you wanted it to optimize, just suddenly gets Taken Over From Within by a strange and inscrutable (and malevolent) entity.
On this view, I don’t see it as particularly useful to weaken the term to encompass other types of optimization. This is essentially the point I was trying to make in the parenthetical remark included directly after the sentences you quoted:
(You might feel the need to talk about “fully-fledged” versus “nascent” demons here, to add some nuance—but I actually think the concept is mostly useful if we limit ourselves to talking about the strongest possible version of it, since that’s the version we’re actually worried about getting killed by.)
Of course, other types of optimizers do exist, and can be non-general, e.g. I fully accept your chess engine example as a valid type of optimizer. But my model is that these kinds of optimizers are (as a consequence of their non-generality) brittle: they spring into existence fully formed (because they were hardcoded by other, more general intelligences—in this case humans), and there is no incremental path to a chess engine that results from taking a non-chess engine and mutating it repeatedly according to some performance rule. Nor, for that matter, is there an incremental path continuing onward from a (classical) chess engine, through which it might mutate into something better, like AlphaZero.
(Aside: note that AlphaZero itself is not something I view as any kind of “general” system; you could argue it’s more general than a classical chess engine, but only if you view generality as a varying quantity, rather than as a binary—and I’ve already expressed that I’m not hugely fond of that view. But more on that later.)
In any case, hopefully I’ve managed to convey a sense in which these systems (and the things and ways they optimize) can be viewed as islands in the design space of possible architectures. And this is important in my view, because what this means is that you should not (by default) expect naturally arising mesa-optimizers to resemble these “non-general” optimizers. I expect any natural category of mesa-optimizers—that is to say, a category with its boundaries drawn to cleave at the joints of reality—to essentially look like it contains a bunch of demons, and excludes everything else.
TL;DR: Chess engines are non-general optimizers, but they’re not mesa-optimizers; and the fact that you could only come up with an example of the former and not the latter is not a coincidence but a reflection of a deeper truth. Of course, this previous statement could be falsified by providing an example of a non-general mesa-optimizer, and a good argument as to why it should be regarded as a mesa-optimizer.
That segues fairly nicely into the next (related) point, which is, essentially: what is a mesa-optimizer? Let’s look at what you have to say about it:
I see mesa optimization as a generalization of Goodhart’s Law. Any time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
It shouldn’t come as a huge surprise at this point, but I don’t view this as a very useful way to draw the boundary. I’m not even sure it’s correct, for that matter—the quoted passage reads like you’re talking about outer misalignment (a mismatch between the system’s outer optimization target—its so-called “base objective”—and its creators’ real target), whereas I’m reasonably certain mesa-optimization is much better thought of as a type of inner misalignment (a mismatch between the system’s base objective and whatever objective it ends up representing internally, and pursuing behaviorally).
Given this, it’s plausible to me that when you earlier say you “think we have examples of such mesa optimizers already”, you’re referring specifically to modes of misbehavior I’d class as outer alignment failures (and hence not mesa-optimization at all). But that’s just speculation on my part, and on the whole this seems like a topic that would benefit more from being double-clicked and expanded than from my continuing to speculate on what exactly you might have meant.
In any case, my preferred take on what a mesa-optimizer “really” is would be something like: a system should be considered to contain a mesa-optimizer in precisely those cases where modeling it as consisting of a second optimizer with a different objective buys you more explanatory power than modeling it as a single optimizer whose objective happens to not be the one you wanted. Or, in more evocative terms: mesa-optimization shows up whenever the system gets possessed by—you guessed it—a demon.
And in this frame, I think it’s fair to say that we don’t have any real examples of mesa-optimization. We might have some examples of outer alignment failure, and perhaps even some examples of inner alignment failure (though I’d be wary on that front; many of those are actually outer alignment failures in disguise). But we certainly don’t have any examples of behavior where it makes sense to say, “See! You’ve got a second optimizer inside of your first one, and it’s doing stuff on its own!” which is what it would take to satisfy the definition I gave.
(And yes, going by this definition, I think it’s plausible that we won’t see “real” mesa-optimizers until very close to The End. And yes, this is very bad news if true, since it means that it’s going to be very difficult to experiment safely with “toy” mesa-optimizers, and come away with any real insight. I never said my model didn’t have bleak implications!)
Lastly:
I think there are degrees of generality, rather than absolute categories. They’re fuzzy sets. Deep blue can only play chess. It can’t even do an “easier” task like run your thermostat. It’s very narrow. AlphaZero can learn to play chess or go or shougi. More domains, more general. GPT-3 can also play chess, you just have to frame it as a text-completion task, using e.g. Portable Game Notation. The human language domain is general enough to play chess, even if it’s not as good at chess as AlphaZero is. More domains, or broader domains containing more subdomains, means more generality.
I agree that this is a way to think about generality—but as with mesa-optimization, I disagree that it’s a good way to think about generality. The problem here is that you’re looking at the systems in terms of their outer behavior—”how many separate domains does this system appear to be able to navigate?”, “how broad a range does its behavior seem to span?”, etc.—when what matters, on my model, is the internal structure of the systems in question.
(I mean, yes, what ultimately matters is the outer behavior; you care if a system wakes up and kills you. But understanding the internal structure constrains our expectations about the system’s outer behavior, in a way that simply counting the number of disparate “domains” it has under its belt doesn’t.)
As an objection, I think this basically rhymes with my earlier objection about mesa-optimizers, and when it’s most useful to model a system as containing one. You might notice that the definition I gave also seems to hang pretty heavily on the system’s internals—not completely, since I was careful to say things like “is usefully modeled as” and “buys you more explanatory power”—but overall, it seems like a definition oriented towards an internal classification of whether mesa-optimizers (“demons”) are present, rather than a cruder external metric (“it’s not doing the thing we told it to do!”).
And so, my preferred framework under which to think about generality (under which none of our current systems, including—again—AlphaZero, which I mentioned way earlier in this comment, count as truly “general”) is basically what I sketched out in my previous reply:
Narrow intelligences, on the other hand, don’t qualify as demons; they’re something else entirely, a different species, more akin to plants or insects than to more complicated agents. They might be highly capable in a specific domain or set of domains, but they achieve this through specialization rather than through “intelligence” in any real sense, much in the same way that a fruit fly is specialized for being very good at avoiding your fly-swatter.
AlphaZero, as a system, contains a whole lot of specialized architecture for two-player games with a discretized state space and action space. It turns out that multiple board games fall under this categorization, making it a larger category than, like, “just chess” or “just shogi” or something. But that’s just a consequence of the size of the category; the algorithm itself is still specialized, and consequently (this is the crucial part, from my perspective) forms an island in design space.
I referenced this earlier, and I think it’s relevant here as well: there’s no continuous path in design space from AlphaZero to GPT-[anything]; nor was there an incremental design path from Stockfish 8 to AlphaZero. They’re different systems, each of which was individually designed and implemented by very smart people. But the seeming increase in “generality” of these systems is not due to any kind of internal “progression”, of the kind that might be found in e.g. a truly general system undergoing takeoff; instead, it’s a progression of discoveries by those very smart people: impressive, but not fundamentally different in kind from the progression of (say) heavier-than-air flight, which also consisted of a series of disparate but improving designs, none of which were connected to each other via incremental paths in design space.
(Here, I’m grouping together designs into “families”, where two designs that are basically variants of each other in size are considered the same design. I think that’s fair, since this is the case with the various GPT models as well.)
And this matters because (on my model) the danger from AGI that I see does not come from this kind of progression of design. If we were somehow assured that all further progress in AI would continue to look like this kind of progress, that would massively drop my P(doom) estimates (to, like, <0.01 levels). The reason AGI is different, the reason it constitutes (on my view) an existential risk, is precisely because artificial general intelligence is different from artificial narrow intelligence—not just in degree, but in kind.
(Lots to double-click on here, but this is getting stupidly long even for a LW comment, so I’m going to stop indulging the urge to preemptively double-click and expand everything for you, since that’s flatly impossible, and let you pick and choose where to poke at my model. Hope this helps!)
(content warning: long)
[...] let you pick and choose where to poke at my model.
You’ll forgive me if I end up writing multiple separate responses then.
TL;DR: Chess engines are non-general optimizers, but they’re not mesa-optimizers; and the fact that you could only come up with an example of the former and not the latter is not a coincidence but a reflection of a deeper truth. Of course, this previous statement could be falsified by providing an example of a non-general mesa-optimizer, and a good argument as to why it should be regarded as a mesa-optimizer.
It’s not that I couldn’t come up with examples, but more like I didn’t have time to write a longer comment just then. Are these not examples? What about godshatter?
The terms I first enumerated have specific meaning not coined by you or me, and I am trying to use them in the standard way. Now, it’s possible that I don’t understand the definitions correctly, but I think I do, and I think your definition for (at least) “mesa optimizer” is not the standard one. If you know this and just don’t like the standard definitions (because they are “not useful”), that’s fine, define your own terms, but call them something else, rather than changing them out from under me.
Specifically, I was going off the usage here. Does that match your understanding?
quoted passage reads like you’re talking about outer misalignment (a mismatch between the system’s outer optimization target—its so-called “base objective”—and its creators’ real target), whereas I’m reasonably certain mesa-optimization is much better thought of as a type of inner misalignment (a mismatch between the system’s base objective and whatever objective it ends up representing internally, and pursuing behaviorally).
I was specifically talking about inner alignment, where the mesa objective is a proxy measure for the base objective. But I can see how Goodhart’s law could apply to outer alignment too, come to think of it: if you fail to specify your real goal and instead specify a proxy.
It’s not that I couldn’t come up with examples, but more like I didn’t have time to write a longer comment just then. Are these not examples? What about godshatter?
I agree that “godshatter” is an example of a misaligned mesa-optimizer with respect to evolution’s base objective (inclusive genetic fitness). But note specifically that my argument was that there are no naturally occurring non-general mesa-optimizers, which category humans certainly don’t fit into. (I mean, you can look right at the passage you quoted; the phrase “non-general” is right there in the paragraph.)
In fact, I think humans’ status as general intelligences supports the argument I made, by acting as (moderately weak) evidence that naturally occurring mesa-optimizers do, in fact, exhibit high amounts of generality and agency (demonic-ness, you could say).
(If you wanted to poke at my model harder, you could ask about animals, or other organisms in general, and whether they count as mesa-optimizers. I’d argue that the answer depends on the animal, but that for many animals my answer would in fact be “no”—and even those for whom my answer is “yes” would obviously be nowhere near as powerful as humans in terms of optimization strength.)
As for the Rob Miles video: I mostly see those as outer alignment failures, despite the video name. (Remember, I did say in my previous comment that on my model, many outer alignment failures can masquerade as inner alignment failures!) To comment on the specific examples mentioned in the video:
The agent-in-the-maze examples strike me as a textbook instance of outer misalignment: the reward function by itself was not sufficient to distinguish correct behavior from incorrect behavior. It’s possible to paint this instead as inner misalignment, but only by essentially asserting, flat-out, that the reward function was correct, and the system simply generalized incorrectly. I confess I don’t really see strong reason to favor the latter characterization over the former, while I do see some reason for the converse.
The coin run example, meanwhile, makes a stronger case for being an inner alignment failure, mainly because of the fact that many possible forms of outer misalignment were ruled out via interpretability. The agent was observed to assign appropriately negative values to obstacles, and appropriately positive values to the coin. And while it’s still possible to make the argument that the training procedure failed to properly incentivize learning the correct objective, this is a much weaker claim, and somewhat question-begging.
And, of course, neither of these are examples of mesa-optimization in my view, because mesa-optimization is not synonymous with inner misalignment. From the original post on risks from learned optimization:
There need not always be a mesa-objective since the algorithm found by the base optimizer will not always be performing optimization. Thus, in the general case, we will refer to the model generated by the base optimizer as a learned algorithm, which may or may not be a mesa-optimizer.
And the main issue with these examples is that they occur in toy environments which are simply too… well, simple to produce algorithms usefully characterized as optimizers in their own right, outside of the extremely weak sense in which your thermostat is also an optimizer. (And, like—yes, in a certain sense it is, but that’s not a very high bar to meet; it’s not even at the level of the chess engine example you gave!)
The terms I first enumerated have specific meaning not coined by you or me, and I am trying to use them in the standard way. Now, it’s possible that I don’t understand the definitions correctly, but I think I do, and I think your definition for (at least) “mesa optimizer” is not the standard one. If you know this and just don’t like the standard definitions (because they are “not useful”), that’s fine, define your own terms, but call them something else, rather than changing them out from under me.
Specifically, I was going off the usage here. Does that match your understanding?
The usage in that video is based on the definition given by the authors of the linked post, who coined the term to begin with—which is to say, yes, I agree with it. And I already discussed above why this definition does not mean that literally any learned algorithm is a mesa-optimizer (and if it did, so much the worse for the definition)!
(Meta: I generally don’t consider it particularly useful to appeal to the origin of terms as a way to justify their use. In this specific case, it’s fine, since I don’t believe my usage conflicts with the original definition given. But even if you think I’m getting the definitions wrong, it’s more useful, from my perspective, if you explain to me why you think my usage doesn’t accord with the standard definitions. Presumably you yourself have specific reasons for thinking that the examples or arguments I give don’t sound quite right, right? If so, I’d petition you to elaborate on that directly! That seems to me like it would have a much better chance of locating our real disagreement. After all, when two people disagree, the root of that disagreement is usually significantly downstream of where it first appears—and I’ll thank you not to immediately assume that our source of disagreement is located somewhere as shallow as “one of us is misremembering/misunderstanding the definitions of terms”.)
I was specifically talking about inner alignment, where the mesa objective is a proxy measure for the base objective. But I can see how Goodhart’s law could apply to outer alignment too, come to think of it: if you fail to specify your real goal and instead specify a proxy.
This doesn’t sound right to me? To refer back to your quoted statement:
I see mesa optimization as a generalization of Goodhart’s Law. Any time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
I’ve bolded [what seem to me to be] the operative parts of that statement. I can easily see a way to map this description onto a description of outer alignment failure:
the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.
where the mapping in question goes: proxy → base objective, real target → intended goal. Conversely, I don’t see an equally obvious mapping from that description to a description of inner misalignment, because that (as I described in my previous comment) is a mismatch between the system’s base and mesa-objectives (the latter of which it ends up behaviorally optimizing).
I’d appreciate it if you could explain to me what exactly you’re seeing here that I’m not, because at present, my best guess is that you’re not familiar with these terms (which I acknowledge isn’t a good guess, for basically the reasons I laid out in my “Meta:” note earlier).
Yeah, I don’t think that interpretation is what I was trying to get across. I’ll try to clean it up to clarify:
I see [the] mesa optimization [problem (i.e. inner alignment)] as a generalization of Goodhart’s Law[, which is that a]ny time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
Not helping? I did not mean to imply that a mesa optimizer is necessarily misaligned or learns the wrong goal; it’s just hard to ensure that it learns the base one.
Goodhart’s law is usually stated as “When a measure becomes a target, it ceases to be a good measure”, which I would interpret more succinctly as “proxies get gamed”.
More concretely, from the Wikipedia article,
For example, if an employee is rewarded by the number of cars sold each month, they will try to sell more cars, even at a loss.
Then the analogy would go like this. The desired target (base goal) was “profits”, but the proxy chosen to measure that goal was “number of cars sold”. Under normal conditions, this would work. The proxy is in the direction of the target. That’s why it’s a proxy. But if you optimize the proxy too hard, you blow past the base goal and hit the proxy itself instead. The outer system (optimizer) is the company. It’s trying to optimize the employees. The inner system (optimizer) is the employee, who tries to maximize his own reward. The employee “learned” the wrong (mesa) goal “sell as many cars as possible (at any cost)”, which is not aligned with the base goal of “profits”.
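As a toy numerical version of that analogy (every number below is invented purely for illustration):

```python
# Goodhart illustration: "cars sold" is a proxy for "profit". Pushing the proxy
# harder (deeper discounts) keeps raising sales, but past some point it starts
# destroying the real target.

def cars_sold(discount: float) -> float:
    return 100 * (1 + 4 * discount)               # more discount -> more cars sold

def profit(discount: float) -> float:
    margin_per_car = 2000 * (1 - 2 * discount)    # more discount -> thinner margin
    return cars_sold(discount) * margin_per_car

for d in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"discount={d:.1f}  cars_sold={cars_sold(d):5.0f}  profit={profit(d):9.0f}")
# The proxy rises monotonically; the real target peaks near a 12.5% discount and then falls.
```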
No links, because no one in all of existence currently understands what the heck is going on inside of LLMs—which, of course, is just another way of saying that it’s pretty unreasonable to assign a high probability to your personal guesses about what the thing that LLMs do—whether you call that “predicting the most probable next word” or “reasoning about the world”—will or will not scale to.
Which, itself, is just a rephrase of the classic rationalist question: what do you think you know, and why do you think you know it?
(For what it’s worth, by the way, I actually share your intuition that current LLM architectures lack some crucial features that are necessary for “true” general intelligence. But this intuition isn’t very strongly held, considering how many times LLM progress has managed to surprise me already.)
I am new to this website. I am also not a english native speaker so pardon me in advance. I am very sorry if it is considered as rude on this forum to not starting by a post for introducing ourselves.
I am here because I am curious about the AI safety thing, and I do have a (light) ML background (though more in my studies than in my job). I have read this forum and adjacent ones for some weeks now but despite all the posts I read I have failed so far to have a strong opinion on p(doom). It is quite frustrating to be honest and I would like to have one.
I just cannot resist to react to this post, because my prior (very strong prior, 99%), is that chat GPT 3, 4 , or even 100, is not and cannot be and will not be, agentic or worrying, because at the end it is just a LLM predicting the most probable next words.
The impression that I have is that the author of this post does not understand what a LLM is, but I give a 5% probability that on the contrary he understands something that I do not get at all.
For me, no matter how ´smart’ the result looks like, anthropomorphizing the LLM and worry about it is a mistake.
I would really appreciate if someone can send me a link to help me understand why I may be wrong.
In addition to what the other comments are saying:
If you get strongly superhuman LLMs, you can trivially accelerate scientific progress on agentic forms of AI like Reinforcement Learning by asking it to predict continuations of the most cited AI articles of 2024, 2025, etc. (have the year of publication, citation number and journal of publication as part of the prompt). Hence at the very least superhuman LLMs enable the quick construction strong agentic AIs.
Second, the people who are building Bing Chat are really looking for ways to make it as agentic as possible, it’s already searching the internet, it’s gonna be integrated inside the Edge browser soon, and I’d bet that a significant research effort is going into making it interact with the various APIs available over the internet. All economic and research interests are pushing towards making it as agentic as possible.
Agree, and I would add, even if the oracle doesn’t accidentally spawn a demon that tries to escape on its own, someone could pretty easily turn it into an agent just by driving it with an external event loop.
I.e., ask it what a hypothetical agent would do (with, say, a text interface to the Internet) and then forward its queries and return the results to the oracle, repeat.
With public access, someone will eventually try this. The conversion barrier is just not that high. Just asking an otherwise passive oracle to imagine what an agent might do just about instantiates one. If said imagined agent is sufficiently intelligent, it might not take very many exchanges to do real harm or even FOOM, and if the loop is automated (say, a shell script) rather than a human driving each step manually, it could potentially do a lot of exchanges on a very short time scale, potentially making a somewhat less intelligent agent powerful enough to be dangerous.
I highly doubt the current Bing AI is yet smart enough to create an agent smart enough to be very dangerous (much less FOOM), but it is an oracle, with all that implies. It could be turned into an agent, and such an agent will almost certainly not be aligned. It would only be relatively harmless because it is relatively weak/stupid.
Update: See ChaosGPT and Auto-GPT.
Thank you for your answers.
Unfortunately I have to say that it did not help me so far to have a stronger feeling about ai safety.
(I feel very sympathetic with this post for example https://forum.effectivealtruism.org/posts/ST3JjsLdTBnaK46BD/how-i-failed-to-form-views-on-ai-safety-3 )
To rephrase, my prior is that LLM just predict next words (it is their only capability). I would be worried when a LLM does something else (though I think it cannot happen), that would be what I would call “misalignment”.
On the meantime , what I read a lot about people worrying about ChatGPT/Bing sounds just like anthropomorphizing the AI with the prior that it can be sentient, have “intents” / and to me it is just not right.
I am not sure to understand how having the ability to search the internet change dramatically that.
If a LLM, when p(next words) is too low, can ‴″decide‴′ to search the internet to have better inputs, I do not feel that it makes a change in what I say above.
I do not want to have a too long fruitless discussion, I think indeed that I have to continue to read some materials on AI safety to better understand what are your models , but at this stage to be honest I cannot help thinking that some comments or posts are made by people who lack some basic understanding about what a LLM is , which may result in anthropomorphizing AI more than it should be. It is very easy when you do not know what a LLM is to wonder for exemple “CHAT Gpt answered that, but he seems to say that to not hurt me, I wonder what does ChatGPT really think ?” and I typically think that this sentence makes no sense at all, because of what a LLM is.
The predicting next token thing is the output channel. Strictly logically speaking, this is independent of agenty-ness of the neural network. You can have anything, from a single rule-based table looking only at the previous token to a superintelligent agent, predicting the next token.
I’m not saying ChatGPT has thoughts or is sentient, but I’m saying that it trying to predict the next token doesn’t logically preclude either. If you lock me into a room and give me only a single output channel in which I can give probability distributions over the next token, and only a single input channel in which I can read text, then I will be an agent trying to predict the next token, and I will be sentient and have thoughts.
Plus, the comment you’re responding to gave an example of how you can use token prediction specifically to build other AIs. (You responded to the third paragraph, but not the second.)
Also, welcome to the forum!
Thank you,
I agree with your reasoning strictly logically speaking, but it seems to me that a LLM cannot be sentient or have thoughts, even theoritically, and the burden of proof seems strongly on the side of someone who would made opposite claims.
And for someone who do not know what is a LLM, it is of course easy to anthropomorphize the LLM for obvious reasons (it can be designed to sound sentient or to express ‘thoughts’), and it is my feeling that this post was a little bit about that.
Overall, I find the arguments that I received after my first comment more convincing in making me feel what could be the problem, than the original post.
As for the possibility of a LLM to accelerate scientific progress towards agentic AI, I am skeptical, but I may be lacking imagination.
And again, nothing in the exemples presented in the original post is related to this risk, It seems that people that are worried are more trying to find exemples where the “character” of the AI is strange (which in my opinion are mistaken worries due to anthropomorphization of the AI), rather than finding exemples where the AI is particularly “capable” in terms of generating powerful reasoning or impressive “new ideas” (maybe also because at this stage the best LLM are far from being there).
This seems not-obvious—ChatGPT is a neural network, and most philosophers and AI people do think that neural networks can be conscious if they run the right algorithm. (The fact that it’s a language model doesn’t seem very relevant here for the same reason as before; it’s just a statement about its final layer.)
I think the most important question is about where on a reasoning-capability scale you would put
GPT-2
ChatGPT/Bing
human-level intelligence
Opinions on this vary widely even between well informed people. E.g., if you think (1) is a 10, (2) an 11, and (3) a 100, you wouldn’t be worried. But if it’s 10 → 20 → 50, that’s a different story. I think it’s easy to underestimate how different other people’s intuitions are from yours. But depending on your intuitions, you could consider the dog thing as an example that Bing is capable of “powerful reasoning”.
I think that the “most” in the sentence “most philosophers and AI people do think that neurol networks can be conscious if they run the right algorithm” is an overstatement, though I do not know to what extent.
I have no strong view on that, primarly because I think I lack some deep ML knowledge (I would weigh far more the view of ML experts than the view of philosophers on this topic).
Anyway, even accepting that neural networks can be conscious with the right algorithm, I think I disagree about “the fact that it’s a language model doesn’t seem relevant”. In a LLM language is not only the final layer, you have also the fact that the aim of the algorithm is p(next words), so it is a specific kind of algorithms. My feeling is that a p(next words) algorithms cannot be sentient, and I think that most ML researchers would agree with that, though I am not sure.
I am also not sure about the “reasoning-capability” scale, even if a LLM is very close to human for most parts of conversations, or better than human for some specific tasks (i.e doing summaries, for exemple), that would not mean that it is close to do a scientific breakthrough (on that I basically agree with the comments of AcurB some posts above)
It is probably an overstatement. At least among philosophers in the 2020 Philpapers survey, most of the relevant questions would put that at a large but sub-majority position: 52% embrace physicalism (which is probably an upper bound); 54% say uploading = death; and 39% “Accept or lean towards: future AI systems [can be conscious]”. So, it would be very hard to say that ‘most philosophers’ in this survey would endorse an artificial neural network with an appropriate scale/algorithm being conscious.
I know I said the intelligence scale is the crux, but now I think the real crux is what you said here:
Can you explain why you believe this? How does the output/training signal restrict the kind of algorithm that generates it? I feel like if you have novel thoughts, people here would be very interested in those, because most of them think we just don’t understand what happens inside the network at all, and that it could totally be an agent. (A mesa optimizer to use the technical term; an optimizer that appears as a result of gradient descent tweaking the model.)
The consciousness thing in particular is perhaps less relevant than functional restrictions.
There is a hypothetical example of simulating a ridiculous number of humans typing text and seeing what fraction of those people that type out the current text type out each next token. In the limit, this approaches the best possible text predictor. This would simulate a lot of consciousness.
>If you get strongly superhuman LLMs, you can trivially accelerate scientific progress on agentic forms of AI like Reinforcement Learning by asking it to predict continuations of the most cited AI articles of 2024, 2025, etc.
Question that might be at the heart of the issue is what is needed for AI to produce genuinely new insights. As a layman, I see how LM might become even better at generating human-like text, might become super-duper good at remixing and rephrasing things it “read” before, but hit a wall when it comes to reaching AGI. Maybe to get genuine intelligence we need more than “predict-next-token kind of algorithm +obscene amounts of compute and human data” and mimic more closely how actual people think instead?
Perhaps local AI alarmists (it’s not a pejorative, I hope? OP does declare alarm, though) would like to try persuade me otherwise, be in in their own words or by doing their best to hide condescension and pointing me to numerous places where this idea was discussed before?
That would be quite fortunate, and I really really hope that this is case, but scientific articles are part of the human-like text that the model can be trained to predict. You can ask Bing AI to write you a poem, you can ask its opinion on new questions that it has never seen before, and you will get back coherent answers that were not in its dataset. The bitter lesson of Generative Image models and LLMs in the past few years is that creativity requires less special sauce than we might think. I don’t see a strong fundamental barrier to extending the sort of creativity chatGPT exhibits right now to writing math & ML papers.
Does this analogy work, though?
It makes sense that you can get brand new sentences or brand new images that can even serve some purpose using ML, but is it creativity? That raises the question of what creativity is in the first place, and that’s a whole new can of worms. You give me an example of how Bing can write poems that were not in the dataset, but poem writing is a task that can be quite straightforwardly formalized, like a collection of lines that end on alternating syllables or something, and “write me a poem about sunshine and butterflies” is clearly a vastly easier prompt than “give me a theory of everything”. The resulting poem might be called creative if interpreted generously, but actual, novel scientific knowledge is a whole other level of creative, so much so that we should likely put these things in different conceptual boxes.
Maybe that’s just a failure of imagination on my part? I do admit that I, likewise, just really want it to be true, so there’s that.
It’s hard to tell exactly where our models differ just from that, sorry. https://www.youtube.com/@RobertMilesAI has some nice short introductions to a lot of the relevant concepts, but even that would take a while to get through, so I’m going to throw out some relevant concepts that have a chance of being the crux.
Do you know what a mesa optimizer is?
Are you familiar with the Orthogonality Thesis?
Instrumental convergence?
Oracle AIs?
Narrow vs General intelligence (generality)?
And how general do you think a large language model is?
Not the author of the grandparent comment, but I figured I’d take a shot at some of these, since it seems mildly interesting, and could lead to a potential exchange of gears. (You can feel free not to reply to me if you’re not interested, of course.)
Yes. Informally, it’s when a demon spawns inside of the system and possesses it to do great evil. (Note that I’m using intentionally magical terminology here to make it clear that these are models of phenomena that aren’t well-understood—akin to labeling uncharted sections of a map with “Here Be Dragons”.)
My current model of LLMs doesn’t prohibit demon possession, but only in the sense that my model of (sufficiently large) feedforward NNs, CNNs, RNNs, etc. also doesn’t prohibit demon possession. These are all Turing-complete architectures, meaning they’re capable in principle of hosting demons (since demons are, after all, computable), so it’d be unreasonable of me to suggest that a sufficiently large such model, trained on a sufficiently complex reward function, and using a sufficiently complex training corpus, could not give rise to demons.
In practice, however, I have my doubts that the current way in which Transformers are structured and trained is powerful enough to spawn a demon. If this is a relevant crux, we may wish to go into further specifics of demon summoning and how it might be attempted.
Yes. In essence, this says that demons don’t have your best interests at heart by default, because they’re demons. (Angels—which do have your best interests at heart—do exist, but they’re rare, and none of the ways we know of to summon supernatural entities—hypothetical or actual—come equipped with any kind of in-built mechanism for distinguishing between the two—meaning that anything we end up summoning will, with high probability, be a demon.)
I don’t consider this a likely crux for us, though obviously it may well be for the grandparent author.
Yes. Informally, this says that demons are smart, not stupid, and smart demons trying to advance their own interests generally end up killing you, because why wouldn’t they.
I also don’t consider this a likely crux for us.
This is where you take your intended vessel for demon summoning, chop off their arms and legs so that they can’t move, blind them so that they can’t see, and lobotomize them so that they can’t think. (The idea is that, by leaving their ears and mouth untouched, you can still converse with them, without allowing them to do much of anything else.) Then, you summon a demon into them, and hope that the stuff you did to them beforehand prevents the demon from taking them over.
(Also, even though the arms, legs, and eyes were straightforward, you have to admit—if only to yourself—that you don’t really understand how the lobotomy part worked.)
This could be a crux between us, depending on what exactly your position is. My current take is that it seems really absurd to think we can keep out sufficiently powerful demons through means like these, especially when we don’t really understand how the possession happens, and how our “precautions” interface with this process.
(I mean, technically there’s a really easy foolproof way to keep out sufficiently powerful demons, which is to not go through with the summoning. But demons make you lots of money, or so I hear—so that’s off the table, obviously.)
This one’s interesting. There are a lot of definitional disputes surrounding this particular topic, and on the whole, I’d say it’s actually less clear these days what a given person is talking about when they say “narrow versus general” than it was a ~decade ago. I’m not fully convinced that the attempted conceptual divide here is at all useful, but if I were to try and make it work:
Demons are general intelligences, full stop. That’s part of what it means to be a demon. (You might feel the need to talk about “fully-fledged” versus “nascent” demons here, to add some nuance—but I actually think the concept is mostly useful if we limit ourselves to talking about the strongest possible version of it, since that’s the version we’re actually worried about getting killed by.)
Anyway, in my conception of demons: they’re fully intelligent and general agents, with goals of their own—hence why the smart ones usually try to kill you. That makes them general intelligences. Narrow intelligences, on the other hand, don’t qualify as demons; they’re something else entirely, a different species, more akin to plants or insects than to more complicated agents. They might be highly capable in a specific domain or set of domains, but they achieve this through specialization rather than through “intelligence” in any real sense, much in the same way that a fruit fly is specialized for being very good at avoiding your fly-swatter.
This makes the term “narrow intelligence” somewhat of a misnomer, since it suggests some level of similarity or even continuity between “narrow” and general intelligences. In my model, this is not the case: demons and non-demons are different species—full stop. If you think this is a relevant crux, we can try and double-click on some of these concepts to expand them.
I think this question is basically answered above, in the first section about demons? To reiterate: I think the Transformer architecture is definitely capable of playing host to a demon—but this isn’t actually a concession in any strong sense, in my view: any Turing-complete architecture can host a demon, and the relevant question is how easy it is to summon a demon into that architecture. And again, currently I don’t see the training and structure of Transformer-based LLMs as conducive to demon-summoning.
OK, mesa optimizers and generality seem to be points of disagreement then.
I think your concept of “demons” is pointing to something useful, but I also think that definition is more specific than the meaning of “mesa optimizer”. A chess engine is an optimizer, but it’s not general. Optimizers need not be general; therefore, they need not be demons, and I think we have examples of such mesa optimizers already (they’re not hypothetical), even if no-one has managed to summon a demon yet.
I see mesa optimization as a generalization of Goodhart’s Law. Any time you make a system optimize for a proxy measure instead of the real target, the proxy itself may become the goal of the inner system, even when overoptimizing it runs counter to hitting the real target.
I think there are degrees of generality, rather than absolute categories. They’re fuzzy sets. Deep Blue can only play chess. It can’t even do an “easier” task like run your thermostat. It’s very narrow. AlphaZero can learn to play chess or go or shogi. More domains, more general. GPT-3 can also play chess, you just have to frame it as a text-completion task, using e.g. Portable Game Notation. The human language domain is general enough to play chess, even if it’s not as good at chess as AlphaZero is. More domains, or broader domains containing more subdomains, means more generality.
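A minimal sketch of what “chess as a text-completion task” might look like, assuming the python-chess library and a hypothetical llm_complete function (a real model will sometimes propose illegal moves, which is why the legality check matters):

```python
import chess  # python-chess library


def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a language-model completion call."""
    raise NotImplementedError


def next_move(board: chess.Board, pgn_so_far: str) -> chess.Move:
    # Ask the model to continue the PGN move list and take its first token
    # as the proposed move in standard algebraic notation.
    candidate = llm_complete(pgn_so_far).split()[0]
    # parse_san raises an error if the proposed move is not legal in the
    # current position, which language models do get wrong on occasion.
    return board.parse_san(candidate)
```

The point is not that this plays well, only that the language-modeling interface is broad enough to contain chess as a subdomain.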
Thanks for responding! I think there are some key conceptual differences between us that need to be worked out/clarified—so here goes nothing:
(content warning: long)
The main thing I’m concerned about from mesa-optimizers (and hence the reason I think the attendant concept is useful) is that their presence is likely to lead to a treacherous turn—what I inside the demon metaphor referred to as “possession”, because from the outside it really does look like your nice little system that’s chugging along, helpfully trying to optimize the thing you wanted it to optimize, just suddenly gets Taken Over From Within by a strange and inscrutable (and malevolent) entity.
On this view, I don’t see it as particularly useful to weaken the term to encompass other types of optimization. This is essentially the point I was trying to make in the parenthetical remark included directly after the sentences you quoted:
Of course, other types of optimizers do exist, and can be non-general, e.g. I fully accept your chess engine example as a valid type of optimizer. But my model is that these kinds of optimizers are (as a consequence of their non-generality) brittle: they spring into existence fully formed (because they were hardcoded by other, more general intelligences—in this case humans), and there is no incremental path to a chess engine that results from taking a non-chess engine and mutating it repeatedly according to some performance rule. Nor, for that matter, is there an incremental path continuing onward from a (classical) chess engine, through which it might mutate into something better, like AlphaZero.
(Aside: note that AlphaZero itself is not something I view as any kind of “general” system; you could argue it’s more general than a classical chess engine, but only if you view generality as a varying quantity, rather than as a binary—and I’ve already expressed that I’m not hugely fond of that view. But more on that later.)
In any case, hopefully I’ve managed to convey a sense in which these systems (and the things and ways they optimize) can be viewed as islands in the design space of possible architectures. And this is important in my view, because what this means is that you should not (by default) expect naturally arising mesa-optimizers to resemble these “non-general” optimizers. I expect any natural category of mesa-optimizers—that is to say, any category with its boundaries drawn to cleave at the joints of reality—to essentially look like it contains a bunch of demons and excludes everything else.
TL;DR: Chess engines are non-general optimizers, but they’re not mesa-optimizers; and the fact that you could only come up with an example of the former and not the latter is not a coincidence but a reflection of a deeper truth. Of course, this previous statement could be falsified by providing an example of a non-general mesa-optimizer, and a good argument as to why it should be regarded as a mesa-optimizer.
That segues fairly nicely into the next (related) point, which is, essentially: what is a mesa-optimizer? Let’s look at what you have to say about it:
It shouldn’t come as a huge surprise at this point, but I don’t view this as a very useful way to draw the boundary. I’m not even sure it’s correct, for that matter—the quoted passage reads like you’re talking about outer misalignment (a mismatch between the system’s outer optimization target—its so-called “base objective”—and its creators’ real target), whereas I’m reasonably certain mesa-optimization is much better thought of as a type of inner misalignment (a mismatch between the system’s base objective and whatever objective it ends up representing internally, and pursuing behaviorally).
Given this, it’s plausible to me that when you earlier say you “think we have examples of such mesa optimizers already”, you’re referring specifically to modes of misbehavior I’d class as outer alignment failures (and hence not mesa-optimization at all). But that’s just speculation on my part, and on the whole this seems like a topic that would benefit more from being double-clicked and expanded than from my continuing to speculate on what exactly you might have meant.
In any case, my preferred take on what a mesa-optimizer “really” is would be something like: a system should be considered to contain a mesa-optimizer in precisely those cases where modeling it as consisting of a second optimizer with a different objective buys you more explanatory power than modeling it as a single optimizer whose objective happens to not be the one you wanted. Or, in more evocative terms: mesa-optimization shows up whenever the system gets possessed by—you guessed it—a demon.
And in this frame, I think it’s fair to say that we don’t have any real examples of mesa-optimization. We might have some examples of outer alignment failure, and perhaps even some examples of inner alignment failure (though I’d be wary on that front; many of those are actually outer alignment failures in disguise). But we certainly don’t have any examples of behavior where it makes sense to say, “See! You’ve got a second optimizer inside of your first one, and it’s doing stuff on its own!” which is what it would take to satisfy the definition I gave.
(And yes, going by this definition, I think it’s plausible that we won’t see “real” mesa-optimizers until very close to The End. And yes, this is very bad news if true, since it means that it’s going to be very difficult to experiment safely with “toy” mesa-optimizers, and come away with any real insight. I never said my model didn’t have bleak implications!)
Lastly:
I agree that this is a way to think about generality—but as with mesa-optimization, I disagree that it’s a good way to think about generality. The problem here is that you’re looking at the systems in terms of their outer behavior—“how many separate domains does this system appear to be able to navigate?”, “how broad a range does its behavior seem to span?”, etc.—when what matters, on my model, is the internal structure of the systems in question.
(I mean, yes, what ultimately matters is the outer behavior; you care if a system wakes up and kills you. But understanding the internal structure constrains our expectations about the system’s outer behavior, in a way that simply counting the number of disparate “domains” it has under its belt doesn’t.)
As an objection, I think this basically rhymes with my earlier objection about mesa-optimizers, and when it’s most useful to model a system as containing one. You might notice that the definition I gave also seems to hang pretty heavily on the system’s internals—not completely, since I was careful to say things like “is usefully modeled as” and “buys you more explanatory power”—but overall, it seems like a definition oriented towards an internal classification of whether mesa-optimizers (“demons”) are present, rather than a cruder external metric (“it’s not doing the thing we told it to do!”).
And so, my preferred framework under which to think about generality (under which none of our current systems, including—again—AlphaZero, which I mentioned way earlier in this comment, count as truly “general”) is basically what I sketched out in my previous reply:
AlphaZero, as a system, contains a whole lot of specialized architecture for two-player games with a discretized state space and action space. It turns out that multiple board games fall under this categorization, making it a larger category than, like, “just chess” or “just shogi” or something. But that’s just a consequence of the size of the category; the algorithm itself is still specialized, and consequently (this is the crucial part, from my perspective) forms an island in design space.
I referenced this earlier, and I think it’s relevant here as well: there’s no continuous path in design space from AlphaZero to GPT-[anything]; nor was there an incremental design path from Stockfish 8 to AlphaZero. They’re different systems, each of which was individually designed and implemented by very smart people. But the seeming increase in “generality” of these systems is not due to any internal “progression” of the kind that might be found in e.g. a truly general system undergoing takeoff; instead, it’s a progression of discoveries by those very smart people: impressive, but not fundamentally different in kind from the progression of (say) heavier-than-air flight, which also consisted of a series of disparate but improving designs, none of which were connected to each other via incremental paths in design space.
(Here, I’m grouping together designs into “families”, where two designs that are basically variants of each other in size are considered the same design. I think that’s fair, since this is the case with the various GPT models as well.)
And this matters because (on my model) the danger from AGI that I see does not come from this kind of progression of design. If we were somehow assured that all further progress in AI would continue to look like this kind of progress, that would massively drop my P(doom) estimates (to, like, <0.01 levels). The reason AGI is different, the reason it constitutes (on my view) an existential risk, is precisely because artificial general intelligence is different from artificial narrow intelligence—not just in degree, but in kind.
(Lots to double-click on here, but this is getting stupidly long even for a LW comment, so I’m going to stop indulging the urge to preemptively double-click and expand everything for you, since that’s flatly impossible, and let you pick and choose where to poke at my model. Hope this helps!)
You’ll forgive me if I end up writing multiple separate responses then.
It’s not that I couldn’t come up with examples, but more like I didn’t have time to write a longer comment just then. Are these not examples? What about godshatter?
The terms I first enumerated have specific meaning not coined by you or me, and I am trying to use them in the standard way. Now, it’s possible that I don’t understand the definitions correctly, but I think I do, and I think your definition for (at least) “mesa optimizer” is not the standard one. If you know this and just don’t like the standard definitions (because they are “not useful”), that’s fine, define your own terms, but call them something else, rather than changing them out from under me.
Specifically, I was going off the usage here. Does that match your understanding?
I was specifically talking about inner alignment, where the mesa objective is a proxy measure for the base objective. But I can see how Goodhart’s law could apply to outer alignment too, come to think of it: if you fail to specify your real goal and instead specify a proxy.
I agree that “godshatter” is an example of a misaligned mesa-optimizer with respect to evolution’s base objective (inclusive genetic fitness). But note specifically that my argument was that there are no naturally occurring non-general mesa-optimizers, which category humans certainly don’t fit into. (I mean, you can look right at the passage you quoted; the phrase “non-general” is right there in the paragraph.)
In fact, I think humans’ status as general intelligences supports the argument I made, by acting as (moderately weak) evidence that naturally occurring mesa-optimizers do, in fact, exhibit high amounts of generality and agency (demonic-ness, you could say).
(If you wanted to poke at my model harder, you could ask about animals, or other organisms in general, and whether they count as mesa-optimizers. I’d argue that the answer depends on the animal, but that for many animals my answer would in fact be “no”—and even those for whom my answer is “yes” would obviously be nowhere near as powerful as humans in terms of optimization strength.)
As for the Rob Miles video: I mostly see those as outer alignment failures, despite the video name. (Remember, I did say in my previous comment that on my model, many outer alignment failures can masquerade as inner alignment failures!) To comment on the specific examples mentioned in the video:
The agent-in-the-maze examples strike me as a textbook instance of outer misalignment: the reward function by itself was not sufficient to distinguish correct behavior from incorrect behavior. It’s possible to paint this instead as inner misalignment, but only by essentially asserting, flat-out, that the reward function was correct, and the system simply generalized incorrectly. I confess I don’t really see strong reason to favor the latter characterization over the former, while I do see some reason for the converse.
The coin run example, meanwhile, makes a stronger case for being an inner alignment failure, mainly because of the fact that many possible forms of outer misalignment were ruled out via interpretability. The agent was observed to assign appropriately negative values to obstacles, and appropriately positive values to the coin. And while it’s still possible to make the argument that the training procedure failed to properly incentivize learning the correct objective, this is a much weaker claim, and somewhat question-begging.
And, of course, neither of these are examples of mesa-optimization in my view, because mesa-optimization is not synonymous with inner misalignment. From the original post on risks from learned optimization:
And the main issue with these examples is that they occur in toy environments which are simply too… well, simple to produce algorithms usefully characterized as optimizers in their own right, outside of the extremely weak sense in which your thermostat is also an optimizer. (And, like—yes, in a certain sense it is, but that’s not a very high bar to meet; it’s not even at the level of the chess engine example you gave!)
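To make concrete just how low that bar is, here is the “thermostat as optimizer” in its entirety (a toy sketch, obviously): a hardcoded feedback rule with no search, no model, and no lookahead.

```python
def thermostat_step(current_temp: float, setpoint: float, hysteresis: float = 0.5) -> str:
    # Push the measured temperature toward the setpoint with a bang-bang rule.
    if current_temp < setpoint - hysteresis:
        return "heat_on"
    if current_temp > setpoint + hysteresis:
        return "heat_off"
    return "hold"
```

Calling this an “optimizer” is defensible in the weak behavioral sense, which is exactly why it makes a poor anchor for what the mesa-optimization concern is about.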
The usage in that video is based on the definition given by the authors of the linked post, who coined the term to begin with—which is to say, yes, I agree with it. And I already discussed above why this definition does not mean that literally any learned algorithm is a mesa-optimizer (and if it did, so much the worse for the definition)!
(Meta: I generally don’t consider it particularly useful to appeal to the origin of terms as a way to justify their use. In this specific case, it’s fine, since I don’t believe my usage conflicts with the original definition given. But even if you think I’m getting the definitions wrong, it’s more useful, from my perspective, if you explain to me why you think my usage doesn’t accord with the standard definitions. Presumably you yourself have specific reasons for thinking that the examples or arguments I give don’t sound quite right, right? If so, I’d petition you to elaborate on that directly! That seems to me like it would have a much better chance of locating our real disagreement. After all, when two people disagree, the root of that disagreement is usually significantly downstream of where it first appears—and I’ll thank you not to immediately assume that our source of disagreement is located somewhere as shallow as “one of us is misremembering/misunderstanding the definitions of terms”.)
This doesn’t sound right to me? To refer back to your quoted statement:
I’ve bolded [what seem to me to be] the operative parts of that statement. I can easily see a way to map this description onto a description of outer alignment failure:
where the mapping in question goes: proxy → base objective, real target → intended goal. Conversely, I don’t see an equally obvious mapping from that description to a description of inner misalignment, because that (as I described in my previous comment) is a mismatch between the system’s base and mesa-objectives (the latter of which it ends up behaviorally optimizing).
I’d appreciate it if you could explain to me what exactly you’re seeing here that I’m not, because at present, my best guess is that you’re not familiar with these terms (which I acknowledge isn’t a good guess, for basically the reasons I laid out in my “Meta:” note earlier).
Yeah, I don’t think that interpretation is what I was trying to get across. I’ll try to clean it up to clarify:
Not helping? I did not mean to imply that a mesa optimizer is necessarily misaligned or learns the wrong goal, it’s just hard to ensure that it learns the base one.
Goodhart’s law is usually stated as “When a measure becomes a target, it ceases to be a good measure”, which I would interpret more succinctly as “proxies get gamed”.
More concretely, from the Wikipedia article,
Then the analogy would go like this. The desired target (base goal) was “profits”, but the proxy chosen to measure that goal was “number of cars sold”. Under normal conditions, this would work. The proxy is in the direction of the target. That’s why it’s a proxy. But if you optimize the proxy too hard, you blow past the base goal and hit the proxy itself instead. The outer system (optimizer) is the company. It’s trying to optimize the employees. The inner system (optimizer) is the employee, which tries to maximize his own reward. The employee “learned” the wrong (mesa) goal “sell as many cars as possible (at any cost)”, which is not aligned with the base goal of “profits”.
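A toy numerical sketch of that analogy (the numbers are entirely made up): the proxy “cars sold” rises monotonically with the discount an employee offers, while the real target “profit” peaks and then collapses once the proxy is pushed too hard.

```python
def units_sold(discount: float) -> float:
    return 100 + 400 * discount          # deeper discounts sell more cars

def profit(discount: float) -> float:
    margin = 2000 * (1 - 2 * discount)   # per-car margin shrinks, then goes negative
    return units_sold(discount) * margin

for d in (0.0, 0.2, 0.4, 0.6):
    print(f"discount={d:.1f}  units={units_sold(d):6.0f}  profit={profit(d):10.0f}")
# Units sold keeps rising with the discount, while profit peaks and then collapses:
# the employee optimizing "cars sold at any cost" games the proxy and hurts the goal.
```

Under normal (small-discount) conditions the proxy and the goal move together; only under heavy optimization pressure do they come apart, which is the Goodhart pattern being pointed at.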
Rob Miles specifically called out a thermostat as an example of not just an optimizer, but an agent in another video.
No links, because no one in all of existence currently understands what the heck is going on inside of LLMs—which, of course, is just another way of saying that it’s pretty unreasonable to assign a high probability to your personal guesses about what the thing that LLMs do—whether you call that “predicting the most probable next word” or “reasoning about the world”—will or will not scale to.
Which, itself, is just a rephrase of the classic rationalist question: what do you think you know, and why do you think you know it?
(For what it’s worth, by the way, I actually share your intuition that current LLM architectures lack some crucial features that are necessary for “true” general intelligence. But this intuition isn’t very strongly held, considering how many times LLM progress has managed to surprise me already.)
That is exactly what I would think GPT 4 would type.
First, before sending a link, is your name Sydney??!