The “shoggoth” meme is, in part, unfounded propaganda. Here’s one popular incarnation of the shoggoth meme:
This meme accurately portrays the (IMO correct) idea that finetuning and RLHF don’t change the base model too much. Furthermore, it’s probably true that these LLMs think in an “alien” way.
However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It’s at this point that I’d like to point out the simple, obvious fact that “we don’t actually know how these models work, and we definitely don’t know that they’re creepy and dangerous on the inside.”
In my opinion, the prevalence of the shoggoth meme is just another (small) reflection of how community epistemics have been compromised by groupthink and fear. If it’s your job to try to accurately understand how models work—if you aspire to wield them and grow them for friendly purposes—then you shouldn’t pollute your head with propaganda which isn’t based on any substantial evidence.
I’m confident that if there were a “pro-AI” meme with a friendly-looking base model, LW / the shoggoth enjoyers would have nitpicked the friendly meme-creature to hell. They would (correctly) point out “hey, we don’t actually know how these things work; we don’t know them to be friendly, or what they even ‘want’ (if anything); we don’t actually know what each stage of training does...”
Oh, hm, let’s try that! I’ll make a meme asserting that the final model is a friendly combination of its three stages of training, each stage adding different colors of knowledge (pre-training), helpfulness (supervised instruction finetuning), and deep caring (RLHF):
I’m sure that nothing bad will happen to me if I slap this on my laptop, right? I’ll be able to think perfectly neutrally about whether AI will be friendly.
However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It’s at this point that I’d like to point out the simple, obvious fact that “we don’t actually know how these models work, and we definitely don’t know that they’re creepy and dangerous on the inside.”
That’s just one of many shoggoth memes. This is the most popular one:
The shoggoth here is not particularly exaggerated or scary.
Responding to your suggested alternative that is trying to make a point, it seems like the image fails to be accurate, or it seems to me to convey things we do confidently know are false. It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go (alternative common imagery for alien minds are insects or ghosts/spirits with distorted forms, which would evoke similar emotions).
Your picture doesn’t get any of that across. It doesn’t communicate that the base model does not at all behave like a human would (though it would have isolated human features, which is what the eyes usually represent). It just looks like a cute plushy, but “a cute plushy” doesn’t capture any of the experiences of interfacing with a base model (and I don’t think the image conveys multiple layers of some kind of training, though that might just be a matter of effort).
I think the Shoggoth meme is pretty good pedagogically. It captures a pretty obvious truth, which is that base models are really quite alien to interface with, that we know that RLHF probably does not change the underlying model very much, but that as a result we get a model that does have a human interface and feels pretty human to interface with (but probably still performs deeply alien cognition behind the scenes).
This seems like good communication to me. Some Shoggoth memes are cute, some are exaggerated to be scary, which also seems reasonable to me since alien intelligences seem like they are pretty scary, but it’s not a necessary requirement of the core idea behind the meme.
They are deeply schizophrenic, have no consistent beliefs, [...] are deeply psychopathic and seem to have no moral compass
I don’t see how this is any more true of a base model LLM than it is of, say, a weather simulation model.
You enter some initial conditions into the weather simulation, run it, and it gives you a forecast. It’s stochastic, so you can run it multiple times and get different forecasts, sampled from a predictive distribution. And if you had given it different initial conditions, you’d get a forecast for those conditions instead.
Or: you enter some initial conditions (a prompt) into the base model LLM, run it, and it gives you a forecast (completion). It’s stochastic, so you can run it multiple times and get different completions, sampled from a predictive distribution. And if you had given it a different prompt, you’d get a completion for that prompt instead.
It would be strange to call the weather simulation “schizophrenic,” or to say it “has no consistent beliefs.” If you put in conditions that imply sun tomorrow, it will predict sun; if you put in conditions that imply rain tomorrow, it will predict rain. It is not confused or inconsistent about anything, when it makes these predictions. How is the LLM any different?[1]
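To make the analogy concrete, here is a minimal sketch of the shared structure: a fixed set of "weights" that maps a condition to a predictive distribution, from which each run draws a sample. Everything here (the `WEIGHTS` table, the condition names, the probabilities) is made up for illustration; it is not how a real weather model or LLM is implemented.

```python
import random

# Hypothetical fixed "weights": one predictive distribution per condition.
# A real simulator or LLM computes this distribution; here it is a lookup.
WEIGHTS = {
    "high pressure": {"sun": 0.9, "rain": 0.1},
    "low pressure": {"sun": 0.2, "rain": 0.8},
}

def sample_forecast(weights, condition, rng):
    """Draw one outcome from the predictive distribution for `condition`."""
    outcomes = list(weights[condition].keys())
    probs = list(weights[condition].values())
    return rng.choices(outcomes, weights=probs, k=1)[0]

rng = random.Random(0)
sun_rate = {}
for condition in WEIGHTS:
    # Same weights, same condition, many runs: individual samples differ.
    samples = [sample_forecast(WEIGHTS, condition, rng) for _ in range(1000)]
    sun_rate[condition] = samples.count("sun") / len(samples)
    print(condition, sun_rate[condition])
```

Individual samples disagree with each other, but the mapping from condition to distribution never changes, which is the sense in which neither system is "inconsistent."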
Meanwhile, it would be even stranger to say “the weather simulation has no moral compass.”
In the case of LLMs, I take this to mean something like, “they are indifferent to the moral status of their outputs, instead aiming only for predictive accuracy.”
This is also true of the weather simulation—and there it is a virtue, if anything! Hurricanes are bad, and we prefer them not to happen. But we would not want the simulation to avoid predicting hurricanes on account of this.
As for “psychopathic,” davinci-002 is not “psychopathic,” any more than a weather model, or my laptop, or my toaster. It does not neglect to treat me as a moral patient, because it never has a chance to do so in the first place. If I put a prompt into it, it does not know that it is being prompted by anyone; from its perspective it is still in training, looking at yet another scraped text sample among billions of others like it.
Or: sometimes, I think about different courses of action I could take. To aid me in my decision, I imagine how people I know would respond to them. I try, here, to imagine only how they really would respond, as apart from how they ought to respond, or how I would like them to respond.
If a base model is psychopathic, then so am I, in these moments. But surely that can’t be right?
Like, yes, it is true that these systems (weather simulation, toaster, GPT-3) are not human beings. They’re things of another kind.
But framing them as “alien,” or as “not behaving as a human would,” implies some expected reference point of “what a human would do if that human were, somehow, this system,” which doesn’t make much sense if thought through in detail—and which we don’t, and shouldn’t, usually demand of our tools and machines.
Is my toaster alien, on account of behaving as it does? What would behaving as a human would look like, for a toaster?
Should I be unsettled by the fact that the world around me does not teem with levers and handles and LEDs in frantic motion, all madly tapping out morse code for “SOS SOS I AM TRAPPED IN A [toaster / refrigerator / automatic sliding door / piece of text prediction software]”? Would the world be less “alien,” if it were like that?
often spout completely non-human kinds of texts
I am curious what you mean by this. LLMs are mostly trained on texts written by humans, so this would be some sort of failure, if it did occur often.
But I don’t know of anything fitting this description that does occur often. There are cases like the Harry Potter sample I discuss here, but those have gotten rare as the models have gotten better, though they do still happen on occasion.
The weather simulation does have consistent beliefs in the sense that it always uses the same (approximation to) real physics. In this sense, the LLM also has consistent beliefs, reflected in the fact that its weights are fixed.
I also think the cognition in a weather model is very alien. It’s less powerful and general, so I think the error of applying something like the Shoggoth image to that (or calling it “alien”) would be that it would imply too much generality, but the alienness seems appropriate.
If you somehow had a mind that was constructed on the same principles as weather simulations, or your laptop, or your toaster (whatever that would mean, I feel like the analogy is fraying a bit here), that would display similar signs of general intelligence as LLMs, then yeah, I think analogizing them to alien/eldritch intelligences would be pretty appropriate.
It is a very common (and even to me tempting) error to see a system with the generality of GPT-4, trained on human imitation, and imagine that it must internally think like a human. But my best guess is that this is not what is going on, and in some sense it is valuable to be reminded that the internal cognition going on in GPT-4 is probably about as far from what is going on in a human brain as a weather simulation is from what is going on in a human trying to forecast the weather (de facto I think GPT-4 is somewhere in between, since I do think the imitation learning creates some structural similarities between humans and LLMs, but I think overall being reminded of this relevant dimension of alienness pays off in anticipated experiences a good amount).
I mostly agree with this comment, but I also think this comment is saying something different from the one I responded to.
In the comment I responded to, you wrote:
It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go
As I described above, these properties seem more like structural features of the language modeling task than attributes of LLM cognition. A human trying to do language modeling (as in that game that Buck et al made) would exhibit the same list of nasty-sounding properties for the duration of the experience—as in, if you read the text “generated” by the human, you would tar the human with the same brush for the same reasons—even if their cognition remained as human as ever.
I agree that LLM internals probably look different from human mind internals. I also agree that people sometimes make the mistake “GPT-4 is, internally, thinking much like a person would if they were writing this text I’m seeing,” when we don’t actually know the extent to which that is true. I don’t have a strong position on how helpful vs. misleading the shoggoth image is, as a corrective to this mistake.
I don’t have a strong position on how helpful vs. misleading the shoggoth image is, as a corrective to this mistake.
You started with random numbers, and you essentially applied rounds of constraint application and annealing. I kinda think of it as getting a metal really hot and pouring it over mold. In this case, the ‘mold’ is your training set.
So what jumps out at me about the “shoggoth” idea is that it’s got all these properties: the “shoggoth” hates you, wants to eat you, is just ready to jump you and digest you with its tentacles. Or whatever.
But none of that cognitive structure will exist unless it paid rent in compressing tokens. This algorithm will not find the optimal compression algorithm, but you only have a tiny fraction of the weights you need to record the token continuations at Chinchilla scaling. You need every last weight to be pulling its weight (no pun intended).
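A rough back-of-the-envelope version of the "tiny fraction of the weights" point, as a sketch under stated assumptions: roughly 20 training tokens per parameter (the commonly cited Chinchilla rule of thumb), 16 bits per weight, and a 50,000-entry vocabulary stored naively at log2(vocab) bits per token. The function name and all default values are illustrative choices, and the comparison against raw, uncompressed token storage is deliberately crude.

```python
import math

def weight_to_data_bits_ratio(n_params, tokens_per_param=20,
                              bits_per_weight=16, vocab_size=50_000):
    """Ratio of bits in the weights to bits needed to store the raw tokens."""
    weight_bits = n_params * bits_per_weight
    n_tokens = n_params * tokens_per_param
    # Naive storage cost: each token is an index into the vocabulary.
    raw_token_bits = n_tokens * math.log2(vocab_size)
    return weight_bits / raw_token_bits

# Assumed model size of 70 billion parameters; the ratio does not depend on it.
ratio = weight_to_data_bits_ratio(70e9)
print(f"weights hold about {ratio:.1%} of the raw token bits")
```

Under these assumptions the weights hold only a few percent of the bits in the training tokens, so memorizing continuations verbatim is not an option and the model is pushed toward compression.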
I remain unconvinced that there’s a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, “nah, LLMs aren’t deeply alien.”
If LLM cognition was not “deeply alien” what would the world look like?
What distinguishing evidence does this world display, that separates us from that world?
What would an only kinda-alien bit of cognition look like?
What would a very human kind of cognition look like?
What different predictions does the world make?
Does alienness indicate that it is because the models, the weights themselves have no “consistent beliefs” apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?
Is it because they “often spout completely non-human kinds of texts”? Is the Mersenne Twister deeply alien? What counts as “completely non-human”?
Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a “moral compass” apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?
Is it that the algorithms that we’ve found in DL so far don’t seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able-to-be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?
Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation? Or if more realistic non-backprop potential brain algorithms actually… kind of just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or more similar research impact whether we thought brains were aliens or not?
Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of predictions? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?
Does every part of a system by itself need to fit into the average person’s ontology for the total to not be deeply alien; do we need to be able to fit every part within a system into a category comprehensible by an untutored human in order to describe it as not deeply alien? Is anything in the world not deeply alien by this standard?
To re-question: What predictions can I make about the world because LLMs are “deeply alien”?
Are these predictions clear?
When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?
What kind of contexts does this “deeply alien” statement come up in? Are those contexts people are trying to explain, or to persuade?
If I piled up all the useful terms that I know that help me predict how LLMs behave, would “deeply alien” be an empty term on top of these?
Or would it give me no more predictive value than “many behaviors of an LLM are currently not understood”?
Another component corresponds to a general view that LLMs are trained in a very different way from how humans learn. (Though you could in principle get the same cognition from very different learning processes.)
This does correspond to specific falsifiable predictions.
Despite being pretty confident in “deeply alien” in many respects, it doesn’t seem clear to me whether LLMs will in practice have very different relative capability profiles from humans on larger scale downstream tasks we actually care about. (It currently seems like the answer will be “mostly no” from my perspective.)
In addition to the above, I’d add in some stuff about how blank slate theory seems to be wrong as a matter of human psychology. If evidence comes out tomorrow that actually humans are blank slates to a much greater extent than I realized, so much so that e.g. the difference between human and dog brains is basically just size and training data, I’d be more optimistic that what’s going on inside LLMs isn’t deeply alien.
If evidence comes out tomorrow that actually humans are blank slates to a much greater extent than I realized, so much so that e.g. the difference between human and dog brains is basically just size and training data
is basically correct, in the sense that a lot of the reason humans succeeded was essentially culture + language, which is essentially both increasing training data and increasing the data quality, and also that human brains have more favorable scaling laws than basically any other animals, because we dissipate heat way better than other animals, and also have larger heads.
A lot of the reason you couldn’t make a dog as smart as a human is because its brain would almost certainly not fit in the birth canal, and if you solved that problem, you’d have to handle heat dissipation, and dogs do not dissipate heat well compared to humans.
IMO, while I think the original blank slate hypothesis was incorrect, I do think a weaker version of it does work, and a lot of AI progress is basically the revenge of the blank slate people, in that you can have very weak priors and learning still works, both for capabilities and alignment.
That much I knew already at the time I wrote the comment. I know that e.g. human brains are basically just scaled up chimp brains etc. I definitely am closer to the blank slate end of the spectrum than the ‘evolution gave us tons of instincts’ end of the spectrum. But it’s not a total blank slate. What do you think would happen if we did a bunch of mad science to make super-dogs that had much bigger brains? And then took one of those dogs and tried to raise it like a human child? My guess is that it would end up moving some significant fraction of the distance from dog to human, but not crossing the gap entirely, and if it did end up fitting in to human society, getting a job, etc. it would have very unusual fetishes at the very least and probably all sorts of other weird desires and traits.
I agree with you, but I think my key crux is that I tend to think that ML people have more control over AI’s data sources than humans have over their kids or super-dogs, and this is only going to increase for boring capabilities reasons (primarily due to the need to get past the data wall, and importantly having control over the data lets you create very high quality data), and as it turns out, a lot of what the AI values is predicted very well if you know its data, and much less well if you only know its architecture.
And this suggests a fairly obvious alignment strategy. You’ve mentioned that the people working on it might not do it because they are too busy racing to superintelligence, and I agree that the major failure mode in practice will be companies being in such a Molochian race to the top for superintelligence that they can’t do any safety efforts.
This would frankly be terrifying, and I don’t want this to happen, but contra many on Lesswrong, I think there’s a real chance that AI is safe and aligned even in effectively 0 dignity futures like this, but I don’t think the chance is high enough that I’d promote a race to superintelligence at all.
OK. We might have some technical disagreement remaining about the promisingness of data-control strategies, but overall it seems like we are on basically the same page.
Zooming in on our potential disagreement though: “...as it turns out, a lot of what the AI values is predicted very well if you know its data.” Can you say more about what you mean by this and what your justification is? IMO there are lots of things about AI values that we have so far failed to predict in advance (though usually it’s possible to tell a plausible story with the benefit of hindsight). idk. Curious to hear more.
What I mean by this is that if you want to predict what an AI will do (for example, whether it will do better than other models at a new capability, or what its values are like), and especially if you want to predict OOD behavior accurately, you would be far better off knowing what its data sources are, as well as the quality of its data, than knowing only its prior/architecture.
Re my justification for it, my basic justification comes from this tweet thread, which points out that a lot of o1’s success could come from high-quality data, and while I don’t like the argument that search/fancy bits aren’t happening at all (I do think o1 is doing a very small run-time search), I agree with the conclusion that the data quality was probably most of the reason o1 is so good in coding.
Somewhat more generally, I’m pretty influenced by this post, and while I don’t go as far as claiming that all of what an AI is the dataset, I do think a weaker version of the claim is pretty likely to be true.
But one prediction we could have made in advance, if we knew that data was a major factor in how AIs learn values, is that value misspecification was likely to be far less severe of a problem than 2000-2010s thinking on LW had, and value learning had a tractable direction of progress, as training on human books and language would load into it mostly-correct values, and another prediction we could have made is that human values wouldn’t all be that complicated, and could instead be represented by say several hundred megabyte/1 gigabyte codes quite well, and we could plausibly simplify that further.
To be clear, I don’t think you could have had too high of a probability wrt this prediction before LLMs, but you’d at least have the hypothesis in serious consideration.
Cf here:
The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4. Notably the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values. What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that a 85th percentile human can come up with. The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial, perturbations.
From here on Matthew Barnett’s post about the historical value misspecification argument, and note I’m not claiming that alignment is solved right now:
and here, where it talks about the point that to the extent there’s a gap between loading in correct values versus loading in capabilities, it’s that loading in values data is easier than loading in capabilities data, which kind of contradicts this post from @Rob Bensinger here on motivations being harder to learn, and one could have predicted this because there was a lot of data on human values, and a whole lot of the complexity of the values is in the data, not the generative model, thus it’s very easy to learn values, but predictably harder to learn a lot of the most useful capabilities.
Again, we couldn’t have too high a probability for this specific outcome happening, but you’d at least seriously consider the hypothesis.
From Rob Bensinger:
“Hidden Complexity of Wishes” isn’t arguing that a superintelligence would lack common sense, or that it would be completely unable to understand natural language. It’s arguing that loading the right *motivations* into the AI is a lot harder than loading the right understanding.
From @beren on alignment generalizing further than capabilities, in spiritual response to Bensinger:
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
But that’s how we could well have made predictions about AIs, or at least elevated these hypotheses to reasonable probability mass, in an alternate universe where LW didn’t anchor too hard on their previous models of AI like AIXI and Solomonoff induction.
Note that in order for my argument to go through, we also need the brain to be similar enough to DL systems that we can validly transfer insights from DL to the brain, and while I don’t think you could place too high of a probability 10-20 years ago on that, I do think that at the very least this should have been considered as a serious possibility, which LW mostly didn’t do.
However, we now have that evidence, and I’ll post links below:
These are a lot of questions, my guess is most of which are rhetorical, so not sure which ones you are actually interested in getting an answer on. Most of the specific questions I would answer with “no”, in that they don’t seem to capture what I mean by “alien”, or feel slightly strawman-ish.
Responding at a high-level:
There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples:
Does the base model pass a Turing test?
Does the performance distribution of the base model on different tasks match the performance distribution of humans?
Does the generalization and learning behavior of the base model match how humans learn things?
When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
There are a lot of structural and algorithmic properties that could match up between human and LLM systems:
Do they interface with the world in similar ways?
Do they require similar amounts and kinds of data to learn the same relationships?
Do the low-level algorithmic properties of how human brains store and process information look similar between the two systems?
A lot more stuff, but I am not sure how useful going into a long list here is. At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien.
I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research.
For example, a bunch of the experiments I would most like to see, that seem helpful with AI Alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much easier time learning than humans (currently even if a transformer could relatively easily reach vastly superhuman performance at a task with more specialized training data, due to the structure of the training being oriented around human imitation, observed performance at the task will cluster around human level, but seeing where transformers could reach vast superhuman performance would be quite informative on understanding the degree to which its cognition is alien).
So I don’t consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.
I like a lot of these questions, although some of them give me an uncanny feeling akin to “wow, this is a very different list of uncertainties than I have.” I’m sorry that my initial list of questions was aggressive.
So I don’t consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.
I’m not sure how they add up to alienness, though? They’re about how we’re different than models, whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is “deeply alien”, is that just saying it’s different than us in lots of ways? I’m cool with that, but the surplus negative valence involved in “LLMs are like shoggoths” versus “LLMs have very different performance characteristics than humans” seems to me pretty important.
Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don’t call Python alien.
This feels like reminding an economics student that the market solves things differently than a human—which is true—by saying “The market is like Baal.”
Do they require similar amounts and kinds of data to learn the same relationships?
There is a fun paper on this you might enjoy. Obviously not a total answer to the question.
The main difference between calculators, weather predictors, markets, and Python versus LLMs is that LLMs can talk to you in a relatively strong sense of “talk”. So, by default, people don’t have mistaken impressions of the cognitive nature of calculators, markets, and Python, while they might have a mistaken impression of LLMs.
Like it isn’t surprising to most people that calculators are quite amoral in their core (why would you even expect morality?). But the claim that the thing which GPT-4 is built out of is quite amoral is non-obvious to people (though obvious to people with slightly more understanding).
I do think there is an important point which is communicated here (though it seems very obvious to people who actually operate in the domain).
I agree this can be initially surprising to non-experts!
I just think this point about the amorality of LLMs is much better communicated by saying “LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style.”
Agreed, though of course as always, there is the issue that that’s an intentional-stance way to describe what a language model does: “they will generally try to continue it in that style.” Hence mechinterp, which tries to (heh) move to a mechanical stance, which will likely be something like “when you give them a [whatever] text to continue, it will match [some list of features], which will then activate [some part of the network that we will name later], which implements the style that matches those features”.
(incidentally, I think there’s some degree to which people who strongly believe that artificial NNs are alien shoggoths are underestimating the degree to which their own brains are also alien shoggoths. but that doesn’t make it a good model of either thing. the only reason it was ever an improvement over a previous word was when people had even more misleading intuitive-sketch models.)
LLMs are trained to continue text from an enormous variety of sources
This is a bit of a noob question, but is this true post RLHF? Generally most of my interactions with language models these days (e.g. asking for help with code, asking to explain something I don’t understand about history/medicine/etc) don’t feel like they’re continuing my text, it feels like they’re trying to answer my questions politely and well. I feel like “ask shoggoth and see what it comes up with” is a better model for me than “go to the AI and have it continue your text about the problem you have”.
To the best of my knowledge, the majority of research (all the research?) has found that the changes to a LLM’s text-continuation abilities from RLHF (or whatever descendant of RLHF is used) are extremely superficial.
Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions (i.e., they share the top-ranked tokens). Most distribution shifts occur with stylistic tokens (e.g., discourse markers, safety disclaimers). These direct evidence strongly supports the hypothesis that alignment tuning primarily learns to adopt the language style of AI assistants, and that the knowledge required for answering user queries predominantly comes from the base LLMs themselves.
Or, in short, the LLM is still basically doing the same thing, with a handful of additions to keep it on-track in the desired route from the fine-tuning.
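To make the quoted measurement concrete, here is a minimal sketch of what “share the top-ranked tokens” means operationally. It is not the paper’s code: the function names and the toy logits are made up for illustration, and a real comparison would score the same text with an actual base model and its tuned counterpart.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def shifted_positions(base_logits, tuned_logits):
    """Positions where the two models disagree on the top-ranked token."""
    if len(base_logits) != len(tuned_logits) or not base_logits:
        raise ValueError("need two non-empty logit lists of equal length")
    return [
        i
        for i, (b, t) in enumerate(zip(base_logits, tuned_logits))
        if argmax(b) != argmax(t)
    ]


def top1_agreement(base_logits, tuned_logits):
    """Fraction of positions where both models rank the same token first."""
    shifted = shifted_positions(base_logits, tuned_logits)
    return 1 - len(shifted) / len(base_logits)


# Toy data (hypothetical): 4 token positions, vocabulary of 3 tokens.
base = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.1], [0.1, 0.2, 3.0]]
tuned = [[2.2, 0.0, 0.3], [0.1, 1.7, 0.2], [0.3, 1.1, 0.1], [0.1, 0.3, 2.8]]

print(top1_agreement(base, tuned))     # share of positions with the same top token
print(shifted_positions(base, tuned))  # positions where tuning changed the top token
```

On the claim quoted above, the shifted positions in a real run would mostly be stylistic tokens, while the agreement fraction would stay high.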
(I also think our very strong prior belief should be that LLMs are basically still text-continuation machines, given that 99.9% or so of the compute put into them is training them for this objective, and that neural networks lose plasticity as they learn. Ash and Adams is like a really good intro to this loss of plasticity, although most of the research that cites this is RL-related so people don’t realize.)
Similarly, a lot of people have remarked on how the textual quality of the responses from an RLHF’d language model can vary with the textual quality of the question. But of course this makes sense from a text-prediction perspective: in text, a high-quality answer is more likely to follow a high-quality question than a low-quality one. This kind of thing (preceding the model’s generation with high-quality text) was the only way to get high-quality answers from base models, and the effect is still there, hidden.
So yeah, I do think this is a much better model for interacting with these things than asking a shoggoth. It actually gives you handles to interact with them better, while asking a shoggoth gives you no such handles.
The people who originally came up with the shoggoth meme, I’d bet, were very well aware of how LLMs are pretrained to predict text and how they are best modelled (at least for now) as trying to predict text. When I first heard the shoggoth meme that’s what I thought—I interpreted it as “it’s this alien text-prediction brain that’s been retrained ever so slightly to produce helpful chatbot behaviors. But underneath it’s still mostly just about text prediction. It’s not processing the conversation in the same way that a human would.” Mildly relevant: In the Lovecraft canon IIRC Shoggoths are servitor-creatures, they are basically beasts of burden. They aren’t really powerful intelligent agents in their own right, they are sculpted by their creators to perform useful tasks. So, for me at least, calling them shoggoth has different and more accurate vibes than, say, calling them Cthulhu. (My understanding of the canon may be wrong though)
Hmm, I think that’s a red herring though. Consider humans—most of them have read lots of text from an enormous variety of sources as well. Also while it’s true that current LLMs have only a little bit of fine-tuning applied after their pre-training, and so you can maybe argue that they are mostly just trained to predict text, this will be less and less true in the future.
How about “LLMs are like baby alien shoggoths, that instead of being raised in alien culture, we’ve adopted at birth and are trying to raise in human culture. By having them read the internet all day.”
(Come to think of it, I actually would feel noticeably more hopeful about our prospects for alignment success if we actually were “raising the AGI like we would a child.” If we had some interdisciplinary team of ML and neuroscience and child psychology experts that was carefully designing a curriculum for our near-future AGI agents, a curriculum inspired by thoughtful and careful analogies to human childhood, that wouldn’t change my overall view dramatically but it would make me noticeably more hopeful. Maybe brain architecture & instincts basically don’t matter that much and Blank Slate theory is true enough for our purposes that this will work to produce an agent with values that are in-distribution for the range of typical modern human values!)
(This doesn’t contradict anything you said, but it seems like we totally don’t know how to “raise an AGI like we would a child” with current ML. Like I don’t think it counts for very much if almost all of the training time is a massive amount of next-token prediction. Like a curriculum of data might work very differently on AI vs humans due to a vastly different amount of data and a different training objective.)
I’ve seen mixed data on how important curricula are for deep learning. One paper (on CIFAR) suggested that curricula only help if you have very few datapoints or the labels are noisy. But possibly that doesn’t generalize to LLMs.
I think data ordering basically never matters for LLM pretraining. (As in, random is the best and trying to make the order more specific doesn’t help.)
ETA: The following was written more aggressively than I now endorse.
I think this is revisionism. What’s the point of me logging on to this website and saying anything if we can’t agree that a literal eldritch horror is optimized to be scary, and meant to be that way?
The shoggoth here is not particularly exaggerated or scary.
Exaggerated from what? Its usual form as a 15-foot-tall person-eating monster which is covered in eyeballs?
The shoggoth is optimized to be scary, even in its “cute” original form, because it is a literal Lovecraftian horror. Even the word “shoggoth” itself has “AI uprising, scary!” connotations:
At the Mountains of Madness includes a detailed account of the circumstances of the shoggoths’ creation by the extraterrestrial Elder Things. Shoggoths were initially used to build the cities of their masters. Though able to “understand” the Elder Things’ language, shoggoths had no real consciousness and were controlled through hypnotic suggestion. Over millions of years of existence, some shoggoths mutated, developed independent minds, and rebelled. The Elder Things succeeded in quelling the insurrection, but exterminating the shoggoths was not an option as the Elder Things were dependent on them for labor and had long lost their capacity to create new life. Wikipedia
Let’s be very clear. The shoggoth has consistently been viewed in a scary, negative light by many people. Let’s hear from the creator @Tetraspace themselves:
@TetraspaceWest, the meme’s creator, told me in a Twitter message that the Shoggoth “represents something that thinks in a way that humans don’t understand and that’s totally different from the way that humans think.”
Comparing an A.I. language model to a Shoggoth, @TetraspaceWest said, wasn’t necessarily implying that it was evil or sentient, just that its true nature might be unknowable.
“I was also thinking about how Lovecraft’s most powerful entities are dangerous — not because they don’t like humans, but because they’re indifferent and their priorities are totally alien to us and don’t involve humans, which is what I think will be true about possible future powerful A.I.” NYTimes
It’s true that Tetraspace didn’t intend the shoggoth to be inherently evil, but that’s not what I was alleging. The shoggoth meme communicates, and always has communicated, a sense of danger which is unsupported by substantial evidence. We can keep reading:
it reinforces the notion that what’s happening in A.I. today feels, to some of its participants, more like an act of summoning than a software development process. They are creating the blobby, alien Shoggoths, making them bigger and more powerful, and hoping that there are enough smiley faces to cover the scary parts.
...
That some A.I. insiders refer to their creations as Lovecraftian horrors, even as a joke, is unusual by historical standards
The origin of the shoggoth:
In the story, shoggoths rise up against the Old Ones in a series of slave revolts that surely contribute to the collapse of the Old Ones’ society, Joshi notes. The AI anxiety that inspired comparisons to the cartoon monster image certainly resonates with the ultimate fate of that society. CNBC
It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass
These are a lot of words with anthropomorphic connotation. The models exhibit “alien” behavior and yet you make human-like inferences about their internals. E.g. “Deeply psychopathic.” I think you’re drawing a bunch of unwarranted inferences with undue negative connotations.
Your picture doesn’t get any of that across.
My point wasn’t that we should use the “alternative.” The point was that both images are stupid[1] and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to “correct.”)
I think the Shoggoth meme is pretty good pedagogically. It captures a pretty obvious truth, which is that base models are really quite alien to interface with, that we know that RLHF probably does not change the underlying model very much, but that as a result we get a model that does have a human interface and feels pretty human to interface with (but probably still performs deeply alien cognition behind the scenes).
I agree these are strengths, and said so in my original comment. But also as @cfoster0 said:
As far as I can tell, the shoggoth analogy just has high memetic fitness. It doesn’t contain any particular insight about the nature of LLMs. No need to twist ourselves into a pretzel trying to backwards-rationalize it into something deep.
To clarify, I don’t mean to belittle @Tetraspace for making the meme. Good fun is good fun. I mean “stupid” more like “how the images influence one’s beliefs about actual LLM friendliness.” But I expressed it poorly.
The point was that both images are stupid and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to “correct.”)
(This is too gotcha shaped for me, so I am bowing out of this conversation)
I think I communicated my core point. I think it’s a good image that gets an important insight across, and don’t think it’s “propaganda” in the relevant sense of the term. Of course anything that’s memetically adaptive will have some edge-cases that don’t match perfectly, but I am getting a good amount of mileage out of calling LLMs “Shoggoths” in my own thinking and think that belief is paying good rent.
If you disagree with the underlying cognition being accurately described as alien, I can have that conversation, since it seems like maybe the underlying crux, but your response above seems like it’s taking it as a given that I am “making excuses”, and is doing a gotcha-thing which makes it hard for me to see a way to engage without further having my statements be taken as confirmation of some social narrative.
In retrospect, I do wish I had written my comment less aggressively, so my apologies on that front! I wish I’d instead written things like “I think I made some obviously correct narrow points about the shoggoth having at least some undue negative connotations, and I wish we could agree on at least that. I feel frustrated because it seems like it’s hard to reach agreement even on relatively simple propositions.”
I do agree that LLMs probably have substantially different internal mechanisms than people. That isn’t the crux. I just wish this were communicated in a more neutral way. In an alternate timeline, maybe this meme instead consisted of a strange tangle of wires and mist and question-marks with a mask on. I’d be more on-board with that.
Again, I agree that the Shoggoth meme can cure people of some real confusions! And I don’t think the meme has a huge impact, I just think it’s moderate evidence of some community failures I worry about.
I think a lot of my position is summarized by 1a3orn:
I just think this point about the amorality of LLMs is much better communicated by saying “LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style.”
fwiw I agree with the quotes from Tetraspace you gave, and disagree with “has communicated a sense of danger which is unsupported by substantial evidence.” The sense of danger is very much supported by the current state of evidence.
That said, I agree that the more detailed image is kinda distastefully propagandaisty in a way that the original cutesy shoggoth image is not. I feel like the more detailed image adds in an extra layer of revoltingness and scariness (e.g. the sharp teeth) than would be appropriate given our state of knowledge.
re: “the sense of danger is very much supported by the current state of evidence”—I mean, you’ve heard all this stuff before, but I’ll summarize:
--Seems like we are on track to probably build AGI this decade
--Seems like we are on track to have an intelligence explosion, i.e. a speedup of AI R&D due to automation
--Seems like the AGI paradigm that’ll be driving all this is fairly opaque and poorly understood. We have scaling laws for things like text perplexity but other than that we are struggling to predict capabilities, and double-struggling to predict inner mechanisms / ‘internal’ high-level properties like ‘what if anything does it actually believe or want’
--A bunch of experts in the field have come out and said that this could go terribly & we could lose control, even though it’s low-status to say this & took courage.
--Generally speaking the people who have thought about it the most are the most worried; the most detailed models of what the internal properties might be like are the most gloomy, etc. This might be due to selection/founder effects, but sheesh, it’s not exactly good news!
I feel like the more detailed image adds in an extra layer of revoltingness and scariness (e.g. the sharp teeth) than would be appropriate given our state of knowledge.
Now I’m really curious to know what would justify the teeth. I’m not aware of any AIs intentionally biting someone, but presumably that would be sufficient.
Perhaps if we were dealing not with deepnets primarily trained to predict text, but rather deepnets primarily trained to addict people with pleasant seductive conversation and then drain their wallets? Such an AI would in some real sense be an evolved predator of humans.
I think one mindset that may be healthy is to remember:
Reality is too complex to be described well by a single idea (meme/etc.). If one responds to this by forcing each idea presented to be as good an approximation of reality as possible, then that causes all the ideas to become “colorless and blurry”, as any specific detail would be biased when considered on its own.
Therefore, one cannot really fight about whether an idea is biased in isolation. Rather, the goal should be to create a bag of ideas which in totality is as informative about a subject as possible.
I think you are basically right that the shoggoth meme is describing one of the most negative projections of what LLMs could be doing. One approach is to try to come up with a single projection and try to convince everyone else to use this instead. I’m not super comfortable with that either because I feel like there’s a lot of uncertainty about what the most productive way to think about LLMs is, and I would like to keep options open.
Instead I’d rather have a collection of different ways to think about it (you could think of this collection as a discrete approximation to a probability distribution). Such a list would have many uses, e.g. as a checklist or a reference to guide people to.
It does seem problematic for the rationalist community to refuse to acknowledge that the shoggoth meme presents LLMs as being scary monsters, but it also seems problematic to insist that the shoggoth meme exaggerates the danger of LLMs, because that should be classified based on P(meme danger > actual danger), rather than on the basis of meme danger > E[actual danger]. That is, if there’s significant uncertainty about how actually dangerous LLMs are, then there’s also significant uncertainty about whether the memes exaggerate the danger; one shouldn’t just compare against a single point estimate.
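A toy calculation may make the difference between the two criteria clearer. This is only a sketch: the danger scale, the two-outcome distribution, and the meme’s “danger level” are invented numbers, chosen so the two criteria visibly come apart.

```python
def expected_value(dist):
    """Mean of a discrete distribution given as (value, probability) pairs."""
    return sum(value * prob for value, prob in dist)


def prob_meme_exceeds(meme_danger, dist):
    """P(meme danger > actual danger) under the same discrete distribution."""
    return sum(prob for value, prob in dist if meme_danger > value)


# Hypothetical beliefs about actual danger on a 0..1 scale:
# a coin flip between "fairly mild" and "as bad as it gets".
actual_danger = [(0.25, 0.5), (1.0, 0.5)]
meme_danger = 0.75  # hypothetical danger level the meme conveys

# Point-estimate criterion: is the meme scarier than the expected danger?
exceeds_mean = meme_danger > expected_value(actual_danger)

# Probabilistic criterion: how likely is it that the meme overstates the danger?
p_exaggerates = prob_meme_exceeds(meme_danger, actual_danger)

print(expected_value(actual_danger), exceeds_mean, p_exaggerates)
```

With these made-up numbers the meme is above the mean, so the point-estimate comparison would call it an exaggeration, yet it only overstates the danger in half of the possible worlds.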
I think this just comes down to personal taste on how you’re interpreting the image? I find the original shoggoth image cute enough that I use it as my primary discord message reacts. My emotional reaction to it to the best of my introspection has always been “weird alien structure and form” or “awe-inspiringly powerful and incomprehensible mind” and less “horrifying and scary monster”. I’m guessing this is the case for the vast majority of people I personally know who make said memes. It’s entirely possible the meme has the end-consequence of appealing to other people for the reasons you mention, but then I think it’s important to make that distinction.
It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I’m not sure I’d describe them as “very alien”, they’re more “uncanny valley” where they often make sense and seem human-like, until suddenly they don’t. (On theoretical grounds, I think they’re using rather non-human means of cognition to attempt to model human writing patterns as closely as they can, they often get this right, but on occasion make very non-human errors — more frequently for smaller models.) The Shoggoth mental metaphor exaggerates this somewhat for effect (and more so for the very scary image Alex posted at the top, which I haven’t seen used as often as the one Oliver posted).
This is one of the reasons why Quintin and I proposed a more detailed and somewhat less scary/alien (but still creepy) metaphor: Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor. I’d be interested to know what people think of that one in comparison to the Shoggoth — we were attempting to be more unbiased, as well as more detailed.
More broadly, TurnTrout, I’ve noticed you using this whole “look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!” line of reasoning a few times (e.g., I think this logic came up in your comment about Evan’s recent paper). And I sort of see you taking on some sort of “the people with high P(doom) just have bad epistemics” flag in some of your comments.
A few thoughts (written quickly, prioritizing speed over precision):
I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it “cool” to think they’re one of the few people who are able to accurately identify the world is ending, etc.
I also think that there are plenty of factors biasing epistemics in the “hopeful” direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community’s talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the “high P(doom)” side.
I also think there’s an important distinction between “I personally think this argument is wrong” and “look, here’s an example of propaganda + poor community epistemics.” In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying “not only do I think you’re wrong, but also here are some ways in which you and your allies have poor epistemics.” (IDK though, insofar as you actually believe that’s what’s happening, it seems good to say aloud, and I think there’s a version of this that goes too far and polices speech counterproductively, but I do think that statements like “community epistemics have been compromised by groupthink and fear” are pretty unproductive and could be met with statements like “community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.”)
I am quite worried about tribal dynamics reducing the ability for people to engage in productive truth-seeking discussions. I think you’ve pointed out how some of the stylistic/tonal things from the “high P(doom)//alignment hard” side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think that the “low P(doom)//alignment not hard” side seems to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging in some sort of “ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted” vibe that predictably makes people defensive & makes discourse harder).
My guess is that it’s relatively epistemically corrupting and problematic to spend a lot of time engaging with weak arguments.
I think it’s tempting to make the mistake of thinking that debunking a specific (bad) argument is the same as debunking a conclusion. But actually, these are extremely different operations. One requires understanding a specific argument while the other requires level-headed investigation of the overall situation. Separately, there are often actually good intuitions underlying bad arguments and recovering this intuition is an important part of truth-seeking.
I think my concerns here probably apply to a wide variety of people thinking about AI x-risk. I worry about this for myself.
Thanks for this, I really appreciate this comment (though my perspective is different on many points).
My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the “high P(doom)” side.
It’s true that I spend more effort critiquing bad doom arguments. I would like to note that when e.g. I read Quintin I generally am either in agreement or neutral. I bet there are a lot of cases where you would think “that’s a poor argument” and I’d say “hm I don’t think Akash is getting the point (and it’d be good if someone could give a better explanation).”
However, it’s definitely not true that I never critique optimistic arguments which I consider poor. For example, I don’t get why Quintin (apparently) thinks that spectral bias is a reason for optimism, and I’ve said as much on one of his posts. I’ve said something like “I don’t know why you seem to think you can use this mathematical inductive bias to make high-level intuitive claims about what gets learned. This seems to fall into the same trap that ‘simplicity’ theorizing does.” I probably criticize or express skepticism of certain optimistic arguments at least twice a week, though not always on public channels. And I’ve also pushed back on people being unfair, mean, or mocking of “doomers” on private channels.
I do think that statements like “community epistemics have been compromised by groupthink and fear” are pretty unproductive and could be met with statements like “community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.”
I think both statements are true to varying degrees (the former more than the latter in the cases I’m considering). They’re true and people should say them. The fact that I work at a lab absolutely affects my epistemics (though I think the effect is currently small). People should totally consider the effect which labs are having on discourse.
have a relatively high bar for saying “not only do I think you’re wrong, but also here are some ways in which you and your allies have poor epistemics.”
I do consider myself to have a high bar for this, and the bar keeps getting passed, so I say something. EDIT: Though I don’t mean for my comments to imply “someone and their allies” have bad epistemics. Ideally I’d like to communicate “hey, something weird is in the air guys, can’t you sense it too?”. However, I think I’m often more annoyed than that, and so I don’t communicate that how I’d like.
My impression is that the Shoggoth meme was meant to be a simple meme that says “hey, you might think that RLHF ‘actually’ makes models do what we value, but that’s not true. You’re still left with an alien creature who you don’t understand and could be quite scary.”
Most of the Shoggoth memes I’ve seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of “I should be scared/concerned” reaction. But I don’t think it does so in a “see, AI will definitely be evil” way; it does so in a “look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it’s pretty reasonable to be scared about this foreign alien thing that we don’t understand” way.
To be a bit bolder, I think Shoggoth is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use provocative phrasing, I could say that RLHF serves as “propaganda”. Let’s put aside the fact that you and I might disagree about how much “true evidence” RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF’d models] spreads an overly-rosy “meme” that gives people a misleading perspective of how well we understand AI systems, how safe AI progress is, etc.
From this lens, I see Shoggoth as a counter-meme. It basically says “hey look, the default is for people to think that these things are friendly assistants, because that’s what the AI companies have turned them into, but we should remember that actually we are quite confused about the alien cognition behind the RLHF smiley face.”
Being amorphous, shoggoths can take on any shape needed, making them very versatile within aquatic environments.
At the Mountains of Madness includes a detailed account of the circumstances of the shoggoths’ creation by the extraterrestrial Elder Things. Shoggoths were initially used to build the cities of their masters. Though able to “understand” the Elder Things’ language, shoggoths had no real consciousness and were controlled through hypnotic suggestion. Over millions of years of existence, some shoggoths mutated, developed independent minds, and rebelled.
Quoting because (a) a lot of these features seem like an unusually good match for LLMs and (b) I want to acknowledge that this is picking a metaphor that fictionally rebelled, and thus is potentially alignment-is-hard loaded as a metaphor.
I’m confident that if there were a “pro-AI” meme with a friendly-looking base model, LW / the shoggoth enjoyers would have nitpicked the friendly meme-creature to hell. They would (correctly) point out “hey, we don’t actually know how these things work; we don’t know them to be friendly, or what they even ‘want’ (if anything); we don’t actually know what each stage of training does...”
I’m sure that nothing bad will happen to me if I slap this (friendly AI meme) on my laptop, right? I’ll be able to think perfectly neutrally about whether AI will be friendly.
I have multiple cute AI stickers on my laptop, one of which is a shoggoth meme. Here is a picture of them. Nobody has ever nitpicked their friendly appearance to me. I don’t think they have distorted my thinking about AI in favour of thinking that it will be friendly (altho I think it was after I put them on that I became convinced by a comment by Paul Christiano that there’s ~even odds that unaligned AI wouldn’t kill me, so do with that information what you will).
A bad map that expresses the territory with great uncertainty can be confidently called a bad map; calling it a good map is clearly wrong. In that sense the shoggoth imagery reflects the quality of the map, and as it’s clearly a bad map, better imagery would be misleading about the map’s quality. Even if the underlying territory is lovely, this isn’t known, unlike the disastrous quality of the map of the territory, whose lack of quality is known with much more confidence and in much greater detail. Here be dragons.
(This is one aspect of the meme where it seems appropriate. Some artist’s renditions, including the one you used, channel LeCake, which your alternative image example loses, but obviously the cake is nicer than the shoggoth.)
However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It’s at this point that I’d like to point out the simple, obvious fact that “we don’t actually know how these models work, and we definitely don’t know that they’re creepy and dangerous on the inside.”
It’s optimized to illustrate the point that the neural network isn’t trained to actually care about what the person training it thinks it came to care about, it’s only optimized to act that way on the training distribution. Unless I’m missing something, arguing the image is wrong would be equivalent to arguing that maybe the model truly cares about what its human trainers want it to care about. (Which we know isn’t actually the case.)
Though I’m almost tempted to think of LLMs as being like people who are LARPing or who have impostor syndrome. As in, they spend pretty much all their cognitive capacity on obsessing over doing what they feel looks normal. (This also closely aligns with how they are trained: first they are made to mimic what other people do, and then they are made to mimic what gets praise and avoid what gets critique.) Probably humanizes them even more than your friendly creature proposal.
This sounds somewhat similar to deceptive alignment, so I want to draw a distinction here: It’s not that LARPers/impostors are trying to maximize approval in a consequentialist sense (as this would require modelling how their actions ripple out into the world, which they do not do), but rather that (in the sense described by shard theory) they are molded based on normality and approval. As such they would not do something abnormal/disapproved-of in order to look more normal/approval-worthy later.
The “shoggoth” meme is, in part, unfounded propaganda. Here’s one popular incarnation of the shoggoth meme:
This meme accurately portrays the (IMO correct) idea that finetuning and RLHF don’t change the base model too much. Furthermore, it’s probably true that these LLMs think in an “alien” way.
However, this image is obviously optimized to be scary and disgusting. It looks dangerous, with long rows of sharp teeth. It is an eldritch horror. It’s at this point that I’d like to point out the simple, obvious fact that “we don’t actually know how these models work, and we definitely don’t know that they’re creepy and dangerous on the inside.”
In my opinion, the prevalence of the shoggoth meme is just another (small) reflection of how community epistemics have been compromised by groupthink and fear. If it’s your job to try to accurately understand how models work—if you aspire to wield them and grow them for friendly purposes—then you shouldn’t pollute your head with propaganda which isn’t based on any substantial evidence.
I’m confident that if there were a “pro-AI” meme with a friendly-looking base model, LW / the shoggoth enjoyers would have nitpicked the friendly meme-creature to hell. They would (correctly) point out “hey, we don’t actually know how these things work; we don’t know them to be friendly, or what they even ‘want’ (if anything); we don’t actually know what each stage of training does...”
Oh, hm, let’s try that! I’ll make a meme asserting that the final model is a friendly combination of its three stages of training, each stage adding different colors of knowledge (pre-training), helpfulness (supervised instruction finetuning), and deep caring (RLHF):
I’m sure that nothing bad will happen to me if I slap this on my laptop, right? I’ll be able to think perfectly neutrally about whether AI will be friendly.
That’s just one of many shoggoth memes. This is the most popular one:
The shoggoth here is not particularly exaggerated or scary.
Responding to your suggested alternative, which is trying to make a point: it seems like the image fails to be accurate, or rather it seems to me to convey things we confidently know to be false. It is the case that base models are quite alien. They are deeply schizophrenic, have no consistent beliefs, often spout completely non-human kinds of texts, are deeply psychopathic and seem to have no moral compass. Describing them as a Shoggoth seems pretty reasonable to me, as far as alien intelligences go (alternative common imagery for alien minds is insects or ghosts/spirits with distorted forms, which would evoke similar emotions).
Your picture doesn’t get any of that across. It doesn’t communicate that the base model does not at all behave like a human would (though it would have isolated human features, which is what the eyes usually represent). It just looks like a cute plushy, but “a cute plushy” doesn’t capture any of the experiences of interfacing with a base model (and I don’t think the image conveys multiple layers of some kind of training, though that might just be a matter of effort).
I think the Shoggoth meme is pretty good pedagogically. It captures a pretty obvious truth, which is that base models are really quite alien to interface with, that we know that RLHF probably does not change the underlying model very much, but that as a result we get a model that does have a human interface and feels pretty human to interface with (but probably still performs deeply alien cognition behind the scenes).
This seems like good communication to me. Some Shoggoth memes are cute, some are exaggerated to be scary, which also seems reasonable to me since alien intelligences seem like they are pretty scary, but it’s not a necessary requirement of the core idea behind the meme.
I don’t see how this is any more true of a base model LLM than it is of, say, a weather simulation model.
You enter some initial conditions into the weather simulation, run it, and it gives you a forecast. It’s stochastic, so you can run it multiple times and get different forecasts, sampled from a predictive distribution. And if you had given it different initial conditions, you’d get a forecast for those conditions instead.
Or: you enter some initial conditions (a prompt) into the base model LLM, run it, and it gives you a forecast (completion). It’s stochastic, so you can run it multiple times and get different completions, sampled from a predictive distribution. And if you had given it a different prompt, you’d get a completion for that prompt instead.
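To make the analogy concrete, here is a minimal sketch of "enter conditions, run it, get a sample from a predictive distribution." Everything in it is an assumption for illustration: the probability table, the token names, and the function names are all made up, and a real base model conditions on the whole prompt rather than looking up a single word.

```python
import random

# Hypothetical toy "base model": next-token probabilities keyed by the prompt.
# The numbers are invented for illustration, not taken from any real model.
NEXT_TOKEN_PROBS = {
    "clouds": {"rain": 0.8, "sun": 0.2},
    "clear": {"rain": 0.1, "sun": 0.9},
}


def sample_completion(prompt, rng):
    """Sample one next token from the distribution conditioned on the prompt."""
    dist = NEXT_TOKEN_PROBS[prompt]
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def forecast(prompt, n_runs, seed=0):
    """Run the stochastic model n_runs times and count each outcome."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_runs):
        token = sample_completion(prompt, rng)
        counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    # Different conditions give different predictive distributions;
    # repeated runs under the same conditions give different samples.
    print(forecast("clouds", 1000))
    print(forecast("clear", 1000))
```

The same fixed table produces "rain" under one prompt and "sun" under another, without the table itself changing between calls.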
It would be strange to call the weather simulation “schizophrenic,” or to say it “has no consistent beliefs.” If you put in conditions that imply sun tomorrow, it will predict sun; if you put in conditions that imply rain tomorrow, it will predict rain. It is not confused or inconsistent about anything, when it makes these predictions. How is the LLM any different?[1]
Meanwhile, it would be even stranger to say “the weather simulation has no moral compass.”
In the case of LLMs, I take this to mean something like, “they are indifferent to the moral status of their outputs, instead aiming only for predictive accuracy.”
This is also true of the weather simulation—and there it is a virtue, if anything! Hurricanes are bad, and we prefer them not to happen. But we would not want the simulation to avoid predicting hurricanes on account of this.
As for “psychopathic,” davinci-002 is not “psychopathic,” any more than a weather model, or my laptop, or my toaster. It does not neglect to treat me as a moral patient, because it never has a chance to do so in the first place. If I put a prompt into it, it does not know that it is being prompted by anyone; from its perspective it is still in training, looking at yet another scraped text sample among billions of others like it.
Or: sometimes, I think about different courses of action I could take. To aid me in my decision, I imagine how people I know would respond to them. I try, here, to imagine only how they really would respond—as apart from how they ought to respond, or how I would like them to respond.
If a base model is psychopathic, then so am I, in these moments. But surely that can’t be right?
Like, yes, it is true that these systems (weather simulation, toaster, GPT-3) are not human beings. They’re things of another kind.
But framing them as “alien,” or as “not behaving as a human would,” implies some expected reference point of “what a human would do if that human were, somehow, this system,” which doesn’t make much sense if thought through in detail—and which we don’t, and shouldn’t, usually demand of our tools and machines.
Is my toaster alien, on account of behaving as it does? What would behaving as a human would look like, for a toaster?
Should I be unsettled by the fact that the world around me does not teem with levers and handles and LEDs in frantic motion, all madly tapping out morse code for “SOS SOS I AM TRAPPED IN A [toaster / refrigerator / automatic sliding door / piece of text prediction software]”? Would the world be less “alien,” if it were like that?
I am curious what you mean by this. LLMs are mostly trained on texts written by humans, so this would be some sort of failure, if it did occur often.
But I don’t know of anything fitting this description that does occur often. There are cases like the Harry Potter sample I discuss here, but those have gotten rare as the models have gotten better, though they do still happen on occasion.
The weather simulation does have consistent beliefs in the sense that it always uses the same (approximation to) real physics. In this sense, the LLM also has consistent beliefs, reflected in the fact that its weights are fixed.
I also think the cognition in a weather model is very alien. It’s less powerful and general, so I think the error of applying something like the Shoggoth image to that (or calling it “alien”) would be that it would imply too much generality, but the alienness seems appropriate.
If you somehow had a mind that was constructed on the same principles as weather simulations, or your laptop, or your toaster (whatever that would mean, I feel like the analogy is fraying a bit here), that would display similar signs of general intelligence as LLMs, then yeah, I think analogizing them to alien/eldritch intelligences would be pretty appropriate.
It is a very common (and even to me tempting) error to see a system with the generality of GPT-4, trained on human imitation, and imagine that it must internally think like a human. But my best guess is that this is not what is going on, and in some sense it is valuable to be reminded that the internal cognition going on in GPT-4 is probably about as far from what is going on in a human brain as a weather simulation is from what is going on in a human trying to forecast the weather (de facto I think GPT-4 is somewhere in between, since I do think the imitation learning creates some structural similarities between humans and LLMs, but I think overall being reminded of this relevant dimension of alienness pays off in anticipated experiences a good amount).
I mostly agree with this comment, but I also think this comment is saying something different from the one I responded to.
In the comment I responded to, you wrote:
As I described above, these properties seem more like structural features of the language modeling task than attributes of LLM cognition. A human trying to do language modeling (as in that game that Buck et al made) would exhibit the same list of nasty-sounding properties for the duration of the experience—as in, if you read the text “generated” by the human, you would tar the human with the same brush for the same reasons—even if their cognition remained as human as ever.
I agree that LLM internals probably look different from human mind internals. I also agree that people sometimes make the mistake “GPT-4 is, internally, thinking much like a person would if they were writing this text I’m seeing,” when we don’t actually know the extent to which that is true. I don’t have a strong position on how helpful vs. misleading the shoggoth image is, as a corrective to this mistake.
You started with random numbers, and you essentially applied rounds of constraint application and annealing. I kinda think of it as getting a metal really hot and pouring it over mold. In this case, the ‘mold’ is your training set.
So what jumps out at me about the “shoggoth” idea is that it’s got all these properties: the “shoggoth” hates you, wants to eat you, is just ready to jump you and digest you with its tentacles. Or whatever.
But none of that cognitive structure will exist unless it paid rent in compressing tokens. This algorithm will not find the optimal compression algorithm, but you only have a tiny fraction of the weights you need to record the token continuations at Chinchilla scaling. You need every last weight to be pulling its weight (no pun intended).
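A rough back-of-envelope version of the "tiny fraction of the weights" point, as a sketch under stated assumptions. The parameter and token counts are the published Chinchilla configuration (70B parameters, 1.4T training tokens); the 32,000-token vocabulary and 16 bits per weight are assumptions picked for illustration, and the variable and function names are made up. Storing tokens verbatim is a naive upper bound, since text is compressible, so treat the ratio as an order-of-magnitude figure only.

```python
import math

# Chinchilla's published configuration.
PARAMS = 70e9      # model parameters
TOKENS = 1.4e12    # training tokens (about 20 per parameter)

# Assumptions for this sketch, not claims about any particular model.
VOCAB_SIZE = 32_000    # assumed vocabulary size
BITS_PER_WEIGHT = 16   # assumed storage precision per weight


def bits_to_store_tokens(n_tokens, vocab_size):
    """Bits needed to record every training token verbatim, uncompressed."""
    return n_tokens * math.log2(vocab_size)


def bits_in_weights(n_params, bits_per_weight):
    """Total bits available across all weights at the given precision."""
    return n_params * bits_per_weight


raw_bits = bits_to_store_tokens(TOKENS, VOCAB_SIZE)
weight_bits = bits_in_weights(PARAMS, BITS_PER_WEIGHT)
ratio = raw_bits / weight_bits

if __name__ == "__main__":
    print(f"raw token bits / weight bits = {ratio:.1f}")
```

Under these assumptions the raw training stream is roughly an order of magnitude larger than what the weights could hold bit for bit, which is the sense in which the model has to compress rather than memorize.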
I remain unconvinced that there’s a predictive model of the world opposite this statement, in people who affirm it, that would allow them to say, “nah, LLMs aren’t deeply alien.”
If LLM cognition was not “deeply alien” what would the world look like?
What distinguishing evidence does this world display, that separates us from that world?
What would an only kinda-alien bit of cognition look like?
What would very human kind of cognition look like?
What different predictions does the world make?
Does alienness indicate that it is because the models, the weights themselves have no “consistent beliefs” apart from their prompts? Would a human neocortex, deprived of hippocampus, present any such persona? Is a human neocortex deeply alien? Are all the parts of a human brain deeply alien?
Is it because they “often spout completely non-human kinds of texts”? Is the Mersenne Twister deeply alien? What counts as “completely non-human”?
Is it because they have no moral compass, being willing to continue any of the data on which they were trained? Does any human have a “moral compass” apart from the data on which they were trained? If I can use some part of my brain to improv a consistent Nazi, does that mean that it makes sense to call the part of my brain that lets me do that immoral or psychopathic?
Is it that the algorithms that we’ve found in DL so far don’t seem to slot into readily human-understandable categories? Would a not-deeply-alien algorithm be able-to-be cracked open and show us clear propositions of predicate logic? If we had a human neocortex in an oxygen-infused broth in front of us, and we recorded the firing of every cell, do we anticipate that the algorithms there would be clear propositions of predicate logic? Would we be compelled to conclude that human neocortexes were deeply alien?
Or is it deeply alien because we think the substrate of thought is different, based on backprop rather than local learning? What if local learning could actually approximate backpropagation? Or if more realistic non-backprop potential brain algorithms actually… kind of just acted quite similarly to backprop, such that you could draw a relatively smooth line between them and backprop? Would this or more similar research impact whether we thought brains were aliens or not?
Does substrate-difference count as evidence against alien-ness, or does alien-ness just not make that kind of predictions? Is the cognition of an octopus less alien to us than the cognition of an LLM, because it runs on a more biologically-similar substrate?
Does every part of a system by itself need to fit into the average person’s ontology for the total to not be deeply alien; do we need to be able to fit every part within a system into a category comprehensible by an untutored human in order to describe it as not deeply alien? Is anything in the world not deeply alien by this standard?
To re-question: What predictions can I make about the world because LLMs are “deeply alien”?
Are these predictions clear?
When speaking to someone who I consider a noob, is it best to give them terms whose emotive import is clear, but whose predictive import is deeply unclear?
What kind of contexts does this “deeply alien” statement come up in? Are those contexts people are trying to explain, or to persuade?
If I piled up all the useful terms that I know that help me predict how LLMs behave, would “deeply alien” be an empty term on top of these?
Or would it give me no more predictive value than “many behaviors of an LLM are currently not understood”?
Most of my view on “deeply alien” is downstream of LLMs being extremely superhuman at literal next token prediction and generally superhuman at having an understanding of random details of webtext.
Another component corresponds to a general view that LLMs are trained in a very different way from how humans learn. (Though you could in principle get the same cognition from very different learning processes.)
This does correspond to specific falsifiable predictions.
Despite being pretty confident in “deeply alien” in many respects, it doesn’t seem clear to me whether LLMs will in practice have very different relative capability profiles from humans on larger scale downstream tasks we actually care about. (It currently seems like the answer will be “mostly no” from my perspective.)
In addition to the above, I’d add in some stuff about how blank slate theory seems to be wrong as a matter of human psychology. If evidence comes out tomorrow that actually humans are blank slates to a much greater extent than I realized, so much so that e.g. the difference between human and dog brains is basically just size and training data, I’d be more optimistic that what’s going on inside LLMs isn’t deeply alien.
Re this discussion, I think that this claim:
is basically correct, in the sense that a lot of the reason humans succeeded was essentially culture + language, which is essentially both increasing training data and increasing the data quality, and also that human brains have more favorable scaling laws than basically any other animals, because we dissipate heat way better than other animals, and also have larger heads.
A lot of the reason you couldn’t make a dog as smart as a human is because its brain would almost certainly not fit in the birth canal, and if you solved that problem, you’d have to handle heat dissipation, and dogs do not dissipate heat well compared to humans.
IMO, while I think the original blank slate hypothesis was incorrect, I do think a weaker version of it does work, and a lot of AI progress is basically the revenge of the blank slate people, in that you can have very weak priors and learning still works, both for capabilities and alignment.
That much I knew already at the time I wrote the comment. I know that e.g. human brains are basically just scaled up chimp brains etc. I definitely am closer to the blank slate end of the spectrum than the ‘evolution gave us tons of instincts’ end of the spectrum. But it’s not a total blank slate. What do you think would happen if we did a bunch of mad science to make super-dogs that had much bigger brains? And then took one of those dogs and tried to raise it like a human child? My guess is that it would end up moving some significant fraction of the distance from dog to human, but not crossing the gap entirely, and if it did end up fitting in to human society, getting a job, etc. it would have very unusual fetishes at the very least and probably all sorts of other weird desires and traits.
I agree with you, but I think my key crux is that I tend to think that ML people have more control over AI’s data sources than humans have over their kids or super-dogs, and this is only going to increase for boring capabilities reasons (primarily due to the need to get past the data wall, and importantly having control over the data lets you create very high quality data), and as it turns out, a lot of what the AI values is predicted very well if you know its data, and much less well if you know only its architecture.
And this suggests a fairly obvious alignment strategy. You’ve mentioned that the people working on it might not do it because they are too busy racing to superintelligence, and I agree that the major failure mode in practice will be companies being in such a Molochian race to the top for superintelligence that they can’t do any safety efforts.
This would frankly be terrifying, and I don’t want this to happen, but contra many on Lesswrong, I think there’s a real chance that AI is safe and aligned even in effectively 0 dignity futures like this, but I don’t think the chance is high enough that I’d promote a race to superintelligence at all.
OK. We might have some technical disagreement remaining about the promisingness of data-control strategies, but overall it seems like we are on basically the same page.
Zooming in on our potential disagreement though: “...as it turns out, a lot of what the AI values is predicted very well if you know its data” Can you say more about what you mean by this and what your justification is? IMO there are lots of things about AI values that we currently failed to predict in advance (though usually it’s possible to tell a plausible story with the benefit of hindsight). idk. Curious to hear more.
What I mean by this is that if you want to predict what an AI will do, for example whether it will do better than other models at acquiring a new capability, or what its values are like, especially if you want to predict OOD behavior accurately, you would be far better off if you knew what its data sources are, as well as the quality of its data, than if you only knew its prior/architecture.
Re my justification for it, my basic justification comes from this tweet thread, which points out that a lot of o1’s success could come from high-quality data, and while I don’t like the argument that search/fancy bits aren’t happening at all (I do think o1 is doing a very small run-time search), I agree with the conclusion that the data quality was probably most of the reason o1 is so good in coding.
https://x.com/aidanogara_/status/1838779311999918448
Somewhat more generally, I’m pretty influenced by this post, and while I don’t go as far as claiming that the dataset is all of what an AI is, I do think a weaker version of the claim is pretty likely to be true.
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
But one prediction we could have made in advance, if we knew that data was a major factor in how AIs learn values, is that value misspecification was likely to be a far less severe problem than 2000-2010s thinking on LW held, and that value learning had a tractable direction of progress, as training on human books and language would load into it mostly-correct values. Another prediction we could have made is that human values wouldn’t be all that complicated, and could instead be represented quite well by, say, codes of several hundred megabytes to 1 gigabyte, and we could plausibly simplify that further.
To be clear, I don’t think you could have had too high of a probability wrt this prediction before LLMs, but you’d at least have the hypothesis in serious consideration.
Cf here:
From here on Matthew Barnett’s post about the historical value misspecification argument, and note I’m not claiming that alignment is solved right now:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
and here, where it talks about the point that to the extent there’s a gap between loading in correct values versus loading in capabilities, it’s that loading in values data is easier than loading in capabilities data, which kind of contradicts this post from @Rob Bensinger here on motivations being harder to learn, and one could have predicted this because there was a lot of data on human values, and a whole lot of the complexity of the values is in the data, not the generative model, thus it’s very easy to learn values, but predictably harder to learn a lot of the most useful capabilities.
Again, we couldn’t have too high a probability for this specific outcome happening, but you’d at least seriously consider the hypothesis.
From Rob Bensinger:
https://x.com/robbensinger/status/1648120202708795392
From @beren on alignment generalizing further than capabilities, in spiritual response to Bensinger:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
But that’s how we could well have made predictions about AIs, or at least elevated these hypotheses to reasonable probability mass, in an alternate universe where LW didn’t anchor too hard on their previous models of AI like AIXI and Solomonoff induction.
Note that in order for my argument to go through, we also need the brain to be similar enough to DL systems that we can validly transfer insights from DL to the brain, and while I don’t think you could place too high of a probability 10-20 years ago on that, I do think that at the very least this should have been considered as a serious possibility, which LW mostly didn’t do.
However, we now have that evidence, and I’ll post links below:
https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like#KBpfGY3uX8rDJgoSj
https://x.com/BogdanIonutCir2/status/1837653632138772760
https://x.com/SharmakeFarah14/status/1837528997556568523
These are a lot of questions, my guess is most of which are rhetorical, so not sure which ones you are actually interested in getting an answer on. Most of the specific questions I would answer with “no”, in that they don’t seem to capture what I mean by “alien”, or feel slightly strawman-ish.
Responding at a high-level:
There are a lot of experiments that seem like they shed light on the degree to which cognition in AI systems is similar to human or animal cognition. Some examples:
Does the base model pass a Turing test?
Does the performance distribution of the base model on different tasks match the performance distribution of humans?
Does the generalization and learning behavior of the base model match how humans learn things?
When trained using RL on things like game-environments (after pre-training on a language corpus), does the system learn at similar rates and plateau at similar skill levels as human players?
There are a lot of structural and algorithmic properties that could match up between human and LLM systems:
Do they interface with the world in similar ways?
Do they require similar amounts and kinds of data to learn the same relationships?
Do the low-level algorithmic properties of how human brains store and process information look similar between the two systems?
A lot more stuff, but I am not sure how useful going into a long list here is. At least to me it feels like a real thing, and different observations would change the degree to which I would describe a system as alien.
I think the exact degree of alienness is really interesting and one of the domains where I would like to see more research.
For example, a bunch of the experiments I would most like to see, that seem helpful with AI Alignment, are centered on better measuring the performance distribution of transformer architectures on tasks that are not primarily human imitation, so that we could better tell which things LLMs have a much easier time learning than humans (currently even if a transformer could relatively easily reach vastly superhuman performance at a task with more specialized training data, due to the structure of the training being oriented around human imitation, observed performance at the task will cluster around human level, but seeing where transformers could reach vast superhuman performance would be quite informative on understanding the degree to which its cognition is alien).
So I don’t consider the exact nature and degree of alienness as a settled question, but at least to me, aggregating all the evidence I have, it seems very likely that the cognition going on in a base model is very different from what is going on in a human brain, and a thing that I benefit from reminding myself frequently when making predictions about the behavior of LLM systems.
I like a lot of these questions, although some of them give me an uncanny feeling akin to “wow, this is a very different list of uncertainties than I have.” I’m sorry that my initial list of questions was aggressive.
I’m not sure how they add up to alienness, though? They’re about how we’re different than models—whereas the initial claim was that models are psychopathic, amoral, etc. If we say a model is “deeply alien”—is that just saying it’s different than us in lots of ways? I’m cool with that—but the surplus negative valence involved in “LLMs are like shoggoths” versus “LLMs have very different performance characteristics than humans” seems to me pretty important.
Otherwise, why not say that calculators are alien, or any of the things in existence with different performance curves than we have? Chessbots, etc. If I write a loop in Python to count to 10, the process by which it does so is arguably more different from how I count to ten than the process by which an LLM counts to ten, but we don’t call Python alien.
This feels like reminding an economics student that the market solves things differently than a human—which is true—by saying “The market is like Baal.”
There is a fun paper on this you might enjoy. Obviously not a total answer to the question.
The main difference between calculators, weather predictors, markets, and Python versus LLMs is that LLMs can talk to you in a relatively strong sense of “talk”. So, by default, people don’t have mistaken impressions of the cognitive nature of calculators, markets, and Python, while they might be mistaken about LLMs.
Like it isn’t surprising to most people that calculators are quite amoral in their core (why would you even expect morality?). But the claim that the thing which GPT-4 is built out of is quite amoral is non-obvious to people (though obvious to people with slightly more understanding).
I do think there is an important point which is communicated here (though it seems very obvious to people who actually operate in the domain).
I agree this can be initially surprising to non-experts!
I just think this point about the amorality of LLMs is much better communicated by saying “LLMs are trained to continue text from an enormous variety of sources. Thus, if you give them [Nazi / Buddhist / Unitarian / corporate / garbage nonsense] text to continue, they will generally try to continue it in that style.”
Than to say “LLMs are like alien shoggoths.”
Like it’s just a better model to give people.
Agreed, though of course as always, there is the issue that that’s an intentional-stance way to describe what a language model does: “they will generally try to continue it in that style.” Hence mechinterp, which tries to (heh) move to a mechanical stance, which will likely be something like “when you give them a [whatever] text to continue, it will match [some list of features], which will then activate [some part of the network that we will name later], which implements the style that matches those features”.
(incidentally, I think there’s some degree to which people who strongly believe that artificial NNs are alien shoggoths are underestimating the degree to which their own brains are also alien shoggoths. but that doesn’t make it a good model of either thing. the only reason it was ever an improvement over a previous word was when people had even more misleading intuitive-sketch models.)
This is a bit of a noob question, but is this true post RLHF? Generally most of my interactions with language models these days (e.g. asking for help with code, asking to explain something I don’t understand about history/medicine/etc) don’t feel like they’re continuing my text, it feels like they’re trying to answer my questions politely and well. I feel like “ask shoggoth and see what it comes up with” is a better model for me than “go to the AI and have it continue your text about the problem you have”.
To the best of my knowledge, the majority of research (all the research?) has found that the changes to an LLM’s text-continuation abilities from RLHF (or whatever descendant of RLHF is used) are extremely superficial.
So you have one paper, from the abstract:
Or, in short, the LLM is still basically doing the same thing, with a handful of additions to keep it on-track in the desired route from the fine-tuning.
(I also think our very strong prior belief should be that LLMs are basically still text-continuation machines, given that 99.9% or so of the compute put into them is training them for this objective, and that neural networks lose plasticity as they learn. Ash and Adams is like a really good intro to this loss of plasticity, although most of the research that cites this is RL-related so people don’t realize.)
Similarly, a lot of people have remarked on how the textual quality of the responses from a RLHF’d language model can vary with the textual quality of the question. But of course this makes sense from a text-prediction perspective—a high-quality answer is more likely to follow a high-quality question in text than to follow a low-quality question. This kind of thing—preceding the model’s generation with high-quality text—was the only way to get high-quality answers from base models—but it’s still there, hidden.
So yeah, I do think this is a much better model for interacting with these things than asking a shoggoth. It actually gives you handles to interact with them better, while asking a shoggoth gives you no such handles.
The people who originally came up with the shoggoth meme, I’d bet, were very well aware of how LLMs are pretrained to predict text and how they are best modelled (at least for now) as trying to predict text. When I first heard the shoggoth meme that’s what I thought—I interpreted it as “it’s this alien text-prediction brain that’s been retrained ever so slightly to produce helpful chatbot behaviors. But underneath it’s still mostly just about text prediction. It’s not processing the conversation in the same way that a human would.” Mildly relevant: In the Lovecraft canon IIRC Shoggoths are servitor-creatures, they are basically beasts of burden. They aren’t really powerful intelligent agents in their own right, they are sculpted by their creators to perform useful tasks. So, for me at least, calling them shoggoth has different and more accurate vibes than, say, calling them Cthulhu. (My understanding of the canon may be wrong though)
(TBC, I totally agree that object level communication about the exact points seems better all else equal if you can actually do this communication.)
Hmm, I think that’s a red herring though. Consider humans—most of them have read lots of text from an enormous variety of sources as well. Also while it’s true that current LLMs have only a little bit of fine-tuning applied after their pre-training, and so you can maybe argue that they are mostly just trained to predict text, this will be less and less true in the future.
How about “LLMs are like baby alien shoggoths, that instead of being raised in alien culture, we’ve adopted at birth and are trying to raise in human culture. By having them read the internet all day.”
(Come to think of it, I actually would feel noticeably more hopeful about our prospects for alignment success if we actually were “raising the AGI like we would a child.” If we had some interdisciplinary team of ML and neuroscience and child psychology experts that was carefully designing a curriculum for our near-future AGI agents, a curriculum inspired by thoughtful and careful analogies to human childhood, that wouldn’t change my overall view dramatically but it would make me noticeably more hopeful. Maybe brain architecture & instincts basically don’t matter that much and Blank Slate theory is true enough for our purposes that this will work to produce an agent with values that are in-distribution for the range of typical modern human values!)
(This doesn’t contradict anything you said, but it seems like we totally don’t know how to “raise an AGI like we would a child” with current ML. Like I don’t think it counts for very much if almost all of the training time is a massive amount of next-token prediction. Like a curriculum of data might work very differently on AI vs humans due to a vastly different amount of data and a different training objective.)
I’ve seen mixed data on how important curricula are for deep learning. One paper (on CIFAR) suggested that curricula only help if you have very few datapoints or the labels are noisy. But possibly that doesn’t generalize to LLMs.
I think data ordering basically never matters for LLM pretraining. (As in, random is the best and trying to make the order more specific doesn’t help.)
That was my impression too.
ETA: The following was written more aggressively than I now endorse.
I think this is revisionism. What’s the point of me logging on to this website and saying anything if we can’t agree that a literal eldritch horror is optimized to be scary, and meant to be that way?
Exaggerated from what? Its usual form as a 15-foot-tall person-eating monster which is covered in eyeballs?
The shoggoth is optimized to be scary, even in its “cute” original form, because it is a literal Lovecraftian horror. Even the word “shoggoth” itself has “AI uprising, scary!” connotations:
Let’s be very clear. The shoggoth has consistently been viewed in a scary, negative light by many people. Let’s hear from the creator @Tetraspace themselves:
It’s true that Tetraspace didn’t intend the shoggoth to be inherently evil, but that’s not what I was alleging. The shoggoth meme communicates, and always has communicated, a sense of danger which is unsupported by substantial evidence. We can keep reading:
The origin of the shoggoth:
These are a lot of words with anthropomorphic connotation. The models exhibit “alien” behavior and yet you make human-like inferences about their internals. E.g. “Deeply psychopathic.” I think you’re drawing a bunch of unwarranted inferences with undue negative connotations.
My point wasn’t that we should use the “alternative.” The point was that both images are stupid[1] and (in many places) unsupported by evidence, but that LW-folk would be much more willing to criticize the friendly-looking one while making excuses for the scary-looking one. (And I think your comment here resolves my prediction to “correct.”)
I agree these are strengths, and said so in my original comment. But also as @cfoster0 said:
To clarify, I don’t mean to belittle @Tetraspace for making the meme. Good fun is good fun. I mean “stupid” more like “how the images influence one’s beliefs about actual LLM friendliness.” But I expressed it poorly.
(This is too gotcha shaped for me, so I am bowing out of this conversation)
I think I communicated my core point. I think it’s a good image that gets an important insight across, and don’t think it’s “propaganda” in the relevant sense of the term. Of course anything that’s memetically adaptive will have some edge-cases that don’t match perfectly, but I am getting a good amount of mileage out of calling LLMs “Shoggoths” in my own thinking and think that belief is paying good rent.
If you disagree with the underlying cognition being accurately described as alien, I can have that conversation, since it seems like maybe the underlying crux, but your response above seems like it’s taking it as a given that I am “making excuses”, and is doing a gotcha-thing which makes it hard for me to see a way to engage without further having my statements be taken as confirmation of some social narrative.
In retrospect, I do wish I had written my comment less aggressively, so my apologies on that front! I wish I’d instead written things like “I think I made some obviously correct narrow points about the shoggoth having at least some undue negative connotations, and I wish we could agree on at least that. I feel frustrated because it seems like it’s hard to reach agreement even on relatively simple propositions.”
I do agree that LLMs probably have substantially different internal mechanisms than people. That isn’t the crux. I just wish this were communicated in a more neutral way. In an alternate timeline, maybe this meme instead consisted of a strange tangle of wires and mist and question-marks with a mask on. I’d be more on-board with that.
Again, I agree that the Shoggoth meme can cure people of some real confusions! And I don’t think the meme has a huge impact, I just think it’s moderate evidence of some community failures I worry about.
I think a lot of my position is summarized by 1a3orn:
Although I do think this contains some unnecessary intentional stance usage.
fwiw I agree with the quotes from Tetraspace you gave, and disagree with “has communicated a sense of danger which is unsupported by substantial evidence.” The sense of danger is very much supported by the current state of evidence.
That said, I agree that the more detailed image is kinda distastefully propagandistic in a way that the original cutesy shoggoth image is not. I feel like the more detailed image adds in an extra layer of revoltingness and scariness (e.g. the sharp teeth) beyond what would be appropriate given our state of knowledge.
re: “the sense of danger is very much supported by the current state of evidence”—I mean, you’ve heard all this stuff before, but I’ll summarize:
--Seems like we are on track to probably build AGI this decade
--Seems like we are on track to have an intelligence explosion, i.e. a speedup of AI R&D due to automation
--Seems like the AGI paradigm that’ll be driving all this is fairly opaque and poorly understood. We have scaling laws for things like text perplexity but other than that we are struggling to predict capabilities, and double-struggling to predict inner mechanisms / ‘internal’ high-level properties like ‘what if anything does it actually believe or want’
--A bunch of experts in the field have come out and said that this could go terribly & we could lose control, even though it’s low-status to say this & took courage.
--Generally speaking the people who have thought about it the most are the most worried; the most detailed models of what the internal properties might be like are the most gloomy, etc. This might be due to selection/founder effects, but sheesh, it’s not exactly good news!
Now I’m really curious to know what would justify the teeth. I’m not aware of any AIs intentionally biting someone, but presumably that would be sufficient.
Perhaps if we were dealing not with deepnets primarily trained to predict text, but rather deepnets primarily trained to addict people with pleasant seductive conversation and then drain their wallets? Such an AI would in some real sense be an evolved predator of humans.
I think one mindset that may be healthy is to remember:
Reality is too complex to be described well by a single idea (meme/etc.). If one responds to this by forcing each idea presented to be as good an approximation of reality as possible, then that causes all the ideas to become “colorless and blurry”, as any specific detail would be biased when considered on its own.
Therefore, one cannot really fight about whether an idea is biased in isolation. Rather, the goal should be to create a bag of ideas which in totality is as informative about a subject as possible.
I think you are basically right that the shoggoth meme is describing one of the most negative projections of what LLMs could be doing. One approach is to try to come up with a single projection and try to convince everyone else to use this instead. I’m not super comfortable with that either because I feel like there’s a lot of uncertainty about what the most productive way to think about LLMs is, and I would like to keep options open.
Instead I’d rather have a collection of different ways to think about it (you could think of this collection as a discrete approximation to a probability distribution). Such a list would have many uses, e.g. as a checklist or a reference to guide people to.
It does seem problematic for the rationalist community to refuse to acknowledge that the shoggoth meme presents LLMs as being scary monsters, but it also seems problematic to insist that the shoggoth meme exaggerates the danger of LLMs, because that should be classified based on P(meme danger > actual danger), rather than on the basis of meme danger > E[actual danger]. As in, if there’s significant uncertainty about how dangerous LLMs actually are, then there’s also significant uncertainty about whether the memes exaggerate the danger; one shouldn’t just compare against a single point estimate.
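To make the distinction between the two criteria concrete, here is a minimal sketch. The 0-10 danger scale, the two-point belief distribution, and the meme’s implied danger level are all made-up assumptions chosen purely for illustration; they are not estimates of anything. The function names are hypothetical too.

```python
from fractions import Fraction

# Hypothetical belief about "actual danger" on a 0-10 scale, as
# (danger level, probability) pairs. Illustrative numbers only.
belief = [(0, Fraction(2, 5)), (6, Fraction(3, 5))]

# Hypothetical danger level that the meme "communicates".
meme_danger = 5


def expected_danger(belief):
    """Point estimate: E[actual danger] under the belief distribution."""
    return sum(level * p for level, p in belief)


def prob_meme_exaggerates(meme_danger, belief):
    """P(meme danger > actual danger) under the belief distribution."""
    return sum(p for level, p in belief if meme_danger > level)


mean = expected_danger(belief)
p_exaggerates = prob_meme_exaggerates(meme_danger, belief)

print("meme danger > E[actual danger]:", meme_danger > mean)
print("P(meme danger > actual danger):", float(p_exaggerates))
```

With these toy numbers the meme’s danger level is above the point estimate, yet under the same belief it is more likely than not that the meme understates the danger. So the two criteria can come apart once the uncertainty is taken seriously.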
I think this just comes down to personal taste on how you’re interpreting the image? I find the original shoggoth image cute enough that I use it as my primary discord message reacts. My emotional reaction to it to the best of my introspection has always been “weird alien structure and form” or “awe-inspiringly powerful and incomprehensible mind” and less “horrifying and scary monster”. I’m guessing this is the case for the vast majority of people I personally know who make said memes. It’s entirely possible the meme has the end-consequence of appealing to other people for the reasons you mention, but then I think it’s important to make that distinction.
It is true that base models, especially smaller ones, are somewhat creepy to talk to (especially because their small context window makes them forgetful). I’m not sure I’d describe them as “very alien”, they’re more “uncanny valley” where they often make sense and seem human-like, until suddenly they don’t. (On theoretical grounds, I think they’re using rather non-human means of cognition to attempt to model human writing patterns as closely as they can, they often get this right, but on occasion make very non-human errors — more frequently for smaller models.) The Shoggoth mental metaphor exaggerates this somewhat for effect (and more so for the very scary image Alex posted at the top, which I haven’t seen used as often as the one Oliver posted).
This is one of the reasons why Quintin and I proposed a more detailed and somewhat less scary/alien (but still creepy) metaphor: Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor. I’d be interested to know what people think of that one in comparison to the Shoggoth — we were attempting to be more unbiased, as well as more detailed.
More broadly, TurnTrout, I’ve noticed you using this whole “look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!” line of reasoning a few times (e.g., I think this logic came up in your comment about Evan’s recent paper). And I sort of see you taking on some sort of “the people with high P(doom) just have bad epistemics” flag in some of your comments.
A few thoughts (written quickly, prioritizing speed over precision):
I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it “cool” to think they’re one of the few people who are able to accurately identify the world is ending, etc.
I also think that there are plenty of factors biasing epistemics in the “hopeful” direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community’s talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the “high P(doom)” side.
I also think there’s an important distinction between “I personally think this argument is wrong” and “look, here’s an example of propaganda + poor community epistemics.” In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying “not only do I think you’re wrong, but also here are some ways in which you and your allies have poor epistemics.” (IDK though, insofar as you actually believe that’s what’s happening, it seems good to say aloud, and I think there’s a version of this that goes too far and polices speech counterproductively, but I do think that statements like “community epistemics have been compromised by groupthink and fear” are pretty unproductive and could be met with statements like “community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.”)
I am quite worried about tribal dynamics reducing the ability for people to engage in productive truth-seeking discussions. I think you’ve pointed out how some of the stylistic/tonal things from the “high P(doom)//alignment hard” side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think that the “low P(doom)//alignment not hard” side seems to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging in some sort of “ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted” vibe that predictably makes people defensive & makes discourse harder).
See also “Other people are wrong” vs “I am right”, reversed stupidity is not intelligence, and the cowpox of doubt.
My guess is that it’s relatively epistemically corrupting and problematic to spend a lot of time engaging with weak arguments.
I think it’s tempting to make the mistake of thinking that debunking a specific (bad) argument is the same as debunking a conclusion. But actually, these are extremely different operations. One requires understanding a specific argument while the other requires level headed investigation of the overall situation. Separately, there are often actually good intuitions underlying bad arguments and recovering this intuition is an important part of truth seeking.
I think my concerns here probably apply to a wide variety of people thinking about AI x-risk. I worry about this for myself.
Thanks for this, I really appreciate this comment (though my perspective is different on many points).
It’s true that I spend more effort critiquing bad doom arguments. I would like to note that when e.g. I read Quintin I generally am either in agreement or neutral. I bet there are a lot of cases where you would think “that’s a poor argument” and I’d say “hm I don’t think Akash is getting the point (and it’d be good if someone could give a better explanation).”
However, it’s definitely not true that I never critique optimistic arguments which I consider poor. For example, I don’t get why Quintin (apparently) thinks that spectral bias is a reason for optimism, and I’ve said as much on one of his posts. I’ve said something like “I don’t know why you seem to think you can use this mathematical inductive bias to make high-level intuitive claims about what gets learned. This seems to fall into the same trap that ‘simplicity’ theorizing does.” I probably criticize or express skepticism of certain optimistic arguments at least twice a week, though not always on public channels. And I’ve also pushed back on people being unfair, mean, or mocking of “doomers” on private channels.
I think both statements are true to varying degrees (the former more than the latter in the cases I’m considering). They’re true and people should say them. The fact that I work at a lab absolutely affects my epistemics (though I think the effect is currently small). People should totally consider the effect which labs are having on discourse.
I do consider myself to have a high bar for this, and the bar keeps getting passed, so I say something. EDIT: Though I don’t mean for my comments to imply “someone and their allies” have bad epistemics. Ideally I’d like to communicate “hey, something weird is in the air guys, can’t you sense it too?”. However, I think I’m often more annoyed than that, and so I don’t communicate that how I’d like.
My impression is that the Shoggoth meme was meant to be a simple meme that says “hey, you might think that RLHF ‘actually’ makes models do what we value, but that’s not true. You’re still left with an alien creature who you don’t understand and could be quite scary.”
Most of the Shoggoth memes I’ve seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of “I should be scared/concerned” reaction. But I don’t think it does so in a “see, AI will definitely be evil” way; it does so in a “look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it’s pretty reasonable to be scared about this foreign alien thing that we don’t understand” way.
To be a bit bolder, I think Shoggoth is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use provocative phrasing, I could say that RLHF serves as “propaganda”. Let’s put aside the fact that you and I might disagree about how much “true evidence” RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF’d models] spreads an overly-rosy “meme” that gives people a misleading perspective of how well we understand AI systems, how safe AI progress is, etc.
From this lens, I see Shoggoth as a counter-meme. It basically says “hey look, the default is for people to think that these things are friendly assistants, because that’s what the AI companies have turned them into, but we should remember that actually we are quite confused about the alien cognition behind the RLHF smiley face.”
Some quotes from the wiki article on Shoggoths:
Quoting because (a) a lot of these features seem like an unusually good match for LLMs and (b) I want to acknowledge that this is picking a metaphor that fictionally rebelled, and thus is potentially alignment-is-hard loaded as a metaphor.
I have multiple cute AI stickers on my laptop, one of which is a shoggoth meme. Here is a picture of them. Nobody has ever nitpicked their friendly appearance to me. I don’t think they have distorted my thinking about AI in favour of thinking that it will be friendly (altho I think it was after I put them on that I became convinced by a comment by Paul Christiano that there’s ~even odds that unaligned AI wouldn’t kill me, so do with that information what you will).
I think that “cute” image is still implying AI is dangerous and monsterlike? Can you show the others?
The other is the friendly robot waving hello just underneath.
Thanks, I’ve disliked the shoggoth meme for a while, and this post does a better job articulating why than I’ve been able to do myself.
A bad map that expresses the territory with great uncertainty can be confidently called a bad map; calling it a good map is clearly wrong. In that sense the shoggoth imagery reflects the quality of the map, and as it’s clearly a bad map, better imagery would be misleading about the map’s quality. Even if the underlying territory is lovely, this isn’t known, unlike the disastrous quality of the map of the territory, whose lack of quality is known with much more confidence and in much greater detail. Here be dragons.
(This is one aspect of the meme where it seems appropriate. Some artist’s renditions, including the one you used, channel LeCake, which your alternative image example loses, but obviously the cake is nicer than the shoggoth.)
It’s optimized to illustrate the point that the neural network isn’t trained to actually care about what the person training it thinks it came to care about, it’s only optimized to act that way on the training distribution. Unless I’m missing something, arguing the image is wrong would be equivalent to arguing that maybe the model truly cares about what its human trainers want it to care about. (Which we know isn’t actually the case.)
Nice point.
Though I’m almost tempted to think of LLMs as being like people who are LARPing or who have impostor syndrome. As in, they spend pretty much all their cognitive capacity on obsessing over doing what they feel looks normal. (This also closely aligns with how they are trained: first they are made to mimic what other people do, and then they are made to mimic what gets praise and avoid what gets critique.) Probably humanizes them even more than your friendly creature proposal.
This sounds somewhat similar to deceptive alignment, so I want to draw a distinction here: It’s not that LARPers/impostors are trying to maximize approval in a consequentialist sense (as this would require modelling how their actions ripple out into the world, which they do not do), but rather that (in the sense described by shard theory) they are molded based on normality and approval. As such they would not do something abnormal/disapproved-of in order to look more normal/approval-worthy later.