Dreams of AI alignment: The danger of suggestive names
Let’s not forget the old, well-read post: Dreams of AI Design. In that essay, Eliezer correctly points out errors in imputing meaning to nonsense by using suggestive names to describe the nonsense.
Artificial intelligence meets natural stupidity (an old memo) is very relevant to understanding the problems facing this community’s intellectual contributions. Emphasis added:
A major source of simple-mindedness in AI programs is the use of mnemonics like “UNDERSTAND” or “GOAL” to refer to programs and data structures. This practice has been inherited from more traditional programming applications, in which it is liberating and enlightening to be able to refer to program structures by their purposes. Indeed, part of the thrust of the structured programming movement is to program entirely in terms of purposes at one level before implementing them by the most convenient of the (presumably many) alternative lower-level constructs.
… If a researcher tries to write an “understanding” program, it isn’t because he has thought of a better way of implementing this well-understood task, but because he thinks he can come closer to writing the first implementation. If he calls the main loop of his program “UNDERSTAND”, he is (until proven innocent) merely begging the question. He may mislead a lot of people, most prominently himself, and enrage a lot of others.
What he should do instead is refer to this main loop as G0034, and see if he can convince himself or anyone else that G0034 implements some part of understanding. Or he could give it a name that reveals its intrinsic properties, like NODE-NET-INTERSECTION-FINDER, it being the substance of his theory that finding intersections in networks of nodes constitutes understanding... When you say (GOAL ...), you can just feel the enormous power at your fingertips. It is, of course, an illusion.[1]
Of course, Conniver has some glaring wishful primitives, too. Calling “multiple data bases” CONTEXTS was dumb. It implies that, say, sentence understanding in context is really easy in this system...
Consider the following terms and phrases:
“LLMs are trained to predict/simulate”
“LLMs are predictors” (and then trying to argue the LLM only predicts human values instead of acting on them!)
“Attention mechanism” (in self-attention)
“AIs are incentivized to” (when talking about the reward or loss functions, thus implicitly reversing the true causality; reward optimizes the AI, but the AI probably won’t optimize the reward; see the code sketch below the list)
“Reward” (implied to be favorable-influence-in-decision-making)
“{Advantage, Value} function”
“The purpose of RL is to train an agent to maximize expected reward over time” (perhaps implying an expectation and inner consciousness on the part of the so-called “agent”)
“Agents” (implying volition in our trained artifact… generally cuz we used a technique belonging to the class of algorithms which humans call ‘reinforcement learning’)
“Power-seeking” (AI “agents”)
“Shoggoth”
“Optimization pressure”
“Utility”
As opposed to thinking of it as “internal unit of decision-making incentivization, which is a function of internal representations of expected future events; minted after the resolution of expected future on-policy inefficiencies relative to the computational artifact’s current decision-making influences”
“Discount rate” (in deep RL, implying that an external future-learning-signal multiplier will ingrain itself into the AI’s potential inner plan-grading-function which is conveniently assumed to be additive-over-timesteps, and also there’s just one such function and also it’s Markovian)
“Inner goal / mesa objective / optimization daemon (yes that was a real name)”
“Outer optimizer” (perhaps implying some amount of intentionality; a sense that ‘more’ optimization is ‘better’, even at the expense of generalization of the trained network)
“Optimal” (as opposed to equilibrated-under-policy-updates)
“Objectives” (when conflating a “loss function as objective” and “something which strongly controls how the AI makes choices”)
“Training” (in ML)
(Yup, even this one.)
“Learning” (in ML)
“Simplicity prior”
Consider the abundance of amateur theorizing about whether “schemers” will be “simpler” than “saints”, or whether they will be supplanted by “sycophants.” Such theorizing is sometimes conducted in ignorance of inductive bias research, which is a real subfield of ML.
Lest this all seem merely amusing, meditate on the fate of those who have tampered with words before. The behaviorists ruined words like “behavior”, “response”, and, especially, “learning”. They now play happily in a dream world, internally consistent but lost to science. And think about this: if “mechanical translation” had been called “word-by-word text manipulation”, the people doing it might still be getting government money.
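To make the “reward optimizes the AI, but the AI probably won’t optimize the reward” item above concrete, here is a minimal policy-gradient sketch (my own toy example; the bandit setup, numpy usage, and all names are illustrative, not from the post or any particular codebase). Reward shows up only as a multiplier on parameter updates; nothing in the trained artifact has to represent, want, or pursue “reward”:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-armed bandit trained with REINFORCE.
theta = np.zeros(2)  # logits over two actions; this is the whole "policy network"

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def reward(action):
    return 1.0 if action == 1 else 0.0  # the environment pays out for arm 1

lr = 0.5
for _ in range(200):
    probs = policy(theta)
    a = rng.choice(2, p=probs)
    r = reward(a)
    # Gradient of log pi(a | theta) for a softmax policy: one_hot(a) - probs.
    grad_logp = -probs
    grad_logp[a] += 1.0
    # Reward enters ONLY here, as a scalar scaling the update direction.
    # It is a quantity that reshapes theta; theta ends up storing action
    # tendencies, not a copy of, or a desire for, the reward signal itself.
    theta += lr * r * grad_logp

print(policy(theta))  # heavily favors arm 1 after training
```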
Some of these terms are useful. Some of the academic imports are necessary for successful communication. Some of the terms have real benefits.
That doesn’t stop them from distorting your thinking. At least in your private thoughts, you can do better. You can replace “optimal” with “artifact equilibrated under policy update operations” or “set of sequential actions which have subjectively maximal expected utility relative to [entity X]’s imputed beliefs”, and the nice thing about brains is that these long sentences can compress into single concepts which you can understand in but a moment.
It’s easy to admit the mistakes of our past selves (whom we’ve conveniently outgrown by the time of recounting). It’s easy for people (such as my past self and others in this community) to sneer at out-group folks when they make such mistakes, the mistakes’ invalidity laid bare before us.
It’s hard when you’ve[2] read Dreams of AI Design and utterly failed to avoid the same mistakes yourself. It’s hard when your friends are using the terms, and you don’t want to be a blowhard about it and derail the conversation by explaining your new term. It’s hard when you have utterly failed to resist inheriting the invalid connotations of other fields (“optimal”, “reward”, “attention mechanism”).
I think we have failed, thus far. I’m sad about that. When I began posting in 2018, I assumed that the community was careful and trustworthy. Not easily would undeserved connotations sneak into our work and discourse. I no longer believe that and no longer hold that trust.
To be frank, I think a lot of the case for AI accident risk comes down to a set of subtle word games.
When I try to point out such (perceived) mistakes, I feel a lot of pushback, and somehow it feels combative. I do get somewhat combative online sometimes (and wish I didn’t, and am trying different interventions here), and so maybe people combat me in return. But I perceive defensiveness even to the critiques of Matthew Barnett, who seems consistently dispassionate.
Maybe it’s because people perceive me as an Optimist and therefore my points must be combated at any cost.
Maybe people really just naturally and unbiasedly disagree this much, though I doubt it.
But the end result is that I have given up on communicating with most folk who have been in the community longer than, say, 3 years. I don’t know how to disabuse people of this trust, which seems unearned.
All to say: Do not trust this community’s concepts and memes, if you have the time. Do not trust me, if you have the time. Verify.
See also: Against most, but not all, AI risk analogies.
- ^
How many times has someone expressed “I’m worried about ‘goal-directed optimizers’, but I’m not sure what exactly they are, so I’m going to work on deconfusion.”? There’s something weird about this sentiment, don’t you think? I can’t quite put my finger on what, and I wanted to get this post out.
- ^
Including myself, and I suspect basically every LW enthusiast interested in AI.
I agree that many terms are suggestive, and you have to actually dissolve the term and think about what is really going on in the exact training process to understand things. If people don’t break down the term and understand the process at least somewhat mechanistically, they’ll run into trouble.
I think relevant people broadly agree about terms being suggestive and agree that this is bad; they don’t particularly dispute this. (Though probably a bunch of people think it’s less important than you do. I think these terms aren’t that bad once you’ve worked with them in a technical/ML context to a sufficient extent that you detach preexisting conceptions and think about the actual process.)
But, this is pretty different from a much stronger claim you make later:
I don’t think that “a lot of the case for AI accident risk comes down to a set of subtle word games”. (At least not in the cases for risk which seem reasonably well made to me.) And, people do really “disagree this much” about whether the case for AI accident risk comes down to word games. (But they don’t disagree this much about issues with terms being suggestive.)
It seems important to distinguish between “how bad is current word usage in terms of misleading suggestiveness” and “is the current case for AI accident risk coming down to subtle word games”. (I’m not claiming you don’t distinguish between these, I’m just claiming that arguments for the first aren’t really much evidence for the second here and that readers might miss this.)
For instance, do you think that this case for accident risk comes down to subtle word games? I think there are a bunch of object-level ways this threat model could be incorrect, but this doesn’t seem downstream of word usage.
Separately, I agree that many specific cases for AI accident risk seem pretty poor to me. (Though the issue still doesn’t seem like a word-games issue as opposed to generically sloppy reasoning or generally having bad empirical predictions.) And the ones which aren’t poor remain somewhat vague, though this is slowly improving over time.
So I basically agree with:
Edit: except that I do think the general take of “holy shit AI (and maybe the singularity), that might be a really big deal” seems pretty solid. And, from there I think there is a pretty straightforward and good argument for at least thinking about the accident risk case.
I’m not sure whether the case for risk in general depends on word-games, but the case for x-risk from GPTs sure seems to. I think people came up with those word-games partly in response to people arguing that GPTs give us general AI without x-risk?
On this, to be specific, I don’t think that the suggestive use of reward is important here for the correct interpretation of the argument (though the suggestiveness of reward might lead people to think the argument is stronger than it actually is).
See e.g. here for further discussion.
I propose that, while the object level thing is important and very much something I’d like to see addressed, it might be best separated from discussion of the communication and reasoning issues relating to imprecise words.
I wasn’t meaning to support that claim via this essay. I was mentioning another belief of mine.
I think so.
Yup, I think a substantial portion of it does hinge on word games!
I think some of them don’t, because some of them (I think?) invented these terms and continue to use them. I think this website would look far different if people were careful about their definitions and word choices.
In this particular case, Ajeya does seem to lean on the word “reward” pretty heavily when reasoning about how an AI will generalize. Without that word, it’s harder to justify privileging specific hypotheses about what long-term goals an agent will pursue in deployment. I’ve previously complained about this here.
Ryan, curious if you agree with my take here.
I disagree.
I think Ajeya is reasonably careful about the word reward. (Though I think I roughly disagree with the overall vibe of the post with respect to this in various ways. In particular, the “number in the datacenter” case seems super unlikely.)
See e.g. the section starting with:
More generally, I feel like the overall section here (which is the place where the reward related argument comes into force) is pretty careful about this and explains a more general notion of possible correlates that is pretty reasonable.
ETA: As in, you could replace reward with “thing that resulted in reinforcement in an online RL context” and the argument would stand totally fine.
As far as your shortform, I think the responses from Paul and Ajeya are pretty reasonable.
(Another vibe disagreement I have with “without specific countermeasures” is that I think that very basic countermeasures might defeat the “pursue a correlate of the thing that resulted in reinforcement in an online RL context” threat, as long as humans would have been able to recognize the dangerous actions from the AI as bad. Thus, probably some sort of egregious auditing/oversight error is required for this exact threat model to be a serious issue. The main countermeasure is just training another copy of the model as a monitor based on a dataset of bad actions we label. If our concern is “AIs learn to pursue what performed well in training”, then there isn’t a particular reason for this monitor to fail (though the policy might try to hack it with an adversarial input etc).)
After your careful analysis on AI control, what threat model is the most likely to be a problem, assuming basic competence from human use of control mechanisms?
This probably won’t be a very satisfying answer; thinking about this in more detail so that I have a better short, cached response is on my list.
My general view (not assuming basic competence) is that misalignment x-risk is about half due to scheming (aka deceptive alignment) and half due to other things (more like “what failure looks like part 1”, sudden failures due to seeking-upstream-correlates-of-reward, etc).
I think control type approaches make me think that a higher fraction of the remaining failures come from an inability to understand what AIs are doing. So, somewhat less of the risk is very directly from scheming and more from “what failure looks like part 1”. That said, “what failure looks like part 1” type failures are relatively hard to work on in advance.
Ok, so the failure stops being AI models coordinating a mass betrayal and becomes goodharting metrics to the point that nothing works right. Not fundamentally different from a command economy failing where the punishment for missing quotas is gulag, and the punishment for lying on a report is gulag but later, so...
There’s also nothing new about the failures, the USA incarceration rate is an example of what “trying too hard” looks like.
I have read many of your posts on these topics, appreciate them, and I get value from the model of you in my head that periodically checks for these sorts of reasoning mistakes.
But I worry that the focus on ‘bad terminology’ rather than reasoning mistakes themselves is misguided.
To choose the most clear cut example, I’m quite confident that when I say ‘expectation’ I mean ‘weighted average over a probability distribution’ and not ‘anticipation of an inner consciousness’. Perhaps some people conflate the two, in which case it’s useful to disabuse them of the confusion, but I really would not like it to become the case that every time I said ‘expectation’ I had to add a caveat to prove I know the difference, lest I get ‘corrected’ or sneered at.
For a probably more contentious example, I’m also reasonably confident that when I use the phrase ‘the purpose of RL is to maximise reward’, the thing I mean by it is something you wouldn’t object to, and which does not cause me confusion. And I think those words are a straightforward way to say the thing I mean. I agree that some people have mistaken heuristics for thinking about RL, but I doubt you would disagree very strongly with mine, and yet if I was to talk to you about RL I feel I would be walking on eggshells trying to use long-winded language in such a way as to not get me marked down as one of ‘those idiots’.
I wonder if it’s better, as a general rule, to focus on policing arguments rather than language? If somebody uses terminology you dislike to generate a flawed reasoning step and arrive at a wrong conclusion, then you should be able to demonstrate the mistake by unpacking the terminology into your preferred version, and it’s a fair cop.
But until you’ve seen them use it to reason poorly, perhaps it’s a good norm to assume they’re not confused about things, even if the terminology feels like it has misleading connotations to you.
There’s a difficult problem here.
Personally, when I see someone using the sorts of terms Turner is complaining about, I mentally flag it (and sometimes verbally flag it, saying something like “Not sure if it’s relevant yet, but I want to flag that we’re using <phrase> loosely here, we might have to come back to that later”). Then I mentally track both my optimistic-guess at what the person is saying, and the thing I would mean if I used the same words internally. If and when one of those mental pictures throws an error in the person’s argument, I’ll verbally express confusion and unroll the stack.
A major problem with this strategy is that it taxes working memory heavily. If I’m tired, I basically can’t do it. I would guess that people with less baseline working memory to spare just wouldn’t be able to do it at all, typically. Skill can help somewhat: it helps to be familiar with an argument already, it helps to have the general-purpose skill of keeping at least one concrete example in one’s head, it helps to ask for examples… but even with the skills, working memory is a pretty important limiting factor.
So if I’m unable to do the first-best thing at the moment, what should I fall back on? In practice I just don’t do a very good job following arguments when tired, but if I were optimizing for that… I’d probably fall back on asking for a concrete example every time someone uses one of the words Turner is complaining about. Wording would be something like “Ok pause, people use ‘optimizer’ to mean different things, can you please give a prototypical example of the sort of thing you mean so I know what we’re talking about?”.
… and of course when reading something, even that strategy is a pain in the ass, because I have to e.g. leave a comment asking for clarification and then the turn time is very slow.
I’m sympathetic to your comment, but let me add some additional perspective.
While using (IMO) imprecise or misleading language doesn’t guarantee you’re reasoning improperly, it is evidence from my perspective. As you say, that doesn’t mean one should “act” on that evidence by “correcting” the person, and often I don’t. Just the other day I had a long conversation where I and the other person both talked about the geometry of the mapping from reward functions to optimal policies in MDPs.
I think I do generally only criticize terminology when I perceive an actual reasoning mistake. This might be surprising, but that’s probably because I perceive reasoning mistakes all over the place in ways which seem tightly intertwined with language and word-games.
Exception: if someone has signed up to be mentored by me, I will mention “BTW I find it to help my thinking to use word X instead of Y, do what you want.”
You might have glossed over the part where I tried to emphasize “at least try to do this in the privacy of your own mind, even if you use these terms to communicate with other people.” This part interfaces with your “eggshells” concern.
It’s important to realize that such language creates a hostile environment for reasoning, especially for new researchers. Statistically, some people will be misled, and the costs can be great. To be concrete, I probably wasted about 3,000 hours of my life due to these “word games.”
Nearly all language has undue technical connotations. For example, “reinforcement” is not a perfectly neutral technical word, but it sure is better than “reward.” Furthermore, I think that we can definitely do better than using extremely loaded terms like “saints.”
Well, not quite what I was trying to advocate. I didn’t conclude that many people are confused about things because I saw their words and thought they were bad. I concluded that many people are confused about things because I repeatedly:
saw their words,
thought the words were bad,
talked with the person and perceived reasoning mistakes mirroring the badness in their words,
and then concluded they are confused!
I particularly wish people would taboo the word “optimize” more often. Referring to a process as “optimization” papers over questions like:
What feedback loop produces the increase or decrease in some quantity that is described as “optimization?” What steps does the loop have?
In what contexts does the feedback loop occur?
How might the effects of the feedback loop change between iterations? Does it always have the same effect on the quantity?
What secondary effects does the feedback loop have?
There’s a lot hiding behind the term “optimization,” and I think a large part of why early AI alignment research made so little progress was because people didn’t fully appreciate how leaky of an abstraction it is.
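As one concrete illustration of how many separate pieces hide behind that single word, here is a minimal toy “optimization” feedback loop (my own sketch, plain gradient descent on a fixed quadratic) for which each of the questions above has a crisp answer:

```python
# A minimal sketch (my own toy example) of one concrete "optimization" loop.
# For this loop, the questions above have definite answers:
#   - quantity being decreased: loss(x) = (x - 3)^2
#   - steps of the loop: measure the gradient, move a small step against it
#   - context: a fixed, stationary objective (no shifting data distribution)
#   - effect across iterations: steps shrink as the gradient shrinks
#   - secondary effects: none here; in real ML training the loop also changes
#     what data and situations the system encounters, which is one place the
#     abstraction starts to leak.

def loss(x: float) -> float:
    return (x - 3.0) ** 2

def grad(x: float) -> float:
    return 2.0 * (x - 3.0)

x = 0.0   # initial parameter
lr = 0.1  # step size
for _ in range(100):
    x -= lr * grad(x)  # the entirety of the "optimization pressure" in this system

print(round(x, 4), round(loss(x), 8))  # approximately 3.0 and 0.0
```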
I empathize with this, and have complained similarly (e.g. here).
I have also been trying to figure out why I feel quite a strong urge to push back on posts like this one. E.g. in this case I do in fact agree that only a handful of people actually understand AI risk arguments well enough to avoid falling into “suggestive names” traps. But I think there’s a kind of weak man effect where if you point out enough examples of people making these mistakes, it discredits even those people who avoid the trap.
Maybe another way of saying this: of course most people are wrong about a bunch of this stuff. But the jump from that to claiming the community or field has failed isn’t a valid one, because the success of a field is much more dependent on max performance than mean performance.
Not saying that it’s fun or even obviously net-positive for all participants, but I think combative communication is better than no communication, as far as truth-seeking goes.
Sure, but what if what’s left is risky enough? Maybe utility maximization is a bad model of future AI (maybe because it’s hard to predict the technology that doesn’t exist yet) - but what’s the alternative? Isn’t labelling some empirical graph that ignores warning signs “awesomeness” and extrapolating it more of a word game?
Do you have some concrete examples where you’ve explained how some substantial piece of the case for AI accident risk is a matter of word games?
This is a pretty good essay, and I’m glad you wrote it. I’ve been thinking similar thoughts recently, and have been attempting to put them into words. I have found myself somewhat more optimistic and uncertain about my models of alignment due to these realizations.
Anyway, on to my disagreements.
I don’t think that “Dreams of AI Design” was an adequate essay to get people to understand this. These distinctions are subtle, and as you might tell, not an epistemological skill that comes native to us. “Dreams of AI Design” is about confusing the symbol with the substance -- '5 with 5, in Lisp terms, or the variable name five with the value of 5 (in more general programming language terms). It is about ensuring that all the symbols you use to think with actually are mapping onto some substance. It is not about the more subtle art of noticing that you are incorrectly equivocating between a pre-theoretic concept such as “optimization pressure” and the actual process of gradient updates. I suspect that Eliezer may have made at least one such mistake that may have made him significantly more pessimistic about our chances of survival. I know I’ve made this mistake dozens of times. I mean, my username is “mesaoptimizer”, and I don’t endorse that term or concept anymore as a way of thinking about the relevant parts of the alignment problem.
I’ve started to learn to be less neurotic about ensuring that people’s vaguely defined terms actually map onto something concrete, mainly because I have started to value the fact that these vaguely defined terms, if not incorrectly equivocated, hold valuable information that we might otherwise lose. Perhaps you might find this helpful.
I empathize with those pushing back, because to a certain extent it seems like what you are stating seems obvious to someone who has learned to translate these terms into the more concrete locally relevant formulations ad-hoc, and given such an assumption, it seems like you are making a fuss about something that doesn’t really matter and in fact even reaching for examples to prove your point. On the other hand, I expect that ad-hoc adjustment to such terms is insufficient to actually do productive alignment research—I believe that the epistemological skill you are trying to point at is extremely important for people working in this domain.
I’m uncertain about how confused senior alignment researchers are when it comes to these words and concepts. It is likely that some may have cached some mistaken equivocations and are therefore too pessimistic and fail to see certain alignment approaches panning out, or too optimistic and think that we have a non-trivial probability of getting our hands on a science accelerator. And deference causes a cascade of everyone (by inference or by explicit communication) also adopting these incorrect equivocations.
I agree with you that people get sloppy with these terms, and this seems bad. But there’s something important to me about holding space for uncertainty, too. I think that we understand practically every term on this list exceedingly poorly. Yes, we can point to things in the world, and sometimes even the mechanisms underlying them, but we don’t know what we mean in any satisfyingly general way. E.g. “agency” does not seem well described to me as “trained by reinforcement learning.” I don’t really know what it is well described by, and that’s the point. Pretending otherwise only precludes us from trying to describe it better.
I think there’s a lot of room for improvement in how we understand minds, i.e., I expect science is possible here. So I feel wary of mental moves such as these, e.g., replacing “optimal” with “set of sequential actions which have subjectively maximal expected utility relative to [entity X]‘s imputed beliefs,” as if that settled the matter. Because I think it gives a sense that we know what we’re talking about when I don’t think we do. Is a utility function the right way to model an agent? Can we reliably impute beliefs? How do we know we’re doing that right, or that when we say ‘belief’ it maps to something that is in fact like a belief? What is a belief? Why actions instead of world states? And so on.
It seems good to aim for precision and gears-level understanding wherever possible. But I don’t want this to convince us that we aren’t confused. Yes, we could replace the “tool versus agent” debate with things like “was it trained via RL or not,” or what have you, but it wouldn’t be very satisfying because ultimately that isn’t the thing we’re trying to point at. We don’t have good definitions of mind-type things yet, and I don’t want us to forget that.
My prescription for the problem you’re highlighting here: track a prototypical example.
Trying to unpack e.g. “optimization pressure” into a good definition—even an informal definition—is hard. Most people who attempt to do that will get it wrong, in the sense that their proffered definition will not match their own intuitive usage of the term (even in cases where their own intuitive usage is coherent), or their implied usage in an argument. But their underlying intuitions about “optimization pressure” are often still correct, even if those intuitions are not yet legible. Definitions, though an obvious strategy, are not a very good one for distinguishing coherent word usage from incoherent word usage.
(Note that OP’s strategy of unpacking words into longer phrases is basically equivalent to using definitions, for our purposes.)
So: how can we track coherence of word usage, practically?
Well, we can check coherence constructively, i.e. by exhibiting an example which matches the usage. If someone is giving an argument involving “optimization pressure”, I can ask for a prototypical example, and then walk through the argument in the context of the example to make sure that it’s coherent.
For instance, maybe someone says “unbounded optimization wants to take over the world”. I ask for an example. They say “Well, suppose we have a powerful AI which wants to make as many left shoes as possible. If there’s some nontrivial pool of resources in the universe which it hasn’t turned toward shoe-making, then it could presumably make more shoes by turning those resources toward shoe-making, so it will seek to do so.”. And then I can easily pick out a part of the example to drill in on—e.g. maybe I want to focus on the implicit “all else equal” and argue that all else will not be equal, or maybe I want to drill into the “AI which wants” part and express skepticism about whether anything like current AI will “want” things in the relevant way (in which case I’d probably ask for an example of an AI which might want to make as many left shoes as possible), or [whatever else].
The key thing to notice is that I can easily express those counterarguments in the context of the example, and it will be relatively clear to both myself and my conversational partner what I’m saying. Contrast that to e.g. just trying to use very long phrases in place of “unbounded optimization” or “wants”, which makes everything very hard to follow.
In one-way “conversation” (e.g. if I’m reading a post), I’d track such a prototypical example in my head. (Well, really a few prototypical examples, but 80% of the value comes from having any.) Then I can relatively-easily tell when the argument given falls apart for my example.
Somewhat related: how do we not have separate words for these two meanings of ‘maximise’?
literally set something to its maximum value
try to set it to a big value, the bigger the better
Even what I’ve written for (2) doesn’t feel like it unambiguously captures the generally understood meaning of ‘maximise’ in common phrases like ‘RL algorithms maximise reward’ or ‘I’m trying to maximise my income’. I think the really precise version would be ‘try to affect something, having a preference ordering over outcomes which is monotonic in their size’.
But surely this concept deserves a single word. Does anyone know a good word for this, or feel like coining one?
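For reference, one way to write the two senses precisely (a sketch in my own notation, not a proposal for the missing word):

```latex
% Sense (1): literally attain the maximum of f over the choice set A.
a^{*} \in \arg\max_{a \in A} f(a)

% Sense (2): act on a preference ordering that is monotone in f, i.e. prefer
% a to a' whenever f(a) \ge f(a'), with no guarantee the global maximum is
% ever reached (e.g. local hill-climbing on f).
a \succeq a' \iff f(a) \ge f(a')
```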
I love this post. I have been one who has perceived your comments as potentially combative before. I have a hunch that it was an impression you were implicitly asserting “no, I should be the one who is trusted!” that set off alarm bells—and this post unambiguously communicates the opposite. In particular:
I agree, and I generally think this has been a problem for a long time. I will believe it is possible to avoid this problem at scale when I meet a group of people who have empirically demonstrated themselves to have scaled it, and at the moment I don’t know off the top of my head how I’d recognize that other than a years-long track record. Fields of mathematics? I suppose there are fields of science that seem to do at least acceptably. But I don’t know of anyone doing really well without the help of mathematically precise definitions, and some more stuff besides. In the meantime,
… well, wait, hold on—I find myself not quite sure how to echo the sentiment I see in this post in a way I can agree I should implement for myself. Perhaps something along the lines of, I could return to my intermittent habit of avoiding words that have been heavily used. But perhaps more important than that is sticking to words whose meaning I can construct mathematically, in the sense of constructivist logic? I’m actually not sure the takeaway here is obvious in the first place now. Hmm.
This community inherited the concept of “goal-directed optimizers” and attempted formalizations of it from academia (e.g., vNM decision theory, AIXI). These academic ideas also clearly describe aspects of reality (e.g., decision theory having served as the foundation of economics for several decades now).
Given this, are we not supposed to be both worried (due to the threatening implications of modeling future AIs as goal-directed optimizers) and also confused (due to existing academic theories having various open problems)? Or what is the “not weird” response or course of action here?
One “not weird” response, IMO, is to say “well, maybe that’s not the best way to think about trained networks and their effects”, and just keep the frame in the back of one’s mind as one works on other problems in alignment. Even though those ideas do have meaningful uses, that doesn’t mean they’re relevant for this particular field of inquiry, or a reasonable way of making progress given the research frontier.
This seems fine if you’re trying to understand how current or near-future ML models work and how to make them safer, but I think in the longer run it seems inevitable that we eventually end up with AIs that are more or less well-described as “goal-directed optimizers”, so studying this concept probably won’t be “wasted” even if it’s not directly useful now.
Aside from a technical alignment perspective, it also seems strategically important to better understand how to model future goal-directed AIs, for example whether their decision/game theories will allow unaligned AIs to asymmetrically extort aligned AIs (or have more bargaining power because they have less to lose than aligned AI), or whether acausal trade will be a thing. This seems important input into various near-term decisions such as how much risk of unaligned AI we should tolerate.
Personally I prioritize studying metaphilosophy above topics directly related to “goal-directed optimizers” such as decision theory, as I see the former as a bit more urgent and neglected than the latter, but also find it hard to sympathize with describing the study of the latter as “weird”.
I disagree. A few related notes:
Maybe we’re talking past each other? Consider the following:
Easy to defend but weak claim: “People will eventually have AIs which act autonomously to achieve certain tasks.”
(I believe this)
Hard to defend but load-bearing claim: “Real-world AIs will (eventually) have specific kinds of internal consequentialist structure posited by classic alignment theory, as opposed to all of the other kinds of structure they could have.”
(I don’t believe this)
You might be amazed that I seem to deny the defensible claim, while I might be amazed that you seem to believe the load-bearing claim without visible argumentation?
It is totally possible (and probable) that this particular cluster of untested speculative theory will end up being irrelevant and not bound to reality.
There is no necessary reason for all of the “goal-directed optimizer” thinking—wait, let’s taboo that and say “worldview-421 speculation”—there is no necessary reason for all of the worldview-421 speculation to end up being relevant or realistic. So it could totally be a waste of time, and I think it probably is.
Happy to share my reasons/arguments:
I think I’m in part a goal-directed optimizer. I want to eventually offload all of the cognition involved in being a goal-directed optimizer to a superintelligence, as opposed to having some part of it being bottlenecked by my (suboptimal/slow/unstable/unsafe) biological brain. I think this describes or will probably describe many other humans.
Competitive pressures may drive people to do this even if they aren’t ready or wouldn’t want to in the absence of such pressures.
Some people (such as e/accs) seem happy to build any kind of AGI without considerations of safety (or think they’ll be automatically safe) and therefore may build a goal-directed optimizer either because they’re the easiest kind of AGI to stumble upon, or because they’re copying human cognitive architecture or training methods as a shortcut to trying to invent/discover new ones.
Even if no AI can ever be described as a “goal-directed optimizer”, larger systems composed of humans and AIs can probably be described as such, so they are worth studying from a broader “safety” perspective even if not a narrower “alignment” perspective.
Coherence-based arguments, which I also put some weight on (but perhaps less than others)
I forgot to mention one more argument, namely that something like a goal-directed optimizer is my best guess of what a philosophically and technologically mature, reflectively stable general intelligence will look like, since it’s the only motivational structure we know that looks anywhere close to reflective stability.
I want to be careful not to overstate how close, or to rule out the possibility of discovering some completely different reflectively stable motivational structure in the future, but in our current epistemic state, reflective stability by itself already seems enough to motivate the theoretical study of goal-directed optimizers.
Thanks for sharing! :)
To clarify on my end: I think AI can definitely become an autonomous long-horizon planner, especially if we train it to be that.
That event may or may not have the consequences suggested by existing theory predicated on e.g. single-objective global utility maximizers, which predicts consequences notably different from the predictions of a shard-theoretic model of how agency develops. So I think there are important modeling decisions in ‘literal-minded genie’ vs ‘shard-based generalization’ vs [whatever the truth actually is]… even if each individual axiom sounds reasonable in any given theory. (I wrote this quickly, sorry if it isn’t clear)
Do you not think that a shard-based agent likely eventually turns into something like an EU maximizer (e.g. once all the shards work out a utility function that represents a compromise between their values, or some shard/coalition overpowers others and takes control)? Or how do you see the longer term outcome of shard-based agents? (I asked this question and a couple of others here but none of the main shard-theory proponents engaged with it, perhaps because they didn’t see the comment?)
I do think that a wide range of shard-based mind-structures will equilibrate into EU optimizers, but I also think this is a somewhat mild statement. My stance is that utility functions represent a yardstick by which decisions are made. “Utility was made by the agent, for the agent” as it were—and not “the agent is made to optimize the utility.” What this means is:
Suppose I start off caring about dogs and diamonds in a shard-like fashion, with certain situations making me seek out dogs and care for them (in the usual intuitive way); and similarly for diamonds. However, there will be certain situations in which the dog-shard “interferes with” the diamond-shard, such that the dog-shard e.g. makes me daydream about dogs while doing my work and thereby do worse in life overall. If I didn’t engage in this behavior, then in general I’d probably be able to get more dog-caring and diamond-acquisition. So from the vantage point of this mind and its shards, it is subjectively better to not engage in such “incoherent” behavior which is a strictly dominated strategy in expectation (i.e. leads to fewer dogs and diamonds).
Therefore, given time and sufficient self-modification ability, these shards will want to equilibrate to an algorithm which doesn’t step on its own toes like this.
This doesn’t mean, of course, that these shards decide to implement a utility function whose results are absurd by the lights of the initial decision-making procedure. For example, tiling the universe (half with dog-squiggles, half with diamond-squiggles) would not be a desirable outcome under the initial decision-making process. Insofar as such an outcome could be foreseen as a consequence of making decisions via a proposed utility function, the shards would disprefer that utility function.[1]
So any utility function chosen should “add up to normalcy” when optimized, or at least be different in a way which is not foreseeably weird and bad by the initial shards’ reckoning. On this view, one would derive a utility function as a rule of thumb for how to make decisions effectively and (nearly) Pareto-optimally in relevant scenarios.[2]
(You can perhaps understand why, given this viewpoint, I am unconcerned/weirded out by Yudkowskian sentiments like “Unforeseen optima are extremely problematic given high amounts of optimization power.”)
This elides any practical issues with self-modification, and possible value drift from e.g. external sources, and so on. I think they don’t change the key conclusions here. I think they do change conclusions for other questions though.
Again, if I’m imagining the vantage point of dog+diamond agent, it wouldn’t want to waste tons of compute deriving a policy for weird situations it doesn’t expect to run into. The most important place to become more coherent is the expected on-policy future.
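To put rough numbers on the “strictly dominated strategy” point (a purely illustrative toy of my own; the numbers are made up and not part of shard theory proper):

```python
# Toy expected weekly outcomes (dogs cared for, diamonds acquired) under two
# behaviors the dog+diamond agent could settle into. Numbers are invented.
behaviors = {
    "daydream about dogs during work": (4, 4),    # dog-shard interferes with diamond-getting
    "focus during work, dog time after": (5, 5),  # strictly more of BOTH shard-valued things
}

# The second behavior Pareto-dominates the first: every shard prefers it. So
# dropping the first requires no grand utility function over universe-histories,
# just removing a self-undermining pattern by the agent's own lights.
for name, (dogs, diamonds) in behaviors.items():
    print(f"{name}: dogs={dogs}, diamonds={diamonds}")
```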
What do you think that algorithm will be? Why would it not be some explicit EU-maximization-like algorithm, with a utility function that fully represents both of their values? (At least eventually?) It seems like the best way to guarantee that the two shards will never step on each others’ toes ever again (no need to worry about running into unforeseen situations), and also allows the agent to easily merge with other similar agents in the future (thereby avoiding stepping on even more toes).
(Not saying I know for sure this is inevitable, as there could be all kinds of obstacles to this outcome, but it still seems like our best guess of what advanced AI will eventually look like?)
I agree with this statement, but what about:
Shards just making a mistake and picking a bad utility function. (The individual shards aren’t necessarily very smart and/or rational?)
The utility function is fine for the AI but not for us. (Would the AI shards’ values exactly match our shards, including relative power/influence, and if not, why would their utility function be safe for us?)
Competitive pressures forcing shard-based AIs to become more optimizer-like before they’re ready, or to build other kinds of more competitive but riskier AI, similar to how it’s hard for humans to stop our own AI arms race.
Yes, you’re helping me better understand your perspective, thanks. However as indicated by my questions above, I’m still not sure why you think shard-based AI agents would be safe in general, and in particular (among other risks) why they wouldn’t turn into dangerous goal-directed optimizers at some point.
IMO, the weird/off thing is that the people saying this don’t have sufficient evidence to highlight this specific vibe bundle as being a “real / natural thing that just needs to be properly formalized”, rather than there being no “True Name” for this concept, and it turns out to be just another situationally useful high level abstraction. It’s like someone saying they want to “deconfuse” the concept of a chair.
Or like someone pointing at a specific location on a blank map and confidently declaring that there’s a dragon at that spot, but then admitting that they don’t actually know what exactly a “dragon” is, have never seen one, and only have theoretical / allegorical arguments to support their existence[1]. Don’t worry though, they’ll resolve the current state of confusion by thinking really hard about it and putting together a taxonomy of probable dragon subspecies.
If you push them on this point, they might say that actually humans have some pretty dragon-like features, so it only makes sense that real dragons would exist somewhere in creature space.
Also, dragons are quite powerful, so naturally many types of other creatures would tend to become dragons over time. And given how many creatures there are in the world, it’s inevitable that at least one would become a dragon eventually.
Are you claiming that future powerful AIs won’t be well described as pursuing goals (aka being goal-directed)? This is the read I get from the “dragon” analogy you mention, but this can’t possibly be right because AI agents are already obviously well described as pursuing goals (perhaps rather stupidly). TBC the goals that current AI agents end up pursuing are instructions in natural language, not something more exotic.
(As far I can tell the word “optimizer” in “goal-directed optimizer” is either meaningless or redundant, so I’m ignoring that.)
Perhaps you just mean that future powerful AIs won’t ever be well described as consistently (e.g. across contexts) and effectively pursuing specific goals which they weren’t specifically trained or instructed to pursue?
Or that goal-directed behavior won’t arise emergently prior to humans being totally obsoleted by our AI successors (and possibly not even after that)?
TBC, I agree that some version of “deconfusing goal-directed behavior” is pretty similar to “deconfusing chairs” or “deconfusing consciousness”[1] (you might gain value from doing it, but only because you’ve ended up in a pretty weird epistemic state)
See also “the meta problem of consciousness”
What do you mean by “well described”?
By well described, I mean a central example of how people typically use the word.
E.g., matches most common characteristics in the cluster around the word “goal”.
In the same way as something can be well described as a chair if it has a chair like shape and people use it for sitting.
(Separately, I was confused by the original footnote. Is Alex claiming that deconfusing goal-directedness is a thing that no one has tried to do? (Seems wrong so probably not?) Or that it’s strange to be worried when the argument for worry depends on something so fuzzy that you need to deconfuse it? I think the second one after reading your comment, but I’m still unsure. Not important to respond.)
He means the second one.
Seems true in the extreme (if you have zero idea what something is, how can you reasonably be worried about it), but less strange the further you get from that.
I suspect this should actually be something more like “longer than 3 but less than 10.” (You’re expressing resentment for the party line on AI risk, but “the community” wasn’t always all about that! There used to be a vision of systematic methods for thinking more clearly.)
Long comment, points ordered randomly, skim if you want.
1)
Can you give a few more examples of when the word “optimal” is/isn’t distorting someone’s thinking? People sometimes challenge each other’s usage of that word even when just talking about simple human endeavors like sports, games, diet, finance, etc. but I don’t get the sense that the word is the biggest danger in those domains. (Semi-related, I am reminded of this post.)
2)
When you put it like this, it sounds like the problem runs much deeper than sloppy concepts. When I think my opponents are mindkilled, I see only extreme options available, such as giving up on communicating, or budgeting huge amounts of time & effort to a careful double-crux. What you’re describing starts to feel not too dissimilar from questions like, “How do I talk my parents out of their religion so that they’ll sign up for cryonics?” In most cases it’s either hopeless or a massive undertaking, worthy of multiple sequences all on its own, most of which would not simply be about suggestive names. Not that I expect you to write a whole new sequence in your spare time, but I do wonder if this makes you more interested in erisology and basic rationality.
3)
I myself don’t know anything about the behaviorists except that they allegedly believed that internal mental states did not exist. I certainly don’t want to make that kind of mistake. Can someone bring me up to speed on what exactly they did to the words “behavior”, “response”, and “learning”? Are those words still ruined? Was the damage ever undone?
4)
That reminds me of this passage from EY’s article in Time:
I’m curious if you think this passage is also mistaken, or if it is correctly describing a real problem with current trajectories. EY usually doesn’t bring up consciousness because it is not a crux for him, but I wonder if you think he’s wrong in this recent time that he did bring it up.
I didn’t mean to claim that this “consciousness” insinuation has or is messing up this community’s reasoning about AI alignment, just that the insinuation exists—and to train the skill of spotting possible mistakes before (and not after) they occur.
I do think that “‘expectation’ insinuates inner beliefs” matters, as it helps prop up the misconception of “agents maximize expected reward” (by adding another “supporting detail” to that story).
I don’t think most people can. If you don’t like the connotations of existing terms, I think you need to come up with new terms and they can’t be too verbose or people won’t use them.
One thing that makes these discussions tricky is that the aptness of these names likely depends on your object-level position. If you hold the AI optimist position, then you likely feel these names are biasing people towards an incorrect conclusion. If you hold the AI pessimist position, you likely see many of these connotations as actually a positive, in terms of pointing people towards useful metaphors, even if people occasionally slip up and reify the terms.
Also, have you tried having a moderated conversation with someone who disagrees with you? Sometimes that can help resolve communication barriers.
I suspect that if they can’t ground it out to the word underneath, then there should be … some sort of way to make that concrete as a prediction that their model is drastically more fragile than their words make it sound. If you cannot translate your thinking into math fluently, then your thinking is probably not high enough quality yet, or so? And certainly I propose this test expecting myself to fail it plenty often enough.
@TurnTrout: I’d really, really like to see you have a discussion with someone with a similar level of education about deep learning who disagrees with you about the object level claims. If possible, I’d like it to be Bengio. I think the two of you discussing the mechanics of the problem at hand would yield extremely interesting insights. I expect the best format for it would be a series of emails back and forth, a lesswrong dialogue, or some other compatible asynchronous messaging format without outside observers until the discussion has progressed to a point where both participants feel it is ready to share. Potentially moderation could help, I expect it to be unnecessary.
I’m not saying that people can’t ground it out. I’m saying that if you try to think or communicate using really verbose terms it’ll reduce your available working memory which will limit your ability to think new thoughts.
Yes, I agree that this is an impractical phrase substitution for “optimal.” I meant to be listing “ways you can think about alignment more precisely” and then also “I wish we had better names for actual communication.” Maybe I should have made more explicit note of this earlier in the essay.
EDIT: I now see that you seem to think this is also an impractical thought substitution. I disagree with that, but can’t speak for “most” people.
On the actual object level for the word “optimal”, people already usually say “converged” for that meaning and I think that’s a good choice.
I personally dislike “converged” because it implies that the optimal policy is inevitable. If you reach that policy, then yes you have converged. However, the converse (“if you have not reached an optimal policy, then you have not converged”) is not true in general. Even in the supervised regime (with a stationary data distribution) you can have local minima or zero-determinant saddle points (i.e. flat regions in the loss landscape).
Mathematically, convergence just means that the distance to some limit point goes to 0 in the limit. There’s no implication that the limit point has to be unique, or optimal. Eg. in the case of Newton fractals, there are multiple roots and the trajectory converges to one of the roots, but which one it converges to depends on the starting point of the trajectory. Once the weight updates become small enough, we should say the net has converged, regardless of whether it achieved the “optimal” loss or not.
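To spell out the standard definition being referenced (a sketch in my own notation): “converged” only asserts that the parameter trajectory approaches some limit point,

```latex
\lim_{t \to \infty} \lVert \theta_t - \theta^{*} \rVert = 0 \quad \text{for some } \theta^{*},
```

with no requirement that the limit point be unique across initializations, nor that it minimize the loss.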
If even “converged” is not good enough, I’m not sure what one could say instead. Probably the real problem in such cases is people being doofuses, and probably they will continue being doofuses no matter what word we force them to use.
You raise good points. I agree that the mathematical definition of convergence does not insinuate uniqueness or optimality, thanks for reminding me of that.
Adding to this: You will also have a range of different policies which your model alternates between.
I disagree, and I will take you up on this!
“Optimization” is a real, meaningful thing to fear, because:
We don’t understand human values, or even necessarily meta-understand them.
Therefore, we should be highly open to the idea that a goal (or meta-goal) that we encode (or meta-encode) would be bad for anything powerful to base-level care about.
And most importantly, high optimization power breaks insufficiently-strong security assumptions. That, in itself, is why something like “security mindset” is useful without necessarily thinking of a powerful AI as an “enemy” in war-like terms.
Here “security assumptions” is used in a broad sense, the same way that “writing assumptions” (the ones needed to design a word-processor software) could include seemingly-trivial things like “there is an input device we can access” and “we have the right permissions on this OS”.
I’ll add another one to the list: “Human-level knowledge/human simulator”
Max Nadeau helped clarify some ways in which this framing introduced biases into my and others’ models of ELK and scalable oversight. Knowledge is hard to define and our labels/supervision might be tamperable in ways that are not intuitively related to human difficulty.
Different measurements of human difficulty only correlate at about 0.05 to 0.3, suggesting that human difficulty might not be a very meaningful concept for AI oversight, or that our current datasets for experimenting with scalable oversight don’t contain large enough gaps in difficulty to make meaningful measurements.
I should note that all of these suggestive names came from the mainstream AI academic/industry community. The LW failure was mostly uncritical acceptance of them, not invention.
I broadly agree that a lot of discussion about AI x-risk is confused due to the use of suggestive terms. Of the ones you’ve listed, I would nominate “optimizer”, “mesa optimization”, “(LLMs as) simulators”, “(LLMs as) agents”, and “utility” as probably the most problematic. I would also add “deception/deceptive alignment”, “subagent”, “shard”, “myopic”, and “goal”. (It’s not a coincidence that so many of these terms seem to be related to notions of agency or subcomponents of agents; this seems to be the main place where sloppy reasoning can slide in.)
I also agree that I’ve encountered a lot of people who confidently predict Doom on the basis of subtle word games.
However, I also agree with Ryan’s comment that these confusions seem much less common when we get to actual senior AIS researchers or people who’ve worked significantly with real models. (My guess is that Alex would disagree with me on this.) Most conversations I’ve been in that used these confused terms tended to involve MATS fellows or other very junior people (I don’t interact with other more junior people much, unfortunately, so I’m not sure.) I’ve also had several conversations with people who seemed relieved at how reasonable and not confused the relevant researchers have been (e.g. with Alexander Gietelink-Oldenziel).
I suspect that a lot of the confusions stem from the way that the majority of recruitment/community building is conducted—namely, by very junior people recruiting even more junior people (e.g. via student groups). Not only is there only a very limited amount of communication bandwidth available to communicate with potential new recruits (and therefore encouraging more arguments by analogy or via suggestive words), the people doing the communication are also likely to lean on a lot of these suggestive terms (in large part because they’re very junior, and likely not technical researchers).[1] There’s also historical reasons why this is the case: a lot of early EA/AIS people were philosophers, and so presented detailed philosophical arguments (often routing through longtermism) about specific AI doom scenarios that in turn suffered lossy compression during communication, as opposed to more robust general arguments (e.g. Ryan Greenblatt’s example of “holy shit AI (and maybe the singularity), that might be a really big deal”).[2]
Similarly, on LessWrong, I suspect that the majority of commenters are not people who have deeply engaged with a lot of the academic ML literature or have spent significant time doing AIS or even technical ML work.
And I’d also point a finger at a lot of the communication from MIRI in particular as the cause of these confusions, e.g. the “sharp left-turn” concept seems to be primarily communicated via metaphor and cryptic sayings, while their communications about Reward Learning and Human Values seem in retrospect to have been at least misleading if not fundamentally confused. I suspect that the relevant people involved have much better models, but I think this did not come through in their communication.
I’m not super sure what to do about it; the problem of suggestive names (or in general, of smuggling connotations into technical work) is not a unique one to this community, nor is it one that can be fixed with reading a single article or two (as your post emphasizes). I’d even argue this community does better than a large fraction of academics (even ML academics).
John mentioned using specific, concrete examples as a way to check your concepts. If we’re quoting old rationalist foundation texts, then there’s a relevant example from “Surely You’re Joking, Mr. Feynman”:
Unfortunately, in my experience, general instructions of the form “create concrete examples when listening to a chain of reasoning involving suggestive terms” do not seem to work very well, even if examples of doing so are provided, so I’m not sure there’s a scalable solution here.
My preferred approach is to give the reader concrete examples to chew on as early as possible, but this runs into the failure mode of contingent facts about the example being taken as a general point (or even worse, the failure mode where the reader assumes that the concrete case is the general point being made). I’d consider mathematical equations (even if they are only toy examples) to be helpful as well, assuming you strip away the suggestive terms and focus only on the syntax/semantics. But I find that I also have a lot of difficulty getting other people to create examples I’d actually consider concrete. Frustratingly, many “concrete” examples I see smuggle in even more suggestive terms or connotations, and sometimes even fail to capture any of the semantics of the original idea.
So in the end, maybe I have nothing better than to repeat Alex’s advice at the end of the post:
At the end of the day, while saying “just be better” does not serve as actionable advice, there might not be an easier answer.
To be clear, I think that many student organizers and community builders in general do excellent work that is often incredibly underappreciated. I’m making a specific claim about the immediate causal reasons for why this is happening, and not assigning fault. I don’t see an easy way for community builders to do better, short of abandoning specialization and requiring everyone to be a generalist who also does technical AIS work.
That being said, I think that it’s worth trying to make detailed arguments concretizing general concerns, in large part to make sure that the case for AI x-risk doesn’t “come down to a set of subtle word games”. (E.g. I like Ajeya’s doom story.) After all, it’s worth concretizing a general concern, and making sure that any concrete instantiations of the concern are possible. I just think that detailed arguments (where the details matter) often get compressed in ways that end up depending on suggestive names, especially in cases with limited communication bandwidth.
This post’s ending seems really overdramatic.