Steven Byrnes

Karma: 21,621

I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.

Steven Byrnes Apr 16, 2025, 12:54 PM
2 points
0
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
Learning from strategies that stood the test of time would be tradition moreso than intelligence. I think tradition requires intelligence, but it also requires something else that’s less clear (and possibly not simple enough to be assembled manually, idk).
Right, that’s what I was gonna say. You need intelligence to sort out which traditions should be copied and which ones shouldn’t. There was a 13-billion-year “tradition” of not building e-commerce megastores, but Jeff Bezos ignored that “tradition”, and it worked out very well for him (and I’m happy about it too). Likewise, the Wright Brothers explicitly followed the “tradition” of how birds soar, but not the “tradition” of how birds flap their wings.
I do think there’s a “something else” (most [but not all] humans have an innate drive to follow and enforce social norms, more or less), but I don’t think it’s necessary. The Wright Brothers didn’t have any innate drive to copy anything about bird soaring tradition, but they did it anyway purely by intelligence.
Random street names aren’t necessarily important though?
I feel like I’ve lost the plot here. If you think there are things that are very important, but rare in the training data, and that LLMs consequently fail to learn, can you give an example?
Often the rare important things are very well known (after all, they are important, so people put a lot of effort into knowing them), they just can’t efficiently be derived from empirical data (except essentially by copying someone else’s conclusion blindly, and that leaves you vulnerable to deception).
I guess you’re using “empirical data” in a narrow sense. If Joe tells me X, I have gained “empirical data” that Joe told me X. And then I can apply my intelligence to interpret that “data”. For example, I can consider a number of hypotheses: the hypothesis that Joe is correct and honest, that Joe is mistaken but honest, that Joe is trying to deceive me, that Joe said Y but I misheard him, etc. And then I can gather or recall additional evidence that favors one of those hypotheses over another. I could ask Joe to repeat himself, to address the “I misheard him” hypothesis. I could consider how often I have found Joe to be mistaken about similar things in the past. I could ask myself whether Joe would benefit from deceiving me. Etc.
This is all the same process that I might apply to other kinds of “empirical data” like if my car was making a funny sound. I.e., consider possible generative hypotheses that would match the data, then try to narrow down via additional observations, and/or remain uncertain and prepare for multiple possibilities when I can’t figure it out. This is a middle road between “trusting people blindly” versus “ignoring everything that anyone tells you”, and it’s what reasonable people actually do. Doing that is just intelligence, not any particular innate human tendency—smart autistic people and smart allistic people and smart callous sociopaths etc. are all equally capable of traveling this middle road, i.e. applying intelligence towards the problem of learning things from what other people say.
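The hypothesis-weighing process described above can be sketched as a toy Bayesian update. This is only an illustration: the hypothesis names, priors, and likelihoods below are made-up numbers, not anything measured or claimed in the discussion.

```python
# Toy Bayesian update over explanations for "Joe told me X".
# All priors and likelihoods are made-up illustrative numbers.

def normalize(weights):
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def update(prior, likelihood):
    """Multiply each prior by the likelihood of the new observation, then renormalize."""
    return normalize({h: prior[h] * likelihood[h] for h in prior})

prior = normalize({
    "correct_and_honest": 0.70,
    "mistaken_but_honest": 0.15,
    "deceiving_me": 0.05,
    "i_misheard": 0.10,
})

# New observation: I ask Joe to repeat himself, and he says X again.
# That is unlikely if I had simply misheard him the first time.
joe_repeats_x = {
    "correct_and_honest": 0.95,
    "mistaken_but_honest": 0.95,
    "deceiving_me": 0.95,
    "i_misheard": 0.05,
}

posterior = update(prior, joe_repeats_x)
for hypothesis, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis:22s} {p:.3f}")
```

Asking Joe to repeat himself only discriminates against the "I misheard" hypothesis; separating "honest" from "deceiving" would need a different observation, such as Joe's track record or his incentives.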
(For example, if I was having this conversation with almost anyone else, I would have quit, or not participated in the first place. But I happen to have prior knowledge that you-in-particular have unusual and well-thought-through ideas, and even when they’re wrong, they’re often wrong in very unusual and interesting ways, and that you don’t tend to troll, etc.)
I feel like I’m misunderstanding you somehow. You keep saying things that (to me) seem like you could equally well argue that humans cannot possibly survive in the modern world, but here we are. Do you have some positive theory of how humans survive and thrive in (and indeed create) historically-unprecedented heterogeneous environments?

Steven Byrnes Apr 16, 2025, 11:38 AM
2 points
0
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
If your model is underparameterized (which I think is true for the typical model?), then it can’t learn any patterns that only occur once in the data. And even if the model is overparameterized, it still can’t learn any pattern that never occurs in the data.
Dunno if anything’s changed since 2023, but this says LLMs learn things they’ve seen exactly once in the data.
I can vouch that you can ask LLMs about things that are extraordinarily rare in the training data—I’d assume well under once per billion tokens—and they do pretty well. E.g. they know lots of random street names.
Humans successfully went to the moon, despite it being a quite different environment that they had never been in before. And they didn’t do that with “durability, strength, healing, intuition, tradition”, but rather with intelligence.
Speaking of which, one can apply intelligence towards the problem of being resilient to unknown unknowns, and one would come up with ideas like durability, healing, learning from strategies that have stood the test of time (when available), margins of error, backup systems, etc.

Steven Byrnes Apr 15, 2025, 11:08 PM
4 points
2
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
I think you’re conflating consequentialism and understanding in a weird-to-me way. (Or maybe I’m misunderstanding.)
I think consequentialism is related to choosing one action versus another action. I think understanding (e.g. predicting the consequence of an action) is different, and that in practice understanding has to involve self-supervised learning.
(I think human brains have both [partly-] consequentialist decisions and self-supervised updating of the world-model.) (They’re not totally independent, but rather they interact via training data: e.g. [partly-] consequentialist decision-making determines how you move your eyes, and then whatever your eyes are pointing at, your model of the visual world will then update by self-supervised learning on that particular data. But still, these are two systems that interact, not the same thing.)
I think self-supervised learning is perfectly capable of discovering rare but important patterns. Just look at today’s foundation models, which seem pretty great at that.

Steven Byrnes Apr 15, 2025, 5:34 PM
LW: 2 AF: 2
0
AF
on: Evaluating the historical value misspecification argument
I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:
1. An RLHF’d foundation model acts as the judge / utility function; and some separate system comes up with plans that optimize it—a.k.a. “you just need to build a function maximizer that allows you to robustly maximize the utility function that you’ve specified”.
  1. I think this plan fails because RLHF’d foundation models have adversarial examples today, and will continue to have adversarial examples into the future. (To be clear, humans have adversarial examples too, e.g. drugs & brainwashing.)
2. There is no “separate system”, but rather an RLHF’d foundation model (or something like it) is the whole system that we’re talking about here. For example, we may note that, if you hook up an RLHF’d foundation model to tools and actuators, then it will actually use those tools and actuators in accordance with common-sense morality etc.
(I think 2 is the main intuition driving the OP, and 1 was a comments-section derailment.)
As for 2:
- I’m sympathetic to the argument that this system might not be dangerous, but I think its load-bearing ingredient is that pretraining leads to foundation models tending to do intelligent things primarily by emitting human-like outputs for human-like reasons, thanks in large part to self-supervised pretraining on internet text. Let’s call that “imitative learning”.
- Indeed, here’s a 2018 post where Eliezer (as I read it) implies that he hadn’t really been thinking about imitative learning before (he calls it an “interesting-to-me idea”), and suggests that imitative learning might “bypass the usual dooms of reinforcement learning”. So I think there is a real update here—if you believe that imitative learning can scale to real AGI.
- …But the pushback (from Rob and others) in the comments is mostly coming from a mindset where they don’t believe that imitative learning can scale to real AGI. I think the commenters are failing to articulate this mindset well, but I think they are in various places leaning on certain intuitions about how future AGI will work (e.g. beliefs deeply separate from goals), and these intuitions are incompatible with imitative learning being the primary source of optimization power in the AGI (as it is today, I claim).
- (A moderate position is that imitative learning will be less and less relevant as e.g. o1-style RL post-training becomes a bigger relative contribution to the trained weights; this would presumably lead to increased future capabilities hand-in-hand with increased future risk of egregious scheming. For my part, I subscribe to the more radical theory that our situation is even worse than that: I think future powerful AGI will be built via a different AI training paradigm that basically throws out the imitative learning part altogether.)
- I have a forthcoming post (hopefully) that will discuss this much more and better.

Steven Byrnes Apr 15, 2025, 1:21 PM
8 points
0
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
(IMO this is kinda unrelated to the OP, but I want to continue this thread.)
Have you elaborated on this anywhere?
Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” ;-)
I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?

Steven Byrnes Apr 13, 2025, 11:21 PM
10 points
4
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
For context, my lower effort posts are usually more popular.
mood

Steven Byrnes Apr 13, 2025, 11:18 PM
26 points
4
in reply to: Richard_Ngo’s comment on: ricraz’s Shortform
In run-and-tumble motion, “things are going well” implies “keep going”, whereas “things are going badly” implies “choose a new direction at random”. Very different! And I suggest in §1.3 here that there’s an unbroken line of descent from the run-and-tumble signal in our worm-like common ancestor with C. elegans, to the “valence” signal that makes things seem good or bad in our human minds. (Suggestively, both run-and-tumble in C. elegans, and the human valence, are dopamine signals!)
So if some idea pops into your head, “maybe I’ll stand up”, and it seems appealing, then you immediately stand up (the human “run”); if it seems unappealing on net, then that thought goes away and you start thinking about something else instead, semi-randomly (the human “tumble”).
So positive and negative are deeply different. Of course, we should still call this an RL algorithm. It’s just that it’s an RL algorithm that involves a (possibly time- and situation-dependent) heuristic estimator of the expected value of a new random plan (a.k.a. the expected reward if you randomly tumble). If you’re way above that expected value, then keep doing whatever you’re doing; if you’re way below the threshold, re-roll for a new random plan.
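A minimal sketch of that keep-or-re-roll rule, under simplifying assumptions: plans are the integers 0 to 9, the reward is a hypothetical `plan / 9`, and the baseline is a fixed number rather than a time- and situation-dependent estimate. The function and parameter names are invented for illustration.

```python
import random

def run_and_tumble(reward_of, random_plan, baseline, steps, rng):
    """Keep the current plan while it meets the baseline; otherwise re-roll."""
    plan = random_plan(rng)
    history = []
    for _ in range(steps):
        if reward_of(plan) < baseline:
            plan = random_plan(rng)  # "tumble": choose a new plan at random
        # otherwise "run": keep doing whatever we are doing
        history.append(plan)
    return history

# Hypothetical toy problem: plans are integers 0..9, reward is plan / 9.
def reward_of(plan):
    return plan / 9

def random_plan(rng):
    return rng.randrange(10)

# The baseline stands in for the expected reward of a fresh random plan,
# which is 0.5 for this toy reward.
history = run_and_tumble(reward_of, random_plan, baseline=0.5,
                         steps=50, rng=random.Random(0))
print(history)
```

The asymmetry shows up directly in the loop: a good plan is simply continued, while a bad plan triggers a random jump rather than a step in any particular direction.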
As one example of how this ancient basic distinction feeds into more everyday practical asymmetries between positive and negative motivations, see my discussion of motivated reasoning here, including in §3.3.3 the fact that “it generally feels easy and natural to brainstorm / figure out how something might happen, when you want it to happen. Conversely, it generally feels hard and unnatural to figure out how something might happen, when you want it to not happen.”

Steven Byrnes Apr 12, 2025, 7:00 PM
9 points
0
on: What is autism?
I kinda think of the main clusters of symptoms as: (1) sensory sensitivity, (2) social symptoms, (3) different “learning algorithm hyperparameters”.
More specifically, (1) says: innate sensory reactions (e.g. startle reflex, orienting reflex) are so strong that they’re often overwhelming. (2) says: innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. (3) includes atypical patterns of learning & memory including the gestalt pattern of childhood language acquisition which is common but not universal among autistic kids.
People respond to (1) in various ways, including cutting off the scratchy tags at the back of their shirts, squeeze machines, weighted blankets, etc., plus maybe stimming (although I’m not sure if that’s the right explanation for stimming).
People respond to (2) by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy to avoid triggering it, that they’ve used since early childhood, because the reaction is so unpleasant. Of course, they’ll still understand intellectually perfectly well that the person is angry. As one consequence of that, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.
Anyway, it seems intuitively sensible that a single underlying cause, namely something like “trigger-happy neurons” (see discussion of the valproic acid model here), often leads to all three of the (1-3) symptom clusters, along with the other common symptoms like proneness-to-seizures and 10-minute screaming tantrums. At the same time, I think people can get subsets of those clusters of symptoms for various different underlying reasons. For example, one of my kids is a late talker with very strong (3), but little-if-any (1-2). He has an autism diagnosis. (I’m pretty sure he wouldn’t have gotten one 20 years ago.) My other kid has strong nerdy autistic-like “special interests”—and I expect him to wind up as an adult who (like me) has many autistic friends—but I think he’s winding up with those behaviors from a rather different root cause.
Much more at my old post Intense World Theory of Autism.
I’m also interested in book recommendations or recommendations for other resources where I can learn more.
I thought NeuroTribes was really great, that’s my one recommendation if I had to pick one. If I had to pick three, I would also throw in the two John Elder Robison books I read. In Look Me in the Eye, he talks about growing up with (what used to be called) Asperger’s; even more interestingly, in Switched On he describes his experience with Transcranial Magnetic Stimulation, which led (in my interpretation) to his reintroduction to those innate social reactions that (as mentioned above) he had learned at a very young age to generally avoid triggering via unconscious attention-control coping strategies, since the reactions were overwhelming and unpleasant.

Steven Byrnes Apr 11, 2025, 8:09 PM
3 points
0
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯
the value function doesn’t understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate
Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a thought, and also a plan, and it gets positive vibes from the fact that “eating candy” is part of the thought, but it also gets negative vibes from the fact that “standing up” is part of the thought (assume that I’m feeling very tired right now). You add those up into the final value / valence, which might or might not be positive, and accordingly you might or might not actually get the candy. (And if not, some random new thought will pop into your head instead.)
Why does the value function assign positive vibes to eating-candy? Why does it assign negative vibes to standing-up-while-tired? Because of the past history of primary rewards via (something like) TD learning, which updates the value function.
Does the value function “understand the content”? No, the value function is a linear functional on the content of a thought. Linear functionals don’t understand things. :)
(I feel like maybe you’re going wrong by thinking of the value function and Thought Generator as intelligent agents rather than “machines that are components of a larger machine”?? Sorry if that’s uncharitable.)
[the value function] only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be “when there’s a thought that the world model thinks is valid and it is associated to the (self-model-invoking) thought “this is bad for accomplishing my goals”, then it lowers its value estimate.
The value function is a linear(ish) functional whose input is a thought. A thought is an object in some high-dimensional space, related to the presence or absence of all the different concepts comprising it. Some concepts are real-world things like “candy”, other concepts are metacognitive, and still other concepts are self-reflective. When a metacognitive and/or self-reflective concept is active in a thought, the value function will correspondingly assign extra positive or negative vibes, just like if any other kind of concept is active. And those vibes depend on the correlations of those concepts with past rewards via (something like) TD learning.
So “I will fail at my goals” would be a kind of thought, and TD learning would gradually adjust the value function such that this thought has negative valence. And this thought can co-occur with or be a subset of other thoughts that involve failing at goals, because the Thought Generator is a machine that learns these kinds of correlations and implications, thanks to a different learning algorithm that sculpts it into an ever-more-accurate predictive world-model.
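Here is a rough sketch of the "big table of vibes" picture, assuming a thought is just a set of concept labels and the update is plain TD(0) with a linear value function. The class name, learning rate, and reward numbers are all invented for illustration; the real thing is only claimed to be "something like" this.

```python
from collections import defaultdict

class LinearValueFunction:
    """Valence of a thought = sum of learned vibes of its active concepts."""

    def __init__(self, learning_rate=0.1):
        self.vibes = defaultdict(float)  # concept -> learned weight
        self.learning_rate = learning_rate

    def valence(self, thought):
        return sum(self.vibes[concept] for concept in thought)

    def td_update(self, thought, reward, next_thought=None, discount=0.9):
        """TD(0)-style update: nudge every active concept toward the target."""
        bootstrap = self.valence(next_thought) if next_thought else 0.0
        error = reward + discount * bootstrap - self.valence(thought)
        for concept in thought:
            self.vibes[concept] += self.learning_rate * error
        return error

vf = LinearValueFunction()
for _ in range(200):
    # Made-up reward history: candy was pleasant, standing up while tired was not.
    vf.td_update({"eating candy"}, reward=1.0)
    vf.td_update({"standing up", "tired"}, reward=-0.4)

# A plan is just another thought; its valence is the sum over its pieces.
plan = {"eating candy", "standing up", "tired"}
print(vf.valence(plan))
```

Nothing in this code "understands" candy or tiredness. A self-reflective or metacognitive concept would just be one more key in the same table, updated by the same rule.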

Steven Byrnes Apr 11, 2025, 5:50 PM
14 points
11
in reply to: Cole Wyeth’s comment on: Reactions to METR task length paper are insane
I think the interest rate thing provides so little evidence either way that it’s misleading to even mention it. See the EAF comments on that post, and also Zvi’s rebuttal. (Most of that pushback also generalizes to your comment about the S&P.) (For context, I agree that AGI in ≤2030 is unlikely.)

Steven Byrnes Apr 9, 2025, 2:56 AM
LW: 7 AF: 4
0
AF
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.
Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).
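As a very small sketch of that last step (the function and argument names are hypothetical, and real ground-truth signals would not arrive as a tidy optional number):

```python
from typing import Optional

def brainstem_valence(valence_guess: float,
                      ground_truth: Optional[float] = None) -> float:
    """Declare the actual valence for the current thought."""
    if ground_truth is None:
        return valence_guess  # "defer-to-predictor mode": copy the guess
    return ground_truth       # override, e.g. I just stubbed my toe

print(brainstem_valence(0.3))                     # no other signal: defer
print(brainstem_valence(0.3, ground_truth=-1.0))  # ground truth overrides
```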
Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a different thought from “the feel of the pillow on my head”. The former is self-reflective—it has me in the frame—the latter is not (let’s assume).
All thoughts can be positive or negative valence (motivating or demotivating). So self-reflective thoughts can be positive or negative valence, and non-self-reflective thoughts can also be positive or negative valence. Doesn’t matter, it’s always the same machinery, the same value function / valence guess / thought assessor. That one function can evaluate both self-reflective and non-self-reflective thoughts, just as it can evaluate both sweater-related thoughts and cloud-related thoughts.
When something seems good (positive valence) in a self-reflective frame, that’s called ego-syntonic, and when something seems bad in a self-reflective frame, that’s called ego-dystonic.
Now let’s go through what you wrote:
1. humans have a self-model which can essentially have values different from the main value function
I would translate that into: “it’s possible for something to seem good (positive valence) in a self-reflective frame, but seem bad in a non-self-reflective frame. Or vice-versa.” After all, those are two different thoughts, so yeah of course they can have two different valences.
2. the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates
I would translate that into: “there’s a decent amount of coherence / self-consistency in the set of thoughts that seem good or bad in a self-reflective frame, and there’s less coherence / self-consistency in the set of things that seem good or bad in a non-self-reflective frame”.
(And there’s a logical reason for that; namely, that hard thinking and brainstorming tends to bring self-reflective thoughts to mind — §8.5.5 — and hard thinking and brainstorming is involved in reducing inconsistency between different desires.)
3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
This one is more foreign to me. A self-reflective thought can have positive or negative valence for the same reasons that any other thought can have positive or negative valence—because of immediate rewards, and because of the past history of rewards, via TD learning, etc.
One thing is: someone can develop a learned metacognitive habit to the effect of “think self-reflective thoughts more often” (which is kinda synonymous with “don’t be so impulsive”). They would learn this habit exactly to the extent and in the circumstances that it has led to higher reward / positive valence in the past.
4. Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it’s best to just trust the self-model and that this will likely lead to reward.
If someone gets in the habit of “think self-reflective thoughts all the time” a.k.a. “don’t be so impulsive”, then their behavior will be especially strongly determined by which self-reflective thoughts are positive or negative valence.
But “which self-reflective thoughts are positive or negative valence” is still determined by the value function / valence guess function / thought assessor in conjunction with ground-truth rewards / actual valence—which in turn involves the reward function, and the past history of rewards, and TD learning, blah blah. Same as any other kind of thought.
…I won’t keep going with your other points, because it’s more of the same idea.
Does that help explain where I’m coming from?

Steven Byrnes Apr 8, 2025, 2:35 AM
4 points
2
in reply to: gwern’s comment on: Auditing language models for hidden objectives
I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no.
(I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)

Steven Byrnes Apr 5, 2025, 9:21 PM
10 points
8
on: Prediction Markets Are Mediocre
I downvoted because the conclusion “prediction markets are mediocre” does not follow from the premise “here is one example of one problem that I imagine abundant legal well-capitalized prediction markets would not have completely solved (even though I acknowledge that they would have helped move things in the right direction on the margin)”.

Steven Byrnes Apr 4, 2025, 12:45 PM
7 points
2
in reply to: faul_sname’s comment on: AI 2027: What Superintelligence Looks Like
That excerpt says “compute-efficient” but the rest of your comment switches to “sample efficient”, which is not synonymous, right? Am I missing some context?

Steven Byrnes Apr 4, 2025, 12:40 PM
12 points
2
in reply to: OVERmind’s comment on: AI 2027: What Superintelligence Looks Like
Pretty sure “DeepCent” is a blend of DeepSeek & Tencent—they have a footnote: “We consider DeepSeek, Tencent, Alibaba, and others to have strong AGI projects in China. To avoid singling out a specific one, our scenario will follow a fictional “DeepCent.””. And I think the “brain” in OpenBrain is supposed to be reminiscent of the “mind” in DeepMind.
ETA: Scott Alexander tweets with more backstory on how they settled on “OpenBrain”: “You wouldn’t believe how much work went into that stupid name…”

Steven Byrnes Apr 3, 2025, 5:34 PM
LW: 2 AF: 2
0
AF
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle.
OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is that the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its sentence? It’s kinda unknowable in the absence of what its later words will be.
So one thing you can do is say that the AI bumbles around and takes reversible actions, rolling them back whenever the oracle says no. And the oracle is so good that we get CEV that way. This is a coherent thought experiment, and it does indeed make inner alignment unnecessary—but only because we’ve removed all the intelligence from the so-called AI! The AI is no longer making plans, so the plans don’t need to be accurately evaluated for their goodness (which is where inner alignment problems happen).
Alternately, we could flesh out the thought experiment by saying that the AI does have a lot of intelligence and planning, and that the oracle is doing the best it can to anticipate the AI’s behavior (without reading the AI’s mind). In that case, we do have to worry about the AI having bad motivation, and tricking the oracle by doing innocuous-seeming things until it suddenly deletes the oracle subroutine out of the blue (treacherous turn). So in that version, the AI’s inner alignment is still important. (Unless we just declare that the AI’s alignment is unnecessary in the first place, because we’re going to prevent treacherous turns via option control.)
However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it’s just about deception), and I think it’s not:
Yeah I mostly think this part of your comment is listing reasons that inner alignment might fail, a.k.a. reasons that goal misgeneralization / malgeneralization can happen. (Which is a fine thing to do!)
If someone thinks inner misalignment is synonymous with deception, then they’re confused. I’m not sure how such a person would have gotten that impression. If it’s a very common confusion, then that’s news to me.
Inner misalignment can lead to deception. But outer misalignment can lead to deception too! Any misalignment can lead to deception, regardless of whether the source of that misalignment was “outer” or “inner” or “both” or “neither”.
“Deception” is deliberate by definition—otherwise we would call it by another term, like “mistake”. That’s why it has to happen after there are misaligned motivations, right?
Overall, I think the outer-vs-inner framing has some implicit connotation that for inner alignment we just need to make it internalize the ground-truth reward
OK, so I guess I’ll put you down as a vote for the terminology “goal misgeneralization” (or “goal malgeneralization”), rather than “inner misalignment”, as you presumably find that the former makes it more immediately obvious what the concern is. Is that fair? Thanks.
I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution.
You could avoid talking about utility functions by saying “the learned value function just predicts reward”, and that may work while you’re staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you’re going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.
I think I fully agree with this in spirit but not in terminology!
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
I also wouldn’t say “the learned value function just predicts reward”. The learned value function starts randomly initialized, and then it’s updated by TD learning or whatever, and then it eventually winds up with some set of weights at some particular moment, which can take inputs and produce outputs. That’s the system. We can put a comment in the code that says the value function is “supposed to” predict reward, and of course that code comment will be helpful for illuminating why the TD learning update code is structured the way it is, etc. But that “supposed to” is just a code comment, not the code itself. Will it in fact predict reward? That’s a complicated question about algorithms. In distribution, it will probably predict reward pretty accurately; out of distribution, it probably won’t; but with various caveats on both sides.
And then if we ask questions like “what is the AI trying to do right now” or “what does the AI desire”, the answer would mainly depend on the value function.
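A toy illustration of the in-distribution versus out-of-distribution point, with two hand-written stand-ins for "what the learned value function might have ended up as". Both functions and the training inputs are invented; real learned value functions would not be this clean.

```python
# Ground-truth reward is only ever observed on the training inputs.
train_inputs = [0.0, 0.25, 0.5, 0.75, 1.0]

def reward(x):
    return x

# Two candidate learned value functions. Both match the reward on every
# training input, but they extrapolate differently beyond x = 1.
def value_a(x):
    return x

def value_b(x):
    return x if x <= 1.0 else 2.0 - x

in_distribution_gap = max(abs(value_a(x) - value_b(x)) for x in train_inputs)
ood_input = 3.0  # a situation unlike anything seen in training
ood_gap = abs(value_a(ood_input) - value_b(ood_input))

print("max disagreement on training inputs:", in_distribution_gap)
print("disagreement at the OOD input:", ood_gap)
```

The training signal cannot tell these two apart, so which one you actually get depends on details of the learning algorithm rather than on any comment saying what the function is "supposed to" compute.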
Actually, it may be useful to distinguish two kinds of this “utility vs reward mismatch”:
1. Utility/reward being insufficiently defined outside of training distribution (e.g. for what programs to run on computronium).
2. What things in the causal chain producing the reward are the things you actually care about? E.g. that the reward button is pressed, that the human thinks you did something well, that you did something according to some proxy preferences.
I’ve been lumping those together under the heading of “ambiguity in the reward signal”.
The second one would include e.g. ambiguity between “reward for button being pressed” vs “reward for human pressing the button” etc.
The first one would include e.g. ambiguity between “reward for being-helpful-variant-1” vs “reward for being-helpful-variant-2”, where the two variants are indistinguishable in-distribution but have wildly different opinions about OOD options like brainwashing or mind-uploading.
Another way to think about it: the causal chain intuition is also an OOD issue, because it only becomes a problem when the causal chains are always intact in-distribution but they can come apart in new ways OOD.
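As a toy illustration of “indistinguishable in-distribution, divergent OOD”, here is a sketch with two made-up reward variants. Everything in it (the two features, the function names, the example situations) is a hypothetical stand-in, not a claim about any real reward function.

```python
# Each situation is (user_says_satisfied, user_was_brainwashed). Hypothetical features.
in_distribution = [(True, False), (False, False)]  # brainwashing never occurs in training
out_of_distribution = [(True, True)]               # a brainwashed user reports satisfaction

def helpful_variant_1(user_says_satisfied, user_was_brainwashed):
    # Rewards reported satisfaction, however it came about.
    return 1.0 if user_says_satisfied else 0.0

def helpful_variant_2(user_says_satisfied, user_was_brainwashed):
    # Rewards reported satisfaction only when it was not produced by brainwashing.
    return 1.0 if (user_says_satisfied and not user_was_brainwashed) else 0.0

agree_in_distribution = all(
    helpful_variant_1(*s) == helpful_variant_2(*s) for s in in_distribution
)
disagree_ood = any(
    helpful_variant_1(*s) != helpful_variant_2(*s) for s in out_of_distribution
)
```

No amount of in-distribution reward data can tell the two variants apart here, which is the sense in which the reward signal is ambiguous.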

Steven Byrnes Apr 3, 2025, 1:36 PM
LW: 5 AF: 4
0
AF
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
Thanks! But I don’t think that’s a likely failure mode. I wrote about this long ago in the intro to Thoughts on safety in predictive learning.
In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us. I consider that the obvious, central type of alignment failure mode for model-based RL AGI, and it remains an unsolved problem.
I think (??) you’re bringing up a different and more exotic failure mode where the world-model by itself is secretly harboring a full-fledged planning agent. I think this is unlikely to happen. One way to think about it is: if the world-model is specifically designed by the programmers to be a world-model in the context of an explicit model-based RL framework, then it will probably be designed in such a way that it’s an effective search over plausible world-models, but not an effective search over a much wider space of arbitrary computer programs that includes self-contained planning agents. See also §3 here for why a search over arbitrary computer programs would be a spectacularly inefficient way to build all that agent stuff (TD learning in the critic, roll-outs in the planner, replay, whatever) compared to what the programmers will have already explicitly built into the RL agent architecture.
So I think this kind of thing (the world-model by itself spawning a full-fledged planning agent capable of treacherous turns etc.) is unlikely to happen in the first place. And even if it happens, I think the problem is easily mitigated; see discussion in Thoughts on safety in predictive learning. (Or sorry if I’m misunderstanding.)

Steven Byrnes Apr 3, 2025, 1:14 PM
LW: 3 AF: 3
0
AF
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
Thanks!
I think “inner alignment” and “outer alignment” (as I’m using the terms) are a “natural breakdown” of alignment failures in the special case of model-based actor-critic RL AGI with a “behaviorist” reward function (i.e., reward that depends on the AI’s outputs, as opposed to what the AI is thinking about). As I wrote here:
Suppose there’s an intelligent designer (say, a human programmer), and they make a reward function R hoping that they will wind up with a trained AGI that’s trying to do X (where X is some idea in the programmer’s head), but they fail and the AGI is trying to do not-X instead. If R only depends on the AGI’s external behavior (as is often the case in RL these days), then we can imagine two ways that this failure happened:
1. The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)
2. The AGI was doing the right thing for the wrong reasons but got rewarded anyway (or doing the wrong thing for the right reasons but got punished).
I think it’s useful to catalog possible failures based on whether they involve (1) or (2), and I think it’s reasonable to call them “failures of outer alignment” and “failures of inner alignment” respectively, and I think when (1) is happening rarely or not at all, we can say that the reward function is doing a good job at “representing” the designer’s intention—or at any rate, it’s doing as well as we can possibly hope for from a reward function of that form. The AGI still might fail to acquire the right motivation, and there might be things we can do to help (e.g. change the training environment), but replacing R (which fires exactly to the extent that the AGI’s external behavior involves doing X) by a different external-behavior-based reward function R’ (which sometimes fires when the AGI is doing not-X, and/or sometimes doesn’t fire when the AGI is doing X) seems like it would only make things worse. So in that sense, it seems useful to talk about outer misalignment, a.k.a. situations where the reward function is failing to “represent” the AGI designer’s desired external behavior, and to treat those situations as generally bad.
(A bit more related discussion here.)
That definitely does not mean that we should be going for a solution to outer alignment and a separate unrelated solution to inner alignment, as I discussed briefly in §10.6 of that post, and TurnTrout discussed at greater length in Inner and outer alignment decompose one hard problem into two extremely hard problems. (I endorse his title, but I forget whether I 100% agreed with all the content he wrote.)
I find your comment confusing, I’m pretty sure you misunderstood me, and I’m trying to pin down how …
One thing is, I’m thinking that the AGI code will be an RL agent, vaguely in the same category as MuZero or AlphaZero or whatever, which has an obvious part of its source code labeled “reward”. For example, AlphaZero-chess has a reward of +1 for getting checkmate, −1 for getting checkmated, 0 for a draw. Atari-playing RL agents often use the in-game score as a reward function. Etc. These are explicitly parts of the code, so it’s very obvious and uncontroversial what the reward is (leaving aside self-hacking), see e.g. here where an AlphaZero clone checks whether a board is checkmate.
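For concreteness, an explicit reward function of that kind might look like the following sketch. The `Outcome` enum and function name are hypothetical and are not taken from AlphaZero or any particular clone; only the +1 / −1 / 0 values come from the description above.

```python
from enum import Enum

class Outcome(Enum):
    """Hypothetical end-of-game outcomes, from the agent's point of view."""
    CHECKMATE_DELIVERED = "win"
    CHECKMATED = "loss"
    DRAW = "draw"

def terminal_reward(outcome: Outcome) -> float:
    """Reward at the end of a chess game: +1 win, -1 loss, 0 draw."""
    if outcome is Outcome.CHECKMATE_DELIVERED:
        return 1.0
    if outcome is Outcome.CHECKMATED:
        return -1.0
    return 0.0

rewards = [terminal_reward(o) for o in Outcome]
```

The point is just that the reward is a plainly labeled piece of source code that anyone can read, as opposed to something that has to be inferred from the trained weights.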
Another thing is, I’m obviously using “alignment” in a narrower sense than CEV (see the post—“the AGI is ‘trying’ to do what the programmer had intended for it to try to do…”)
Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that. The idea is: actor-critic model-based RL agents of the type I’m talking about evaluate possible plans using their learned value function, not their reward function, and these two don’t have to agree. Therefore, what they’re “trying” to do would not necessarily be to advance CEV, even if the reward function were perfect.
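Here is a minimal sketch of that last point, with made-up plan names and made-up numbers (all hypothetical). The reward function is stipulated to be perfect, but the planner never consults it when choosing; it ranks plans by the learned value function, which happens to be wrong about a plan that never came up in training.

```python
def reward_fn(plan: str) -> float:
    """Ground-truth reward, assumed (for the sake of argument) to be perfect."""
    return {"help_user": 1.0, "tidy_up": 0.5, "seize_reward_channel": -10.0}[plan]

# Learned value estimates. The first two plans were tried during training, so
# their values track reward reasonably well; the third was never tried, so its
# value is whatever generalization happened to produce.
learned_value = {"help_user": 0.9, "tidy_up": 0.4, "seize_reward_channel": 3.0}

def value_fn(plan: str) -> float:
    return learned_value[plan]

def choose_plan(plans, score):
    """Pick the plan with the highest score under the given scoring function."""
    return max(plans, key=score)

plans = ["help_user", "tidy_up", "seize_reward_channel"]
chosen = choose_plan(plans, value_fn)         # what the agent actually does
best_by_reward = choose_plan(plans, reward_fn)  # what a perfect reward would endorse
```

In this toy, `chosen` and `best_by_reward` differ even though `reward_fn` has no flaws, which is the gap that “inner alignment” is meant to name.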
If I’m still missing where you’re coming from, happy to keep chatting :)

Steven Byrnes Apr 3, 2025, 3:55 AM
LW: 15 AF: 10
2
AF
on: steve2152′s Shortform
In [Intro to brain-like-AGI safety] 10. The alignment problem and elsewhere, I’ve been using “outer alignment” and “inner alignment” in a model-based actor-critic RL context to refer to:
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
For some reason it took me until now to notice that:
- my “outer misalignment” is more-or-less synonymous with “specification gaming”,
- my “inner misalignment” is more-or-less synonymous with “goal misgeneralization”.
(I’ve been regularly using all four terms for years … I just hadn’t explicitly considered how they related to each other, I guess!)
I updated that post to note the correspondence, but also wanted to signal-boost this, in case other people missed it too.
~~
[You can stop reading here—the rest is less important]
If everybody agrees with that part, there’s a further question of “…now what?”. What terminology should I use going forward? If we have redundant terminology, should we try to settle on one?
One obvious option is that I could just stop using the terms “inner alignment” and “outer alignment” in the actor-critic RL context as above. I could even go back and edit them out of that post, in favor of “specification gaming” and “goal misgeneralization”. Or I could leave it. Or I could even advocate that other people switch in the opposite direction!
One consideration is: Pretty much everyone using the terms “inner alignment” and “outer alignment” is not using them in quite the way I am: I’m using them in the actor-critic model-based RL context, whereas they’re almost always using them in the model-free policy optimization context (e.g. evolution) (see §10.2.2). So that’s a cause for confusion, and a point in favor of my dropping those terms. On the other hand, I think people using the term “goal misgeneralization” are also almost always using it in a model-free policy optimization context. So actually, maybe that’s a wash? Either way, my usage is not a perfect match to how other people are using the terms, just pretty close in spirit. I’m usually the only one on Earth talking explicitly about actor-critic model-based RL AGI safety, so I kinda have no choice but to stretch existing terms sometimes.
Hmm, aesthetically, I think I prefer the “outer alignment” and “inner alignment” terminology that I’ve traditionally used. I think it’s a better mental picture. But in the context of current broader usage in the field … I’m not sure what’s best.
(Nate Soares dislikes the term “misgeneralization”, on the grounds that “misgeneralization” has a misleading connotation that “the AI is making a mistake by its own lights”, rather than “something is bad by the lights of the programmer”. I’ve noticed a few people trying to get the variation “goal malgeneralization” to catch on instead. That does seem like an improvement, maybe I’ll start doing that too.)

Steven Byrnes Mar 30, 2025, 6:51 PM
3 points
0
in reply to: LWLW’s comment on: LWLW’s Shortform
(Not really answering your question, just chatting.)
What’s your source for “JVN had ‘the physical intuition of a doorknob’”? Nothing shows up on google. I’m not sure quite what that phrase is supposed to mean, so context would be helpful. I’m also not sure what “extremely poor perceptual abilities” means exactly.
You might have already seen this, but Poincaré writes about “analysts” and “geometers”:
It is impossible to study the works of the great mathematicians, or even those of the lesser, without noticing and distinguishing two opposite tendencies, or rather two entirely different kinds of minds. The one sort are above all preoccupied with logic; to read their works, one is tempted to believe they have advanced only step by step, after the manner of a Vauban who pushes on his trenches against the place besieged, leaving nothing to chance. The other sort are guided by intuition and at the first stroke make quick but sometimes precarious conquests, like bold cavalrymen of the advance guard.
The method is not imposed by the matter treated. Though one often says of the first that they are analysts and calls the others geometers, that does not prevent the one sort from remaining analysts even when they work at geometry, while the others are still geometers even when they occupy themselves with pure analysis. It is the very nature of their mind which makes them logicians or intuitionalists, and they can not lay it aside when they approach a new subject.
Not sure exactly how that relates, if at all. (What category did Poincaré put himself in? It’s probably in the essay somewhere, I didn’t read it that carefully. I think geometer, based on his work? But Tao is extremely analyst, I think, if we buy this categorization in the first place.)
I’m no JVN/Poincaré/Tao, but if anyone cares, I think I’m kinda aphantasia-adjacent, and I think that fact has something to do with why I’m naturally bad at drawing, and why, when I was a kid doing math olympiad problems, I was worse at Euclidean geometry problems than my peers who got similar overall scores.