Steven Byrnes

Karma: 21,638

I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.

Steven Byrnes Apr 18, 2025, 6:50 PM
2 points
0
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
OK, here’s my argument that, if you take {intelligence, understanding, consequentialism} as a unit, it’s sufficient for everything:
- If durability and strength are helpful, then {intelligence, understanding, consequentialism} can discover that durability and strength are helpful, and then build durability and strength.
  - Even if “the exact ways in which durability and strength will be helpful” does not constitute a learnable pattern, “durability and strength will be helpful” is nevertheless a (higher-level) learnable pattern.
- If some other evolved aspects of the brain and body are helpful, then {intelligence, understanding, consequentialism} can likewise discover that they are helpful, and build them.
  - After all, if ‘those things are helpful’ wasn’t a learnable pattern, then evolution would not have discovered and exploited that pattern!
  - If the number of such aspects is dozens or hundreds or thousands, then whatever, {intelligence, understanding, consequentialism} can still get to work systematically discovering them all. The recipe for a human is not infinitely complex.
- If reducing heterogeneity is helpful, then {intelligence, understanding, consequentialism} can discover that fact, and figure out how to reduce heterogeneity.
- Etc.

Steven Byrnes Apr 17, 2025, 8:41 PM
6 points
6
in reply to: Eli Tyre’s comment on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
I kinda agree, but that’s more a sign that schools are bad at teaching things, than a sign that human brains are bad at flexibly applying knowledge. See my comment here.

Steven Byrnes Apr 17, 2025, 2:08 PM
5 points
5
in reply to: Matrice Jacobine’s comment on: ASI existential risk: Reconsidering Alignment as a Goal
See my other comment. I find it distressing that multiple people here are evidently treating acknowledgements as implying that the acknowledged person endorses the end product. I mean, it might or might be true in this particular case, but the acknowledgement is no evidence either way.
(For my part, I’ve taken to using the formula “Thanks to [names] for critical comments on earlier drafts”, in an attempt to preempt this mistake. Not sure if it works.)

Steven Byrnes Apr 17, 2025, 1:58 PM
6 points
7
in reply to: Seth Herd’s comment on: ASI existential risk: Reconsidering Alignment as a Goal
Chiang and Rajaniemi are on board
Let’s all keep in mind that the acknowledgement only says that Chiang and Rajaniemi had conversations with the author (Nielsen), and that Nielsen found those conversations helpful. For all we know, Chiang and Rajaniemi would strongly disagree with every word of this OP essay. If they’ve even read it.

Steven Byrnes Apr 16, 2025, 12:54 PM
2 points
0
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
Learning from strategies that stood the test of time would be tradition moreso than intelligence. I think tradition requires intelligence, but it also requires something else that’s less clear (and possibly not simple enough to be assembled manually, idk).
Right, that’s what I was gonna say. You need intelligence to sort out which traditions should be copied and which ones shouldn’t. There was a 13-billion-year “tradition” of not building e-commerce megastores, but Jeff Bezos ignored that “tradition”, and it worked out very well for him (and I’m happy about it too). Likewise, the Wright Brothers explicitly followed the “tradition” of how birds soar, but not the “tradition” of how birds flap their wings.
I do think there’s a “something else” (most [but not all] humans have an innate drive to follow and enforce social norms, more or less), but I don’t think it’s necessary. The Wright Brothers didn’t have any innate drive to copy anything about bird soaring tradition, but they did it anyway purely by intelligence.
Random street names aren’t necessarily important though?
I feel like I’ve lost the plot here. If you think there are things that are very important, but rare in the training data, and that LLMs consequently fail to learn, can you give an example?
Often the rare important things are very well known (after all, they are important, so people put a lot of effort into knowing them), they just can’t efficiently be derived from empirical data (except essentially by copying someone else’s conclusion blindly, and that leaves you vulnerable to deception).
I guess you’re using “empirical data” in a narrow sense. If Joe tells me X, I have gained “empirical data” that Joe told me X. And then I can apply my intelligence to interpret that “data”. For example, I can consider a number of hypotheses: the hypothesis that Joe is correct and honest, that Joe is mistaken but honest, that Joe is trying to deceive me, that Joe said Y but I misheard him, etc. And then I can gather or recall additional evidence that favors one of those hypotheses over another. I could ask Joe to repeat himself, to address the “I misheard him” hypothesis. I could consider how often I have found Joe to be mistaken about similar things in the past. I could ask myself whether Joe would benefit from deceiving me. Etc.
This is all the same process that I might apply to other kinds of “empirical data” like if my car was making a funny sound. I.e., consider possible generative hypotheses that would match the data, then try to narrow down via additional observations, and/or remain uncertain and prepare for multiple possibilities when I can’t figure it out. This is a middle road between “trusting people blindly” versus “ignoring everything that anyone tells you”, and it’s what reasonable people actually do. Doing that is just intelligence, not any particular innate human tendency—smart autistic people and smart allistic people and smart callous sociopaths etc. are all equally capable of traveling this middle road, i.e. applying intelligence towards the problem of learning things from what other people say.
(For example, if I was having this conversation with almost anyone else, I would have quit, or not participated in the first place. But I happen to have prior knowledge that you-in-particular have unusual and well-thought-through ideas, and even they’re wrong, they’re often wrong in very unusual and interesting ways, and that you don’t tend to troll, etc.)
I feel like I’m misunderstanding you somehow. You keep saying things that (to me) seem like you could equally well argue that humans cannot possibly survive in the modern world, but here we are. Do you have some positive theory of how humans survive and thrive in (and indeed create) historically-unprecedented heterogeneous environments?

Steven Byrnes Apr 16, 2025, 11:38 AM
2 points
0
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
If your model if underparameterized (which I think is true for the typical model?), then it can’t learn any patterns that only occurs once in the data. And even if the model is overparameterized, it still can’t learn any pattern that never occurs in the data.
Dunno if anything’s changed since 2023, but this says LLMs learn things they’ve seen exactly once in the data.
I can vouch that you can ask LLMs about things that are extraordinarily rare in the training data—I’d assume well under once per billion tokens—and they do pretty well. E.g. they know lots of random street names.
Humans successfully went to the moon, despite it being a quite different environment that they had never been in before. And they didn’t do that with “durability, strength, healing, intuition, tradition”, but rather with intelligence.
Speaking of which, one can apply intelligence towards the problem of being resilient to unknown unknowns, and one would come up with ideas like durability, healing, learning from strategies that have stood the test of time (when available), margins of error, backup systems, etc.

Steven Byrnes Apr 15, 2025, 11:08 PM
4 points
2
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
I think you’re conflating consequentialism and understanding in a weird-to-me way. (Or maybe I’m misunderstanding.)
I think consequentialism is related to choosing one action versus another action. I think understanding (e.g. predicting the consequence of an action) is different, and that in practice understanding has to involve self-supervised learning.
(I think human brains have both [partly-] consequentialist decisions and self-supervised updating of the world-model.) (They’re not totally independent, but rather they interact via training data: e.g. [partly-] consequentialist decision-making determines how you move your eyes, and then whatever your eyes are pointing at, your model of the visual world will then update by self-supervised learning on that particular data. But still, these are two systems that interact, not the same thing.)
I think self-supervised learning is perfectly capable of discovering rare but important patterns. Just look at today’s foundation models, which seem pretty great at that.

Steven Byrnes Apr 15, 2025, 5:34 PM
LW: 2 AF: 2
0
AF
on: Evaluating the historical value misspecification argument
I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:
1. An RLHF’d foundation model acts as the judge / utility function; and some separate system comes up with plans that optimize it—a.k.a. “you just need to build a function maximizer that allows you to robustly maximize the utility function that you’ve specified”.
  1. I think this plan fails because RLHF’d foundation models have adversarial examples today, and will continue to have adversarial examples into the future. (To be clear, humans have adversarial examples too, e.g. drugs & brainwashing.)
2. There is no “separate system”, but rather an RLHF’d foundation model (or something like it) is the whole system that we’re talking about here. For example, we may note that, if you hook up an RLHF’d foundation model to tools and actuators, then it will actually use those tools and actuators in accordance with common-sense morality etc.
(I think 2 is the main intuition driving the OP, and 1 was a comments-section derailment.)
As for 2:
- I’m sympathetic to the argument that this system might not be dangerous, but I think its load-bearing ingredient is that pretraining leads to foundation models tending to do intelligent things primarily by emitting human-like outputs for human-like reasons, thanks in large part to self-supervised pretraining on internet text. Let’s call that “imitative learning”.
- Indeed, here’s a 2018 post where Eliezer (as I read it) implies that he hadn’t really been thinking about imitative learning before (he calls it an “interesting-to-me idea”), and suggests that imitative learning might “bypass the usual dooms of reinforcement learning”. So I think there is a real update here—if you believe that imitative learning can scale to real AGI.
- …But the pushback (from Rob and others) in the comments is mostly coming from a mindset where they don’t believe that imitative learning can scale to real AGI. I think the commenters are failing to articulating this mindset well, but I think they are in various places leaning on certain intuitions about how future AGI will work (e.g. beliefs deeply separate from goals), and these intuitions are incompatible with imitative learning being the primary source of optimization power in the AGI (as it is today, I claim).
- (A moderate position is that imitative learning will be less and less relevant as e.g. o1-style RL post-training becomes a bigger relative contribution to the trained weights; this would presumably lead to increased future capabilities hand-in-hand with increased future risk of egregious scheming. For my part, I subscribe to the more radical theory that our situation is even worse than that: I think future powerful AGI will be built via a different AI training paradigm that basically throws out the imitative learning part altogether.)
- I have a forthcoming post (hopefully) that will discuss this much more and better.

Steven Byrnes Apr 15, 2025, 1:21 PM
8 points
0
in reply to: tailcalled’s comment on: johnswentworth’s Shortform
(IMO this is kinda unrelated to the OP, but I want to continue this thread.)
Have you elaborated on this anywhere?
Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” ;-)
I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?

Steven Byrnes Apr 13, 2025, 11:21 PM
10 points
4
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
For context, my lower effort posts are usually more popular.
mood

Steven Byrnes Apr 13, 2025, 11:18 PM
26 points
4
in reply to: Richard_Ngo’s comment on: ricraz’s Shortform
In run-and-tumble motion, “things are going well” implies “keep going”, whereas “things are going badly” implies “choose a new direction at random”. Very different! And I suggest in §1.3 here that there’s an unbroken line of descent from the run-and-tumble signal in our worm-like common ancestor with C. elegans, to the “valence” signal that makes things seem good or bad in our human minds. (Suggestively, both run-and-tumble in C. elegans, and the human valence, are dopamine signals!)
So if some idea pops into your head, “maybe I’ll stand up”, and it seems appealing, then you immediately stand up (the human “run”); if it seems unappealing on net, then that thought goes away and you start thinking about something else instead, semi-randomly (the human “tumble”).
So positive and negative are deeply different. Of course, we should still call this an RL algorithm. It’s just that it’s an RL algorithm that involves a (possibly time- and situation-dependent) heuristic estimator of the expected value of a new random plan (a.k.a. the expected reward if you randomly tumble). If you’re way above that expected value, then keep doing whatever you’re doing; if you’re way below the threshold, re-roll for a new random plan.
As one example of how this ancient basic distinction feeds into more everyday practical asymmetries between positive and negative motivations, see my discussion of motivated reasoning here, including in §3.3.3 the fact that “it generally feels easy and natural to brainstorm / figure out how something might happen, when you want it to happen. Conversely, it generally feels hard and unnatural to figure out how something might happen, when you want it to not happen.”

Steven Byrnes Apr 12, 2025, 7:00 PM
9 points
0
on: What is autism?
I kinda think of the main clusters of symptoms as: (1) sensory sensitivity, (2) social symptoms, (3) different “learning algorithm hyperparameters”.
More specifically, (1) says: innate sensory reactions (e.g. startle reflex, orienting reflex) are so strong that they’re often overwhelming. (2) says: innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. (3) includes atypical patterns of learning & memory including the gestalt pattern of childhood language acquisition which is common but not universal among autistic kids.
People respond to (1) in various ways, including cutting off the scratchy tags at the back of their shirts, squeeze machines, weighted blankets, etc., plus maybe stimming (although I’m not sure if that’s the right explanation for stimming).
People respond to (2) by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy to avoid triggering it, that they’ve used since early childhood, because the reaction is so unpleasant. Of course, they’ll still understand intellectually perfectly well that the person is angry. As one consequence of that, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.
Anyway, it seems intuitively sensible that a single underlying cause, namely something like “trigger-happy neurons” (see discussion of the valproic acid model here), often leads to all three of the (1-3) symptom clusters, along with the other common symptoms like proneness-to-seizures and 10-minute screaming tantrums. At the same time, I think people can get subsets of those clusters of symptoms for various different underlying reasons. For example, one of my kids is a late talker with very strong (3), but little-if-any (1-2). He has an autism diagnosis. (I’m pretty sure he wouldn’t have gotten one 20 years ago.) My other kid has strong nerdy autistic-like “special interests”—and I expect him to wind up as an adult who (like me) has many autistic friends—but I think he’s winding up with those behaviors from a rather different root cause.
Much more at my old post Intense World Theory of Autism.
I’m also interested in book recommendations or recommendations for other resources where I can learn more.
I thought NeuroTribes was really great, that’s my one recommendation if I had to pick one. If I had to pick three, I would also throw in the two John Elder Robison books I read. In Look Me in the Eye, he talks about growing up with (what used to be called) Asperger’s; even more interestingly, in Switched On he describes his experience with Transcranial Magnetic Stimulation, which led (in my interpretation) to his reintroduction to those innate social reactions that (as mentioned above) he had learned at a very young age to generally avoid triggering via unconscious attention-control coping strategies, since the reactions were overwhelming and unpleasant.

Steven Byrnes Apr 11, 2025, 8:09 PM
3 points
0
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
Thanks! Oddly enough, in that comment I’m much more in agreement with the model you attribute to yourself than the model you attribute to me. ¯\_(ツ)_/¯
the value function doesn’t understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate
Think of it as a big table that roughly-linearly assigns good or bad vibes to all the bits and pieces that comprise a thought, and adds them up into a scalar final answer. And a plan is just another thought. So “I’m gonna get that candy and eat it right now” is a thought, and also a plan, and it gets positive vibes from the fact that “eating candy” is part of the thought, but it also gets negative vibes from the fact that “standing up” is part of the thought (assume that I’m feeling very tired right now). You add those up into the final value / valence, which might or might not be positive, and accordingly you might or might not actually get the candy. (And if not, some random new thought will pop into your head instead.)
Why does the value function assign positive vibes to eating-candy? Why does it assign negative vibes to standing-up-while-tired? Because of the past history of primary rewards via (something like) TD learning, which updates the value function.
Does the value function “understand the content”? No, the value function is a linear functional on the content of a thought. Linear functionals don’t understand things. :)
(I feel like maybe you’re going wrong by thinking of the value function and Thought Generator as intelligent agents rather than “machines that are components of a larger machine”?? Sorry if that’s uncharitable.)
[the value function] only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be “when there’s a thought that the world model thinks is valid and it is associated to the (self-model-invoking) thought “this is bad for accomplishing my goals”, then it lowers its value estimate.
The value function is a linear(ish) functional whose input is a thought. A thought is an object in some high-dimensional space, related to the presence or absence of all the different concepts comprising it. Some concepts are real-world things like “candy”, other concepts are metacognitive, and still other concepts are self-reflective. When a metacognitive and/or self-reflective concept is active in a thought, the value function will correspondingly assign extra positive or negative vibes—just like if any other kind of concept is active. And those vibes depending on the correlations of those concepts with past rewards via (something like) TD learning.
So “I will fail at my goals” would be a kind of thought, and TD learning would gradually adjust the value function such that this thought has negative valence. And this thought can co-occur with or be a subset of other thoughts that involve failing at goals, because the Thought Generator is a machine that learns these kinds of correlations and implications, thanks to a different learning algorithm that sculpts it into an ever-more-accurate predictive world-model.

Steven Byrnes Apr 11, 2025, 5:50 PM
14 points
11
in reply to: Cole Wyeth’s comment on: Reactions to METR task length paper are insane
I think the interest rate thing provides so little evidence either way that it’s misleading to even mention it. See the EAF comments on that post, and also Zvi’s rebuttal. (Most of that pushback also generalizes to your comment about the S&P.) (For context, I agree that AGI in ≤2030 is unlikely.)

Steven Byrnes Apr 9, 2025, 2:56 AM
LW: 7 AF: 4
0
AF
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
Thanks! Basically everything you wrote importantly mismatches my model :( I think I can kinda translate parts; maybe that will be helpful.
Background (§8.4.2): The thought generator settles on a thought, then the value function assigns a “valence guess”, and the brainstem declares an actual valence, either by copying the valence guess (“defer-to-predictor mode”), or overriding it (because there’s meanwhile some other source of ground truth, like I just stubbed my toe).
Sometimes thoughts are self-reflective. E.g. “the idea of myself lying in bed” is a different thought from “the feel of the pillow on my head”. The former is self-reflective—it has me in the frame—the latter is not (let’s assume).
All thoughts can be positive or negative valence (motivating or demotivating). So self-reflective thoughts can be positive or negative valence, and non-self-reflective thoughts can also be positive or negative valence. Doesn’t matter, it’s always the same machinery, the same value function / valence guess / thought assessor. That one function can evaluate both self-reflective and non-self-reflective thoughts, just as it can evaluate both sweater-related thoughts and cloud-related thoughts.
When something seems good (positive valence) in a self-reflective frame, that’s called ego-syntonic, and when something seems bad in a self-reflective frame, that’s called ego-dystonic.
Now let’s go through what you wrote:
1. humans have a self-model which can essentially have values different from the main value function
I would translate that into: “it’s possible for something to seem good (positive valence) in a self-reflective frame, but seem bad in a non-self-reflective frame. Or vice-versa.” After all, those are two different thoughts, so yeah of course they can have two different valences.
2. the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates
I would translate that into: “there’s a decent amount of coherence / self-consistency in the set of thoughts that seem good or bad in a self-reflective frame, and there’s less coherence / self-consistency in the set of things that seem good or bad in a non-self-reflective frame”.
(And there’s a logical reason for that; namely, that hard thinking and brainstorming tends to bring self-reflective thoughts to mind — §8.5.5 — and hard thinking and brainstorming is involved in reducing inconsistency between different desires.)
3. The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
This one is more foreign to me. A self-reflective thought can have positive or negative valence for the same reasons that any other thought can have positive or negative valence—because of immediate rewards, and because of the past history of rewards, via TD learning, etc.
One thing is: someone can develop a learned metacognitive habit to the effect of “think self-reflective thoughts more often” (which is kinda synonymous with “don’t be so impulsive”). They would learn this habit exactly to the extent and in the circumstances that it has led to higher reward / positive valence in the past.
4. Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it’s best to just trust the self-model and that this will likely lead to reward.
If someone gets in the habit of “think self-reflective thoughts all the time” a.k.a. “don’t be so impulsive”, then their behavior will be especially strongly determined by which self-reflective thoughts are positive or negative valence.
But “which self-reflective thoughts are positive or negative valence” is still determined by the value function / valence guess function / thought assessor in conjunction with ground-truth rewards / actual valence—which in turn involves the reward function, and the past history of rewards, and TD learning, blah blah. Same as any other kind of thought.
…I won’t keep going with your other points, because it’s more of the same idea.
Does that help explain where I’m coming from?

Steven Byrnes Apr 8, 2025, 2:35 AM
4 points
2
in reply to: gwern’s comment on: Auditing language models for hidden objectives
I am a human, but if you ask me whether I want to ditch my family and spend the rest of my life in an Experience Machine, my answer is no.
(I do actually think there’s a sense in which “people optimize reward”, but it’s a long story with lots of caveats…)

Steven Byrnes Apr 5, 2025, 9:21 PM
10 points
8
on: Prediction Markets Are Mediocre
I downvoted because the conclusion “prediction markets are mediocre” does not follow from the premise “here is one example of one problem that I imagine abundant legal well-capitalized prediction markets would not have completely solved (even though I acknowledge that they would have helped move things in the right direction on the margin)”.

Steven Byrnes Apr 4, 2025, 12:45 PM
7 points
2
in reply to: faul_sname’s comment on: AI 2027: What Superintelligence Looks Like
That excerpt says “compute-efficient” but the rest of your comment switches to “sample efficient”, which is not synonymous, right? Am I missing some context?

Steven Byrnes Apr 4, 2025, 12:40 PM
12 points
2
in reply to: OVERmind’s comment on: AI 2027: What Superintelligence Looks Like
Pretty sure “DeepCent” is a blend of DeepSeek & Tencent—they have a footnote: “We consider DeepSeek, Tencent, Alibaba, and others to have strong AGI projects in China. To avoid singling out a specific one, our scenario will follow a fictional “DeepCent.””. And I think the “brain” in OpenBrain is supposed to be reminiscent of the “mind” in DeepMind.
ETA: Scott Alexander tweets with more backstory on how they settled on “OpenBrain”: “You wouldn’t believe how much work went into that stupid name…”

Steven Byrnes Apr 3, 2025, 5:34 PM
LW: 2 AF: 2
0
AF
in reply to: Towards_Keeperhood’s comment on: steve2152′s Shortform
I was just imagining a fully omnicient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle.
OK, let’s attach this oracle to an AI. The reason this thought experiment is weird is because the goodness of an AI’s action right now cannot be evaluated independent of an expectation about what the AI will do in the future. E.g., if the AI says the word “The…”, is that a good or bad way for it to start its sentence? It’s kinda unknowable in the absence of what its later words will be.
So one thing you can do is say that the AI bumbles around and takes reversible actions, rolling them back whenever the oracle says no. And the oracle is so good that we get CEV that way. This is a coherent thought experiment, and it does indeed make inner alignment unnecessary—but only because we’ve removed all the intelligence from the so-called AI! The AI is no longer making plans, so the plans don’t need to be accurately evaluated for their goodness (which is where inner alignment problems happen).
Alternately, we could flesh out the thought experiment by saying that the AI does have a lot of intelligence and planning, and that the oracle is doing the best it can to anticipate the AI’s behavior (without reading the AI’s mind). In that case, we do have to worry about the AI having bad motivation, and tricking the oracle by doing innocuous-seeming things until it suddenly deletes the oracle subroutine out of the blue (treacherous turn). So in that version, the AI’s inner alignment is still important. (Unless we just declare that the AI’s alignment is unnecessary in the first place, because we’re going to prevent treacherous turns via option control.)
However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it’s just about deception), and I think it’s not:
Yeah I mostly think this part of your comment is listing reasons that inner alignment might fail, a.k.a. reasons that goal misgeneralization / malgeneralization can happen. (Which is a fine thing to do!)
If someone thinks inner misalignment is synonymous with deception, then they’re confused. I’m not sure how such a person would have gotten that impression. If it’s a very common confusion, then that’s news to me.
Inner alignment can lead to deception. But outer alignment can lead to deception too! Any misalignment can lead to deception, regardless of whether the source of that misalignment was “outer” or “inner” or “both” or “neither”.
“Deception” is deliberate by definition—otherwise we would call it by another term, like “mistake”. That’s why it has to happen after there are misaligned motivations, right?
Overall, I think the outer-vs-inner framing has some implicit connotation that for inner alignment we just need to make it internalize the ground-truth reward
OK, so I guess I’ll put you down as a vote for the terminology “goal misgeneralization” (or “goal malgeneralization”), rather than “inner misalignment”, as you presumably find that the former makes it more immediately obvious what the concern is. Is that fair? Thanks.
I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution.
You could avoid talking about utility functions by saying “the learned value function just predicts reward”, and that may work while you’re staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you’re going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.
I think I fully agree with this in spirit but not in terminology!
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
I also wouldn’t say “the learned value function just predicts reward”. The learned value function starts randomly initialized, and then it’s updated by TD learning or whatever, and then it eventually winds up with some set of weights at some particular moment, which can take inputs and produce outputs. That’s the system. We can put a comment in the code that says the value function is “supposed to” predict reward, and of course that code comment will be helpful for illuminating why the TD learning update code is structured the way is etc. But that “supposed to” is just a code comment, not the code itself. Will it in fact predict reward? That’s a complicated question about algorithms. In distribution, it will probably predict reward pretty accurately; out of distribution, it probably won’t; but with various caveats on both sides.
And then if we ask questions like “what is the AI trying to do right now” or “what does the AI desire”, the answer would mainly depend on the value function.
Actually, it may be useful to distinguish two kinds of this “utility vs reward mismatch”:
1. Utility/reward being insufficiently defined outside of training distribution (e.g. for what programs to run on computronium).
2. What things in the causal chain producing the reward are the things you actually care about? E.g. that the reward button is pressed, that the human thinks you did something well, that you did something according to some proxy preferences.
I’ve been lumping those together under the heading of “ambiguity in the reward signal”.
The second one would include e.g. ambiguity between “reward for button being pressed” vs “reward for human pressing the button” etc.
The first one would include e.g. ambiguity between “reward for being-helpful-variant-1” vs “reward for being-helpful-variant-2”, where the two variants are indistinguishable in-distribution but have wildly differently opinions about OOD options like brainwashing or mind-uploading.
Another way to think about it: the causal chain intuition is also an OOD issue, because it only becomes a problem when the causal chains are always intact in-distribution but they can come apart in new ways OOD.