I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.
under your theory, what prevents reward hacking through forming a group and then just directly maxing out on mutually liking/admiring each other?
It’s hard for me to give a perfectly confident answer because I don’t understand everything about human social instincts yet :) But here are some ways I’m thinking about that:
Best friends, and friend groups, do exist, and people do really enjoy and value them.
The “drive to feel liked / admired” is particularly geared towards feeling liked / admired by people who feel important to you, i.e. where interacting with them is high stakes (induces physiological arousal). (I got that wrong the first time, but changed my mind here.) Physiological arousal is in turn grounded in other ways—e.g., pain, threats, opportunities, etc. So anyway, if Zoe has extreme liking / admiration for me, so much that she’ll still think I’m great and help me out no matter what I say or do, then from my perspective, it may start feeling lower-stakes for me to interact with Zoe. Without that physiological arousal, I stop getting so much reward out of her admiration (“I take her for granted”) and go looking for a stronger hit of drive-to-feel-liked/admired by trying to impress someone else who feels higher-stakes for me to interact with. (Unless there’s some extra sustainable source of physiological arousal in my relationship with Zoe, e.g. we’re madly in love, or she’s the POTUS, or whatever.)
As I wrote in Social status part 1/2: negotiations over object-level preferences, there’s a zero-sum nature of social leading vs following. If I want to talk about trains and Zoe wants to talk about dinosaurs, we can’t both get everything we want; one of us is going to have our desires frustrated, at least to some extent. So if I’m doing exactly what I want, and Zoe is strongly following / deferring to me on everything, then that’s a credible and costly signal that Zoe likes / admires me (or fears me). And we can’t really both be sending those signals to each other simultaneously. Also, sending those signals runs up against every other aspect of my own reward function—eating-when-hungry, curiosity drive, etc., because those determine the object-level preferences that I would be neglecting in favor of following Zoe’s whims.
When applying these ideas to AI, how do you plan to deal with the potential problem of distributional shifts happening faster than we can edit the reward function?
As I always say, my claim is “understanding human social instincts seems like it would probably be useful for Safe & Beneficial AGI”, not “here is my plan for Safe & Beneficial AGI…”. :) That said, copying from §5.2 here:
…Alternatively, there’s a cop-out option! If the outcome of the [distributional shifts] is so hard to predict, we could choose to endorse the process while being agnostic about the outcome.
There is in fact a precedent for that—indeed, it’s the status quo! We don’t know what the next generation of humans will choose to do, but we’re nevertheless generally happy to entrust the future to them. The problem is the same—the next generation of humans will [figure out new things and invent new technologies], causing self-generated distribution shifts, and ending up going in unpredictable directions. But this is how it’s always been, and we generally feel good about it. [Not that we’ve ever had a choice!]
So the idea here is, we could make AIs with social and moral drives that we’re happy about, probably because those drives are human-like in some respects [but probably not in all respects!], and then we could define “good” as whatever those AIs wind up wanting, upon learning and reflection.
“This problem has a solution (and one that can be realistically implemented)” is another important crux, I think. As I wrote here: “For one thing, we don’t actually know for sure that this technical problem is solvable at all, until we solve it. And if it’s not in fact solvable, then we should not be working on this research program at all. If it’s not solvable, the only possible result of this research program would be “a recipe for summoning demons”, so to speak. And if you’re scientifically curious about what a demon-summoning recipe would look like, then please go find something else to be scientifically curious about instead.”
I have other retorts too, but I’m not sure it’s productive for me to argue against a position that you don’t endorse yourself, but rather are imagining that someone else has. We can see if anyone shows up here who actually endorses something like that.
Anyway, if Silver were to reply “oops, yeah, the reward function plan that I described doesn’t work, in the future I’ll say it’s an unsolved problem”, then that would be a big step in the right direction. It wouldn’t be remotely sufficient, but it would be a big step in the right direction, and worth celebrating.
The RL algorithms that people talk about in AI traditionally feature an exponentially-discounted sum of future rewards, but I don’t think there are any exponentially-discounted sums of future rewards in biology (more here). Rather, you have an idea (“I’m gonna go to the candy store”), and the idea seems good or bad, and if it seems sufficiently good, then you do it! (More here.) It can seem good for lots of different reasons. One possible reason is: the idea is immediately associated with (non-behaviorist) primary reward. Another possible reason is: the idea involves some concept that seems good, and the concept seems good in turn because it has tended to immediately precede primary reward in the past. Thus, when the idea “I’m gonna go to the candy store” pops into your head, that incidentally involves the “eating candy” concept also being rather active in your head (active right now, as you entertain that idea), and the “eating candy” concept is motivating (because it has tended to immediately precede primary reward), so the idea seems good and off you go to the store.
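For reference, here is a minimal statement of the standard object I’m contrasting against, namely the exponentially-discounted return that traditional RL algorithms are built to estimate and maximize (with reward $r_t$ and discount factor $\gamma$):

$$G_t \;=\; \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}, \qquad 0 \le \gamma < 1$$

My claim is that the brain isn’t computing or estimating anything like $G_t$; the evaluation is of whatever idea you’re entertaining right now.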
“We predict our future feelings” is an optional thing that might happen, but it’s just a special case of the above, the way I think about it.
what kind of learning could lead to this behavior? Maybe RL in some cases, maybe imitation learning in some cases, or maybe it needs the agent to be structured a certain way.
This doesn’t really parse for me … The reward function is an input to learning, it’s not itself learned, right? (Well, you can put separate learning algorithms inside the reward function if you want to.) Anyway, I’m all in on model-based RL. I don’t think imitation learning is a separate thing for humans, for reasons discussed in §2.3.
My model is simpler, I think. I say: The human brain is some yet-to-be-invented variation on actor-critic model-based reinforcement learning. The reward function (a.k.a. “primary reward” a.k.a. “innate drives”) has a bunch of terms: eating-when-hungry is good, suffocation is bad, pain is bad, etc. Some of the terms are in the category of social instincts, including something that amounts to “drive to feel liked / admired”.
(Warning: All of these English-language descriptions like “pain is bad” are approximate glosses on what’s really going on, which is only describable in terms like “blah blah innate neuron circuits in the hypothalamus are triggering each other and gating each other etc.”. See my examples of laughter and social instincts. But “pain is bad” is a convenient shorthand.)
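As a loose illustration of the structure I have in mind (nothing here is a model of actual hypothalamus / brainstem circuitry; the term names and weights are made-up placeholders), a reward function with a bunch of terms might look like:

```python
def primary_reward(state: dict) -> float:
    """Toy sketch of a reward function with multiple innate-drive terms.
    All term names and weights are illustrative placeholders, not claims
    about the real neural implementation."""
    r = 0.0
    r += 1.0 if state["eating_while_hungry"] else 0.0   # eating-when-hungry is good
    r -= 10.0 if state["suffocating"] else 0.0           # suffocation is bad
    r -= 5.0 * state["pain_level"]                        # pain is bad
    # One of the social-instinct terms, roughly "drive to feel liked / admired",
    # gated by how high-stakes (physiologically arousing) the interaction feels:
    r += 2.0 * state["feeling_admired"] * state["physiological_arousal"]
    return r
```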
So for the person getting tortured, keeping the secret is negative reward in some ways (because pain), and positive reward in other ways (because of “drive to feel liked / admired”). At the end of the day, they’ll do what seems most motivating, which might or might not be to reveal the secret, depending on how things balance out.
So in particular, I disagree with your claim that, in the torture scenario, “reward hacking” → reveal the secret. The social rewards are real rewards too.
evolution…could have made our neural architecture (e.g. connections) and learning algorithms (e.g. how/which weights are updated in response to training signals) quite complex, in order to avoid types of “reward hacking” that had the highest negative impacts in our EEA.
I’m unaware of any examples where neural architectures and learning algorithms are micromanaged to avoid reward hacking. Yes, avoiding reward hacking is important, but I think it’s solvable (in EEA) by just editing the reward function. (Do you have any examples in mind?)
Solving the Riemann hypothesis is not a “primary reward” / “innate drive” / part of the reward function for humans. What is? Among many other things, (1) the drive to satisfy curiosity / alleviate confusion, and (2) the drive to feel liked / admired. And solving the Riemann hypothesis leads to both of those things. I would surmise that (1) and/or (2) is underlying people’s desire to solve the Riemann hypothesis, although that’s just a guess. They’re envisioning solving the Riemann hypothesis and thus getting the (1) and/or (2) payoff.
So one way that people “reward hack” for (1) and (2) is that they find (1) and (2) motivating and work hard and creatively towards triggering them in all kinds of ways, e.g. crossword puzzles for (1) and status-seeking for (2).
Relatedly, if you tell the mathematician “I’ll give you a pill that I promise will lead to you experiencing a massive hit of both ‘figuring out something that feels important to you’ and ‘reveling in the admiration of people who feel important to you’. But you won’t actually solve the Riemann hypothesis.” They’ll say “Well, hmm, sounds like I’ll probably solve some other important math problem instead. Works for me! Sign me up!”
If instead you say “I’ll give you a pill that leads to a false memory of having solved the Riemann hypothesis”, they’ll say no. After all, the payoff is (1) and (2), and it isn’t there.
If instead you say “I’ll give you a pill that leads to the same good motivating feeling that you’d get from (1) and (2), but it’s not actually (1) or (2)”, they’ll say “you mean, like, cocaine?”, and you say “yeah, something like that”, and they say “no thanks”. This is the example of reward misgeneralization that I mentioned in the post—deliberately avoiding addictive drugs.
If instead you say “I’ll secretly tell you the solution to the Riemann hypothesis, and you can take full credit for it, so you get all the (2)”, … at least some people would say yes. I feel like, in the movie trope where people have a magic wish, they sometimes wish to be widely liked and famous without having to do any of the hard work to get there.
The interesting question for this one is, why would anybody not say yes? And I think the answer is: those funny non-behaviorist human social instincts. Basically, in social situations, primary rewards fire in a way that depends in a complicated way on what you’re thinking about. In particular, I’ve been using the term drive to feel liked / admired, and that’s part of it, but it’s also an oversimplification that hides a lot of more complex wrinkles. The upshot is that lots of people would not feel motivated by the prospect of feeling liked / admired under false pretenses, or more broadly feeling liked / admired for something that has neutral-to-bad vibes in their own mind.
Does that help? Sorry if I missed your point.
I suggest making anonymity compulsory
It’s an interesting idea, but the track records of the grantees are important information, right? And if the track record includes, say, a previous paper that the funder has already read, then you can’t submit the paper with author names redacted.
Also, ask people seeking funding to make specific, unambiguous, easily falsifiable predictions of positive outcomes from their work. And track and follow up on this!
Wouldn’t it be better for the funder to just say “if I’m going to fund Group X for Y months / years of work, I should see what X actually accomplished in the last Y months / years, and assume it will be vaguely similar”? And if Group X has no comparable past experience, then fine, but that equally means that you have no basis for believing their predictions right now.
Also, what if someone predicts that they’ll do A, but then realizes it would be better if they did B? Two possibilities are: (1) You the funder trust their judgment. Then you shouldn’t be putting even minor mental barriers in the way of their pivoting. Pivoting is hard and very good and important! (2) You the funder don’t particularly trust the recipient’s judgment, and you were only funding it because you wanted that specific deliverable. But then the normal procedure is that the funder and recipient work together to determine the deliverables that the funder wants and that the recipient is able to provide. Like, if I’m funding someone to build a database of AI safety papers, then I wouldn’t ask them to “make falsifiable predictions about the outcomes from their work”, instead I would negotiate a contract with them that says they’re gonna build the database. Right? I mean, I guess you could call that a falsifiable prediction, of sorts, but it’s a funny way to talk about it.
You might find this post helpful? Self-dialogue: Do behaviorist rewards make scheming AGIs? In it, I talk a lot about whether the algorithm is explicitly thinking about reward or not. I think it depends on the setup.
(But I don’t think anything I wrote in THIS post hinges on that. It doesn’t really matter whether (1) the AI is sociopathic because being sociopathic just seems to it like part of the right and proper way to be, versus (2) the AI is sociopathic because it is explicitly thinking about the reward signal. Same result.)
subjecting it to any kind of reinforcement learning at all
When you say that, it seems to suggest that what you’re really thinking about is the LLMs as of today, for which the vast majority of their behavioral tendencies come from pretraining, and there’s just a bit of RL sprinkled on top to elevate some pretrained behavioral profiles over other pretrained behavioral profiles. Whereas what I am normally thinking about (as are Silver & Sutton, IIUC) is that either 100% or ≈100% of the system’s behavioral tendencies are ultimately coming from some kind of RL system. This is quite an important difference! It has all kinds of implications. More on that in a (hopefully) forthcoming post.
don’t necessarily lead to coherent behavior out of distribution
As mentioned in my other comment, I’m assuming (like Sutton & Silver) that we’re doing continuous learning, as is often the case in the RL literature (unlike LLMs). So every time the agent does some out-of-distribution thing, it stops being out of distribution! So yes there are a couple special cases in which OOD stuff is important (namely irreversible actions and deliberately not exploring certain states), but you shouldn’t be visualizing the agent as normally spending its time in OOD situations—that would be self-contradictory. :)
I think that sounds off to AI researchers. They might (reasonably) think something like “during the critical value formation period the AI won’t have the ability to force humans to give positive feedback without receiving negative feedback”.
If an AI researcher said “during the critical value formation period, AlphaZero-chess will learn that it’s bad to lose your queen, and therefore it will never be able to recognize the value of a strategic queen sacrifice”, then that researcher would be wrong.
(But also, I would be very surprised if they said that in the first place! I’ve never heard anyone in AI use the term “critical value formation periods”.)
RL algorithms can get stuck in local optima of course, as can any other ML algorithm, but I’m implicitly talking about future powerful RL algorithms, algorithms that can do innovative science, run companies, etc., which means that they’re doing a good job of exploring a wide space of possible strategies and not just getting stuck in the first thing they come across.
I think your example [“the AI can potentially get a higher score by forcing the human to give positive feedback, or otherwise exploiting edge-cases in how this feedback is operationalized and measured”] isn’t very realistic—more likely would be that the value function just learned a complex proxy for predicting reward which totally misgeneralizes in alien ways once you go significantly off distribution.
I disagree. I think you’re overgeneralizing from RL algorithms that don’t work very well (e.g. RLHF), to RL algorithms that do work very well, like human brains or the future AI algorithms that I think Sutton & Silver have in mind.
For example, if I apply your logic there to humans 100,000 years ago, it would fail to predict the fact that humans would wind up engaging in activities like: eating ice cream, playing video games, using social media, watching television, raising puppies, virtual friends, fentanyl, etc. None of those things are “a complex proxy for predicting reward which misgeneralizes”, rather they are a-priori-extraordinarily-unlikely strategies, that do strongly trigger the human innate reward function, systematically and by design.
Conversely, I think you’re overstating the role of goal misgeneralization. Specifically, goal misgeneralization usually corrects itself: If there’s an OOD action or plan that seems good to the agent because of goal misgeneralization, then the agent will do that action or plan, and then the reward function will update the value function, and bam, now it’s no longer OOD, and it’s no longer misgeneralizing in that particular way. Remember, we’re talking about agents with continuous online learning.
Goal misgeneralization is important in a couple special cases—the case where it leads to irreversible actions (like editing the reward function or self-replicating), and the case where it leads to deliberately not exploring certain states. In these special cases, the misgeneralization can’t necessarily correct itself. But usually it does, so I think your description there is wrong.
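If it helps, here’s a minimal sketch of the self-correction dynamic I’m describing, under the assumption of continuous online learning (the object names and update rule are hypothetical placeholders, not a proposal):

```python
def online_learning_loop(env, actor, critic, reward_fn, lr=0.1):
    """Toy continuous-online-learning loop. If the critic's value estimate
    misgeneralizes on a novel (out-of-distribution) state, the agent acts on
    it anyway, the reward function delivers the actual reward, and the critic
    gets nudged toward reality -- so that state stops being OOD, and that
    particular misgeneralization doesn't persist. (Irreversible actions and
    deliberate non-exploration are the exceptions, as noted above.)"""
    state = env.reset()
    while True:
        action = actor.choose(state, critic)   # pick whichever plan "seems best" per the critic
        next_state = env.step(action)
        r = reward_fn(next_state)              # ground truth from the (innate) reward function
        td_error = r + critic.value(next_state) - critic.value(state)
        critic.update(state, lr * td_error)    # value estimate corrected by experience
        state = next_state
```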
I think the post focuses way too much on specification gaming
I did mention in §2.5 that it’s theoretically possible for specification gaming and goal misgeneralization to cancel each other out, but claimed that this won’t happen for their proposal. If the authors had said “yeah of course specification gaming itself is unsolvable, but we’re going to do that cancel-each-other-out thing”, then of course I would have elaborated on that point more. I think the authors are making a more basic error so that’s what I’m focusing on.
“The Era of Experience” has an unsolved technical alignment problem
Yeah I said that to Matt Barnett 4 months ago here. For example, one man’s “avoiding conflict by reaching negotiated settlement” may be another man’s “acceding to extortion”. Evidently I did not convince him. Shrug.
OK, here’s my argument that, if you take {intelligence, understanding, consequentialism} as a unit, it’s sufficient for everything:
If durability and strength are helpful, then {intelligence, understanding, consequentialism} can discover that durability and strength are helpful, and then build durability and strength.
Even if “the exact ways in which durability and strength will be helpful” does not constitute a learnable pattern, “durability and strength will be helpful” is nevertheless a (higher-level) learnable pattern.
If some other evolved aspects of the brain and body are helpful, then {intelligence, understanding, consequentialism} can likewise discover that they are helpful, and build them.
After all, if ‘those things are helpful’ wasn’t a learnable pattern, then evolution would not have discovered and exploited that pattern!
If the number of such aspects is dozens or hundreds or thousands, then whatever, {intelligence, understanding, consequentialism} can still get to work systematically discovering them all. The recipe for a human is not infinitely complex.
If reducing heterogeneity is helpful, then {intelligence, understanding, consequentialism} can discover that fact, and figure out how to reduce heterogeneity.
Etc.
I kinda agree, but that’s more a sign that schools are bad at teaching things, than a sign that human brains are bad at flexibly applying knowledge. See my comment here.
See my other comment. I find it distressing that multiple people here are evidently treating acknowledgements as implying that the acknowledged person endorses the end product. I mean, it might or might not be true in this particular case, but the acknowledgement is no evidence either way.
(For my part, I’ve taken to using the formula “Thanks to [names] for critical comments on earlier drafts”, in an attempt to preempt this mistake. Not sure if it works.)
Chiang and Rajaniemi are on board
Let’s all keep in mind that the acknowledgement only says that Chiang and Rajaniemi had conversations with the author (Nielsen), and that Nielsen found those conversations helpful. For all we know, Chiang and Rajaniemi would strongly disagree with every word of this OP essay. If they’ve even read it.
Learning from strategies that stood the test of time would be tradition moreso than intelligence. I think tradition requires intelligence, but it also requires something else that’s less clear (and possibly not simple enough to be assembled manually, idk).
Right, that’s what I was gonna say. You need intelligence to sort out which traditions should be copied and which ones shouldn’t. There was a 13-billion-year “tradition” of not building e-commerce megastores, but Jeff Bezos ignored that “tradition”, and it worked out very well for him (and I’m happy about it too). Likewise, the Wright Brothers explicitly followed the “tradition” of how birds soar, but not the “tradition” of how birds flap their wings.
I do think there’s a “something else” (most [but not all] humans have an innate drive to follow and enforce social norms, more or less), but I don’t think it’s necessary. The Wright Brothers didn’t have any innate drive to copy anything about bird soaring tradition, but they did it anyway purely by intelligence.
Random street names aren’t necessarily important though?
I feel like I’ve lost the plot here. If you think there are things that are very important, but rare in the training data, and that LLMs consequently fail to learn, can you give an example?
Often the rare important things are very well known (after all, they are important, so people put a lot of effort into knowing them), they just can’t efficiently be derived from empirical data (except essentially by copying someone else’s conclusion blindly, and that leaves you vulnerable to deception).
I guess you’re using “empirical data” in a narrow sense. If Joe tells me X, I have gained “empirical data” that Joe told me X. And then I can apply my intelligence to interpret that “data”. For example, I can consider a number of hypotheses: the hypothesis that Joe is correct and honest, that Joe is mistaken but honest, that Joe is trying to deceive me, that Joe said Y but I misheard him, etc. And then I can gather or recall additional evidence that favors one of those hypotheses over another. I could ask Joe to repeat himself, to address the “I misheard him” hypothesis. I could consider how often I have found Joe to be mistaken about similar things in the past. I could ask myself whether Joe would benefit from deceiving me. Etc.
This is all the same process that I might apply to other kinds of “empirical data” like if my car was making a funny sound. I.e., consider possible generative hypotheses that would match the data, then try to narrow down via additional observations, and/or remain uncertain and prepare for multiple possibilities when I can’t figure it out. This is a middle road between “trusting people blindly” versus “ignoring everything that anyone tells you”, and it’s what reasonable people actually do. Doing that is just intelligence, not any particular innate human tendency—smart autistic people and smart allistic people and smart callous sociopaths etc. are all equally capable of traveling this middle road, i.e. applying intelligence towards the problem of learning things from what other people say.
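If it’s useful to see the “middle road” spelled out, here’s a bare-bones version of that hypothesis-comparison step as a Bayesian update (all the priors and likelihoods are invented numbers, purely for illustration):

```python
# Hypotheses about why Joe told me X (priors are made-up illustrative numbers)
priors = {
    "Joe is correct and honest": 0.70,
    "Joe is mistaken but honest": 0.20,
    "Joe is trying to deceive me": 0.05,
    "Joe said Y and I misheard him": 0.05,
}

# Likelihood of a new observation ("Joe repeated the same claim when asked")
# under each hypothesis -- again, invented numbers.
likelihoods = {
    "Joe is correct and honest": 0.95,
    "Joe is mistaken but honest": 0.95,
    "Joe is trying to deceive me": 0.90,
    "Joe said Y and I misheard him": 0.10,
}

# Bayes rule: posterior is proportional to prior times likelihood, then normalize.
unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnormalized.values())
posteriors = {h: p / total for h, p in unnormalized.items()}
print(posteriors)  # the "I misheard him" hypothesis gets mostly ruled out
```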
(For example, if I was having this conversation with almost anyone else, I would have quit, or not participated in the first place. But I happen to have prior knowledge that you-in-particular have unusual and well-thought-through ideas, and even when they’re wrong, they’re often wrong in very unusual and interesting ways, and that you don’t tend to troll, etc.)
I feel like I’m misunderstanding you somehow. You keep saying things that (to me) seem like you could equally well argue that humans cannot possibly survive in the modern world, but here we are. Do you have some positive theory of how humans survive and thrive in (and indeed create) historically-unprecedented heterogeneous environments?
If your model is underparameterized (which I think is true for the typical model?), then it can’t learn any pattern that only occurs once in the data. And even if the model is overparameterized, it still can’t learn any pattern that never occurs in the data.
Dunno if anything’s changed since 2023, but this says LLMs learn things they’ve seen exactly once in the data.
I can vouch that you can ask LLMs about things that are extraordinarily rare in the training data—I’d assume well under once per billion tokens—and they do pretty well. E.g. they know lots of random street names.
Humans successfully went to the moon, despite it being a quite different environment that they had never been in before. And they didn’t do that with “durability, strength, healing, intuition, tradition”, but rather with intelligence.
Speaking of which, one can apply intelligence towards the problem of being resilient to unknown unknowns, and one would come up with ideas like durability, healing, learning from strategies that have stood the test of time (when available), margins of error, backup systems, etc.
I think you’re conflating consequentialism and understanding in a weird-to-me way. (Or maybe I’m misunderstanding.)
I think consequentialism is related to choosing one action versus another action. I think understanding (e.g. predicting the consequence of an action) is different, and that in practice understanding has to involve self-supervised learning.
(I think human brains have both [partly-] consequentialist decisions and self-supervised updating of the world-model.) (They’re not totally independent, but rather they interact via training data: e.g. [partly-] consequentialist decision-making determines how you move your eyes, and then whatever your eyes are pointing at, your model of the visual world will then update by self-supervised learning on that particular data. But still, these are two systems that interact, not the same thing.)
I think self-supervised learning is perfectly capable of discovering rare but important patterns. Just look at today’s foundation models, which seem pretty great at that.
I’m not too interested in litigating what other people were saying in 2015, but OP is claiming (at least in the comments) that “RLHF’d foundation models seem to have common-sense human morality, including human-like moral reasoning and reflection” is evidence for “we’ve made progress on outer alignment”. If so, here are two different ways to flesh that out:
An RLHF’d foundation model acts as the judge / utility function; and some separate system comes up with plans that optimize it—a.k.a. “you just need to build a function maximizer that allows you to robustly maximize the utility function that you’ve specified”.
I think this plan fails because RLHF’d foundation models have adversarial examples today, and will continue to have adversarial examples into the future. (To be clear, humans have adversarial examples too, e.g. drugs & brainwashing.)
There is no “separate system”, but rather an RLHF’d foundation model (or something like it) is the whole system that we’re talking about here. For example, we may note that, if you hook up an RLHF’d foundation model to tools and actuators, then it will actually use those tools and actuators in accordance with common-sense morality etc.
(I think 2 is the main intuition driving the OP, and 1 was a comments-section derailment.)
As for 2:
I’m sympathetic to the argument that this system might not be dangerous, but I think its load-bearing ingredient is that pretraining leads to foundation models tending to do intelligent things primarily by emitting human-like outputs for human-like reasons, thanks in large part to self-supervised pretraining on internet text. Let’s call that “imitative learning”.
Indeed, here’s a 2018 post where Eliezer (as I read it) implies that he hadn’t really been thinking about imitative learning before (he calls it an “interesting-to-me idea”), and suggests that imitative learning might “bypass the usual dooms of reinforcement learning”. So I think there is a real update here—if you believe that imitative learning can scale to real AGI.
…But the pushback (from Rob and others) in the comments is mostly coming from a mindset where they don’t believe that imitative learning can scale to real AGI. I think the commenters are failing to articulate this mindset well, but I think they are in various places leaning on certain intuitions about how future AGI will work (e.g. beliefs deeply separate from goals), and these intuitions are incompatible with imitative learning being the primary source of optimization power in the AGI (as it is today, I claim).
(A moderate position is that imitative learning will be less and less relevant as e.g. o1-style RL post-training becomes a bigger relative contribution to the trained weights; this would presumably lead to increased future capabilities hand-in-hand with increased future risk of egregious scheming. For my part, I subscribe to the more radical theory that our situation is even worse than that: I think future powerful AGI will be built via a different AI training paradigm that basically throws out the imitative learning part altogether.)
I have a forthcoming post (hopefully) that will discuss this much more and better.
(IMO this is kinda unrelated to the OP, but I want to continue this thread.)
Have you elaborated on this anywhere?
Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” ;-)
I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?
I expect “the usual agent debugging loop” (§2.2) to keep working. If o3-type systems can learn that “winding up with the right math answer is good”, then they can learn “flagrantly lying and cheating are bad” in the same way. Both are readily-available feedback signals, right? So I think o3’s dishonesty is reflecting a minor problem in the training setup that the big AI companies will correct in the very near future without any new ideas, if they haven’t already. Right? Or am I missing something?
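To make that concrete, here’s the sort of readily-available training signal I have in mind; this is a made-up sketch, not a description of any actual lab’s setup, and all the names and weights are hypothetical:

```python
def episode_reward(got_right_answer: bool,
                   caught_cheating: bool,
                   caught_lying: bool) -> float:
    """Toy reward for an o3-style RL training episode. Both answer
    correctness and flagrant lying/cheating are readily-available feedback
    signals, so both can shape behavior via the usual agent debugging loop.
    Weights are illustrative placeholders."""
    r = 1.0 if got_right_answer else 0.0
    r -= 2.0 if caught_cheating else 0.0   # e.g. test-case tampering detected by a checker
    r -= 2.0 if caught_lying else 0.0      # e.g. claims to have run code it never ran
    return r
```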
That said, I also want to re-emphasize that both myself and Silver & Sutton are thinking of future advances that give RL a much more central and powerful role than it has even in o3-type systems.
(E.g. Silver & Sutton write: “…These approaches, while powerful, often bypassed core RL concepts: RLHF side-stepped the need for value functions by invoking human experts in place of machine-estimated values, strong priors from human data reduced the reliance on exploration, and reasoning in human-centric terms lessened the need for world models and temporal abstraction. However, it could be argued that the shift in paradigm has thrown out the baby with the bathwater…”)