I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, LinkedIn, and more at my website.
Steven Byrnes
Here’s a nice video of a teleoperated robot vacuuming, making coffee, cleaning a table, emptying the dishwasher, washing & drying & folding & hanging laundry, making a bed, etc. They also have a video where it cooks a meal. Pretty impressive! It’s surprising how much you can do without sensitive fingers!
Their website lists a bill of materials for the teleoperated robot of ≈$30K, or ≈$20K if hypothetically there were an AGI teleoperating it (because you wouldn’t need the teleoperation UI parts or the on-board laptop). It’s a one-off made by students.
Re SMTM: negative feedback on negative feedback
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways.
You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.
For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.
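For concreteness, here’s a minimal toy version of that kind of update (my own illustration of a generic prediction-error rule with made-up numbers, not a claim about the brain’s exact mechanism):

```python
# Toy model: the value function's guess for a thought gets nudged toward the
# primary reward that actually arrives, so a prediction of reward that never
# materializes gets unlearned. (The learning rate and values are made up.)

learning_rate = 0.3
value = {"turn on jazz": 1.0}   # learned from a history where jazz reliably felt good

def update(thought, actual_primary_reward):
    prediction_error = actual_primary_reward - value[thought]
    value[thought] += learning_rate * prediction_error

# While depressed, turning on jazz yields no primary reward:
for _ in range(5):
    update("turn on jazz", actual_primary_reward=0.0)

print(value["turn on jazz"])   # ≈0.17: the thought no longer seems worth acting on
```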
Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
I think your setup is weird for the context where people talk about utility functions, circular preferences, and money-pumping. “Having a utility function” is trivial unless the input to the utility function is something like “the state of the world at a certain time in the future”. So in that context, I think we should be imagining something like this:
And the stereotypical circular preference would be a cycle among “I want the world to be in State A at a particular future time T”, the same for State B, and the same for State C.
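For concreteness, here’s a toy money-pump sketch (my own illustration, with made-up states and a made-up fee, just to spell out why circular preferences over future world-states are exploitable):

```python
# The agent strictly prefers world-state A to B, B to C, and C to A (at time T),
# and will pay a small fee to trade its current trajectory for a preferred one.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means "prefers x over y"
fee = 1.0

def will_trade(current, offered):
    return (offered, current) in prefers         # accept iff the offer is strictly preferred

state, money_paid = "A", 0.0
for offered in ["C", "B", "A"] * 3:              # an adversary cycles through offers
    if will_trade(state, offered):
        state = offered
        money_paid += fee

print(state, money_paid)   # back at "A", but 9.0 poorer: a money pump
```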
I think you’re mixing up an MDP RL scenario with a consequentialist planning scenario? MDP RL agents can make decisions based on steering towards particular future states, but they don’t have to, and they often don’t, especially if the discount rate is high.
“AI agents that care about the state of the world in the future, and take actions accordingly, with great skill” are a very important category of AI to talk about because (1) such agents are super dangerous because of instrumental convergence, (2) people will probably make such agents anyway, because people care about the state of the world in the future, and also because such AIs are very powerful and impressive etc.
Yeah I don’t understand where Eliezer is coming from in that tweet (and he’s written similar things elsewhere).
I don’t know as much about algorithms as you; my argument centers instead on the complexity of the world.
Like, here’s a question: is “tires are usually black”, and stuff like that, encoded directly in the source code?
If yes, then (1) that’s unrealistic, and (2) even if it happened, the source code would wind up horrifically complicated and inscrutable, because it’s a complicated world.
If no, then the source code must be defining a learning algorithm of some sort, which in turn will figure out for itself that tires are usually black. Might this learning algorithm be simple and legible? Yes! But that was true of GPT-3 too, which Eliezer has always put in the inscrutable category.
So what is he talking about? He must have some alternative in mind, but I’m unsure what.
Thanks, that’s very helpful!
If we divide the inventing-ASI task into (A) “thinking about and writing algorithms” versus (B) “testing algorithms”, in the world of today there’s a clean division of labor where the humans do (A) and the computers do (B). But in your imagined October 2027 world, there’s fungibility between how much compute is being used on (A) versus (B). I guess I should interpret your “330K superhuman AI researcher copies thinking at 57x human speed” as what would happen if the compute hypothetically all went towards (A), none towards (B)? And really there’s gonna be some division of compute between (A) and (B), such that the amount of (A) is less than I claimed? …Or how are you thinking about that?
I’m curious about your definition for importantly useful AI actually. Under some interpretations I feel like current AI should cross that bar.
Right, but I’m positing a discontinuity between current AI and the next paradigm, and I was talking about the gap between when AI-of-that-next-paradigm is importantly useful versus when it’s ASI. For example, AI-of-that-next-paradigm might arguably already exist today but where it’s missing key pieces such that it barely works on toy models in obscure arxiv papers. Or here’s a more concrete example: Take the “RL agent” line of AI research (AlphaZero, MuZero, stuff like that), which is quite different from LLMs (e.g. “training environment” rather than “training data”, and there’s nothing quite like self-supervised pretraining (see here)). This line of research has led to great results on board games and videogames, but it’s more-or-less economically useless, and certainly useless for alignment research, societal resilience, capabilities research, etc. If it turns out that this line of research is actually much closer to how future ASI will work at a nuts-and-bolts level than LLMs are (for the sake of argument), then we have not yet crossed the “AI-of-that-next-paradigm is importantly useful” threshold in my sense.
If it helps, here’s a draft paragraph from that (hopefully) forthcoming post:
Another possible counter-argument from a prosaic-AGI person would be: “Maybe this future paradigm exists, but LLM agents will find it, not humans, so this is really part of that ‘AIs-doing-AI-R&D’ story like I’ve been saying”. I have two responses. First, I disagree with that prediction. Granted, probably LLMs will be a helpful research tool involved in finding the new paradigm, but there have always been helpful research tools, from PyTorch to arXiv to IDEs, and I don’t expect LLMs to be fundamentally different from those other helpful research tools. Second, even if it’s true that LLMs will discover the new paradigm by themselves (or almost by themselves), I’m just not sure I even care. I see the pre-paradigm-shift AI world as a lesser problem, one that LLM-focused AI alignment researchers (i.e. the vast majority of them) are already focusing on. Good luck to them. And I want to talk about what happens in the strange new world that we enter after that paradigm shift.
Next:
If so, that implies extremely fast takeoff, correct? Like on the order of days from AI that can do important things to full-blown superintelligence?
Well, even if you have an ML training plan that will yield ASI, you still need to run it, which isn’t instantaneous. I dunno, it’s something I’m still puzzling over.
…But yeah, many of my views are pretty retro, like a time capsule from AI alignment discourse circa 2009. ¯\_(ツ)_/¯
Have you seen the classic parody article “On the Impossibility of Supersized Machines”?
I think it’s possible to convey something pedagogically useful via those kinds of graphs. They can be misinterpreted, but this is true of many diagrams in life. I do dislike the “you are here” dot that Tim Urban added recently.
Are you suggesting that e.g. “R&D Person-Years 463205–463283 go towards ensuring that the AI has mastery of metallurgy, and R&D Person-Years 463283–463307 go towards ensuring that the AI has mastery of injection-molding machinery, and …”?
If no, then I don’t understand what “the world is complicated” has to do with “it takes a million person-years of R&D to build ASI”. Can you explain?
…Or if yes, that kind of picture seems to contradict the facts that:
This seems quite disanalogous to how LLMs are designed today (i.e., LLMs can already answer any textbook question about injection-molding machinery, but no human doing LLM R&D has ever worked specifically on LLM knowledge of injection-molding machinery),
This seems quite disanalogous to how the human brain was designed (i.e., humans are human-level at injection-molding machinery knowledge and operation, but Evolution designed human brains for the African Savannah, which lacked any injection-molding machinery).
Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.
I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so much effort? I’m curious how people are thinking about that, if they buy into this kind of picture. Thanks :)
(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing ASI, I think. Did I get that right?)
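Spelling out that arithmetic as a quick sanity check, using the numbers quoted above:

```python
copies = 330_000      # "330K superhuman AI researcher copies"
speedup = 57          # "thinking at 57x human speed"
months = 1            # October 2027 alone

person_years = copies * speedup * months / 12
print(f"{person_years:,.0f}")   # 1,567,500 -- i.e. roughly 1.6 million person-years
```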
(My own opinion, stated without justification, is that LLMs are not a paradigm that can scale to ASI, but after some future AI paradigm shift, there will be very very little R&D separating “this type of AI can do anything importantly useful at all” and “full-blown superintelligence”. Like maybe dozens or hundreds of person-years, or whatever, as opposed to millions. More on this in a (hopefully) forthcoming post.)
Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI
Imagine a dath ilani keeper who trained himself good heuristics for estimating expected utilities for what action to take or thought to think next, and reasons like that all the time. This keeper does not seem to me well-described as “using his model-based RL capabilities in the way we normally would expect”.
Why not? If he’s using such-and-such heuristic, then presumably that heuristic is motivating to him—assigned a positive value by the value function. And it’s assigned a positive value by the value function because of the past history of primary rewards etc.
the impressive long-term optimization happens mainly through expected utility guesses the world model makes
The candy example involves good long-term planning, right? But not explicit guesses of expected utility.
…But sure, it is possible for somebody’s world-model to have a “I will have high expected utility” concept, and for that concept to wind up with high valence, in which case the person will do things consistent with (their explicit beliefs about) getting high utility (at least other things equal and when they’re thinking about it).
But then I object to your suggestion (IIUC) that what constitutes “high utility” is not strongly and directly grounded by primary rewards.
For example, if I simply declare that “my utility” is equal by definition to the fraction of shirts on Earth that have an odd number of buttons (as an example of some random thing with no connection to my primary rewards), then my value function won’t assign a positive value to the “my utility” concept. So it won’t feel motivating. The idea of “increasing my utility” will feel like a dumb pointless idea to me, and so I won’t wind up doing it.
But the decision of what plan I end up pursuing doesn’t depend on the value function.
The world-model does the “is” stuff, which in this case includes the fact that planA causes a higher expected reduction in pdoom than planB. The value function (and reward function) does the “ought” stuff, which in this case includes the notion that low pdoom is good and high pdoom is bad, as opposed to the other way around.
(Sorry if I’m misunderstanding, here or elsewhere.)
Hmm, I guess my main cause for skepticism is that I think the setup would get subverted somehow—e.g. either the debaters, or the “human simulator”, or all three in collusion, will convince the human to let them out of the box. In your classification, I guess this would be a “high-stakes context”, which I know isn’t your main focus. You talk about it a bit, but I’m unconvinced by what you wrote (if I understood it correctly) and don’t immediately see any promising directions.
Secondarily, I find it kinda hard to believe that two superhuman debaters would converge to “successfully conveying subtle hard-to-grasp truths about the alignment problem to the judge” rather than converging to “manipulation tug-of-war on the judge”.
Probably at least part of the difference / crux between us is that, compared to most people, I tend to assume that there isn’t much of a stable, usable window between “AI that’s competent enough to really help” and “AI that’s radically superhuman”, and I know that you’re explicitly assuming “not extremely superhuman”. (And that in turn is probably at least partly related to the fact that you’re thinking about LLMs and I’m thinking about other AI paradigms.) So maybe this comment isn’t too helpful, oh well.
I think chains of subgoals can potentially be very long, and I don’t think we keep the whole chain in mind to get the positive valence of a thought, so we somehow need a shortcut.
We can have hierarchical concepts. So you can think “I’m following the instructions” in the moment, instead of explicitly thinking “I’m gonna do Step 1 then Step 2 then Step 3 then Step 4 then …”. But they cash out as the same thing.
E.g. when I do some work, I think I usually don’t partially imagine the high-valence outcome of filling the galaxies with happy people living interesting lives, which I think is the main reason why I am doing the work I do (although there are intermediate outcomes that also have some valence).
No offense but unless you have a very unusual personality, your immediate motivations while doing that work are probably mainly social rather than long-term-consequentialist. On a small scale, consequentialist motivations are pretty normal (e.g. walking up the stairs to get your sweater because you’re cold). But long-term-consequentialist actions and motivations are rare in the human world.
Normally people do things because they’re socially regarded as good things to do, not because they have good long-term consequences. Like, if you see someone save money to buy a car, a decent guess is that the whole chain of actions, every step of it, is something that they see as socially desirable. So during the first part, where they’re saving money but haven’t yet bought the car, they’d be proud to tell their friends and role models “I’m saving money—y’know I’m gonna buy a car!”. Saving the money is not a cost with a later benefit. Rather, the benefit is immediate. They don’t even need to be explicitly thinking about the social aspects, I think; once the association is there, just doing the thing feels intrinsically motivating—a primary reward, not a means to an end.
Doing the first step of a long-term plan, without social approval for that first step, is so rare that people generally regard it as highly suspicious. Just look at Earning To Give (EtG) in Effective Altruism, the idea of getting a high-paying job in order to have money and give it to charity. Go tell a normal non-quantitative person about EtG and they’ll assume it’s an obvious lie, and/or that the person is a psycho. That’s how weird it is—it doesn’t even cross most people’s minds that someone is actually doing a socially-weird plan because of its expected long-term consequences, unless the person is Machiavellian or something.
Speaking of which, there’s a fiction trope that basically only villains are allowed to make plans and display intelligence. The way to write a hero in (non-rationalist) fiction is to have conflicts between doing things that have strong immediate social approval, versus doing things for other reasons (e.g. fear, hunger, logic(!)), and the former wins out over the latter.
To be clear, I’m not accusing you of failing to do things with good long-term consequences because they have good long-term consequences. Rather, I would suggest that the pathway is that your brain has settled on the idea that working towards good long-term outcomes is socially good, e.g. the kind of thing that your role models would be happy to hear about. So then you get the immediate intrinsic motivation by doing that kind of work, and yet it’s also true that you’re sincerely working towards consequences that are (hopefully) good. And then some more narrow projects towards that end can also wind up feeling socially good (and hence become intrinsically rewarding, even without explicitly holding their long-term consequences in mind), etc.
the value function might learn this because it enables the agent to accomplish difficult long-range tasks which yield reward
I don’t think this is necessary per above, but I also don’t think it’s realistic. The value function updating rule is something like TD learning, a simple equation / mechanism, not an intelligent force with foresight. (Or sorry if I’m misunderstanding. I didn’t really follow this part or the rest of your comment :( But I can try again if it’s important.)
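(For concreteness, the textbook TD(0) update is just

$$\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t,$$

where every quantity is locally available at the moment of the update; nothing in the rule looks ahead or optimizes for future task success. The brain’s actual rule is presumably messier, but I expect it to have that same mechanical, non-foresighted character.)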
Thanks!
This point on why alignment is harder than David Silver and Richard Sutton think also applies to the specification for capabilities you made: …
I interpret you as saying: “In the OP, Steve blathered on and on about how it’s hard to ensure that the AI has some specific goal like ‘support human well-being’. But now Steve is saying it’s straightforward to ensure that the AI has a goal of earning $1B without funny business. What accounts for that difference?”
(Right?) Basically, my proposal here (with e.g. a reward button) is a lot like giving the AI a few hits of an addictive drug and then saying “you can have more hits, but I’m the only one who knows how to make the drug, you must do as I say for the next hit”.
This kind of technique is:
Very obvious and easy to implement in an RL agent context,
Adequate for getting my AI to make me lots of money (because I can see it in my bank account and have a waiting period and due diligence as above),
Inadequate for getting AI to do alignment research (because the AI would ultimately care about producing outputs that convince me, rather than outputs that actually solve the problem, and we have abundant proof that humans can be convinced by incorrect arguments about alignment, otherwise the field wouldn’t have such long-running disagreements) (i.e. I’m taking John Wentworth’s side here),
Inadequate for solving the alignment problem by itself (because the AI will eventually get sufficiently powerful to brainwash me, kidnap my children, etc.).
A large crux here is that the task, assuming it was carried out in a way such that the AI could seize the reward button (as is likely to be true for realistic tasks and realistic capabilities levels)
Well yeah if the AI can seize the reward button but chooses not to, that’s obviously reason for optimism.
I was talking instead about a scenario where the AI can’t seize it, and where nobody knows how to make it such that the AI doesn’t want to seize it.
Maybe you think it’s implausible that the AI would be capable of earning $1B before being capable of seizing the reward button. If so, fine, whatever, just substitute a less ambitious goal than earning $1B. Or alternatively, imagine that the reward button is unusually secure, e.g. it’s implemented as ‘cryptographic reward tokens’ stored in an air-gapped underground bunker with security cameras etc. (Cf. some discussion in Superintelligence (2014)). This doesn’t work forever but this would be a way to delay the inevitable catastrophe, allowing more money to be made in the meantime.
If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating what abstract plans are actually likely to work (since the agent hasn’t yet tried a lot of similar abstract plans from where it could’ve observed results, and the world model’s prediction making capabilities generalize further).
Here’s an example. Suppose I think: “I’m gonna pick the cabinet lock and then eat the candy inside”. The world model / thought generator is in charge of the “is” / plausibility part of this plan (but not the “ought” / desirability part): “if I do this plan, then I will almost definitely wind up eating candy”, versus “if I do this plan, then it probably won’t work, and I won’t eat candy anytime soon”. This is a prediction, and it’s constrained by my understanding of the world, as encoded in the thought generator. For example, if I don’t expect the plan to succeed, I can’t will myself to expect the plan to succeed, any more than I can will myself to sincerely believe that I’m scuba diving right now as I write this sentence.
Remember, the eating-candy is an essential part of the thought. “I’m going to break open the cabinet and eat the candy”. No way am I going to go to all that effort if the concept of eating candy at the end is not present in my mind.
Anyway, if I actually expect that such-and-such plan will lead to me eating candy with near-certainty in the immediate future, then the “me eating candy” concept will be strongly active when I think about the plan; conversely, if I don’t actually expect it to work, or expect it to take 6 hours, then the “me eating candy” concept will be more weakly active. (See image here.)
Meanwhile, the value function is figuring out if this is a good plan or not. But it doesn’t need to assess plausibility—the thought generator already did that. Instead, it’s much simpler: the value function has a positive coefficient on the “me eating candy” concept, because that concept has reliably predicted primary rewards in the past.
So if we combine the value function (linear functional with a big positive coefficient relating “me eating candy” concept activation to the resulting valence-guess) with the thought generator (strong activation of “me eating candy” when I’m actually expecting it to happen, especially soon), then we’re done! We automatically get plausible and immediate candy-eating plans getting a lot of valence / motivational force, while implausible, distant, and abstract candy-eating plans don’t feel so motivating.
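Here’s a minimal toy version of that picture (my own illustration with made-up concepts and coefficients; I’m not claiming the real value function is literally a two-term linear readout):

```python
# The thought generator supplies concept activations for the current thought (the
# "is" part); the value function is a learned linear readout over them (the "ought" part).

value_coefficients = {"me eating candy": 2.0, "lockpicking effort": -0.3}

def valence(concept_activations):
    return sum(value_coefficients.get(c, 0.0) * a for c, a in concept_activations.items())

# Plausible, immediate plan: the "me eating candy" concept is strongly active.
plausible_plan = {"me eating candy": 0.9, "lockpicking effort": 1.0}

# Implausible or far-off plan: the same concept is only weakly active.
longshot_plan = {"me eating candy": 0.1, "lockpicking effort": 1.0}

print(valence(plausible_plan))   # ≈ 1.5  -> feels motivating
print(valence(longshot_plan))    # ≈ -0.1 -> doesn't feel worth the effort
```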
Does that help? (I started writing a response to the rest of what you wrote, but maybe it’s better if I pause there and see what you think.)
Anecdata: I quit caffeine for the past three months (previously I was drinking 1 cup of coffee in the morning and sometimes an iced tea with lunch). My body reacts to caffeine in an unusual way—very very sensitive, e.g. I’ll get withdrawal headaches if I drink an iced tea with lunch on one day but skip it the next day—so this probably won’t generalize. But FWIW, I don’t notice much change overall, but the one exception is that I used to often get mentally tired in the late morning or early afternoon, whereas now I … still do, but the mental tiredness now happens in conjunction with sleepiness, so I can take a 15-minute power-nap and then I feel better. I think my ability to do intense deep work in the afternoon is somewhat better now overall because of that.
The downsides are that I liked my morning coffee, and still miss it, and also I had withdrawal headaches for the first day or two, and I had cravings for the first week or two, like a surprisingly strong feeling that something was missing in my life that would be solved by drinking a coffee. That part really caught me off-guard, I had never felt that way before. (I’ve never been much of a substance user obviously!)
Last week I used the Google Drive attachment option in Google AI Studio, with Gemini 2.5 Pro, and asked it for copyediting things (typos, confusing wording, unexplained acronyms, etc.), then for things that could be deleted, then for things that could be added, then for things that seemed possibly incorrect. It was good! I mean, most of the suggestions were bad, but enough were good that it was worth the time.
(Sorry if you’re talking about something else.)
“almost everything in the world is solvable via (1) Human A wants it solved, (2) Agent B is motivated by the prospect of Human A pressing the reward button on Agent B if things turn out well, (3) Human A is somewhat careful not to press the button until they’re quite sure that things have indeed turned out well, (4) Agent B is able to make and execute long-term plans”.
In particular, every aspect of automating the economy is solvable that way—for example (I was just writing this in a different thread), suppose I have a reward button, and tell an AI:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
And let’s assume the AI is purely motivated by the reward button, but not yet capable of brainwashing me or stealing my button. (I guess that’s rather implausible if it can already autonomously make $1B, but maybe we’re good at Option Control, or else substitute a less ambitious project like making a successful app or whatever.) And assume that I have no particular skill at “good evaluation” of AI outputs. I only know enough to hire competent lawyers and accountants for pretty basic due diligence, and it helps that I’m allowing an extra year for law enforcement or public outcry or whatever to surface any subtle or sneaky problems caused by my AI.
So that’s a way to automate the economy and make trillions of dollars (until catastrophic takeover) without making any progress on the “need for good evaluation” problem of §6.1. Right?
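If it helps, here’s the human side of that protocol as a toy decision rule (my own sketch; the constants and function names are made up for concreteness):

```python
from datetime import date, timedelta

TARGET = 1_000_000_000                       # successfully withdraw $1B
DUE_DILIGENCE_WINDOW = timedelta(days=365)   # wait a year before pressing the button

def should_press_reward_button(amount_withdrawn, withdrawal_date, funny_business_found):
    """Press only if the money actually arrived, the full waiting period has elapsed,
    and due diligence (lawyers, accountants, law enforcement, public outcry) turned up
    no law-breaking or other funny business."""
    waited = date.today() >= withdrawal_date + DUE_DILIGENCE_WINDOW
    return amount_withdrawn >= TARGET and waited and not funny_business_found
```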
And I don’t buy your counterargument that the AI will fail at the “make $1B” project above (“trying to train on these very long-horizon reward signals poses a number of distinctive challenges…”) because e.g. that same argument would also “prove” that no human could possibly decide that they want to make $1B, and succeed. I think you’re thinking about RL too narrowly—but we can talk about that separately.
Thanks! …But I think you misunderstood.
Suppose I tell an AI:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
That’s full 100% automation, not 90%, right?
“Making $1B” is one example project for concreteness, but the same idea could apply to writing code, inventing technology, or whatever else. If the human can’t ever tell whether the AI succeeded or failed at the project, then that’s a very unusual project, and certainly not a project that results in making money or impressing investors etc. And if the human can tell, then they can press the reward button when they’re sure.
Normal people can tell that their umbrella keeps them dry without knowing anything about umbrella production. Normal people can tell whether their smartphone apps are working well without knowing anything about app development and debugging. Etc.
And then I’m claiming that this kind of strategy will “work” until the AI is sufficiently competent to grab the reward button and start building defenses around it etc.
Thanks!
What’s the difference between “utilities over outcomes/states” and “utilities over worlds/universes”?
(I didn’t use the word “outcome” in my OP, and I normally take “state” to be shorthand for “state of the world/universe”.)
I didn’t intend for the word “consequentialist” to imply CDT, if that’s what you’re thinking. I intended “consequentialist” to describe what an agent’s final preferences are (i.e., they’re preferences about the state of the world/universe in the future, and more centrally the distant future), whereas I think decision theory is about how to make decisions given one’s final preferences.
I wrote: “For example, pause for a second and think about the human concept of “going to the football game”. It’s a big bundle of associations containing immediate actions, and future actions, and semantic context, and expectations of what will happen while we’re doing it, and expectations of what will result after we finish doing it, etc. We humans are perfectly capable of pattern-matching to these kinds of time-extended concepts, and I happen to expect that future AGIs will be as well.”
I think this is true, important, and quite foreign compared to the formalisms that I’ve seen from MIRI or Stuart Armstrong.
Can I offer a mathematical theory in which the mental concept of “going to the football game” comfortably sits? No. And I wouldn’t publish it even if I could, because of infohazards. But it’s obviously possible because human brains can do it.
I think “I am being helpful and non-manipulative”, or “the humans remain in control” could be a mental concept that a future AI might have, and pattern-match to, just as “going to the football game” is for me. And if so, we could potentially set things up such that the AI finds things-that-pattern-match-to-that-concept to be intrinsically motivating. Again, it’s a research direction, not a concrete plan. But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI. For what it’s worth, I think I’m somewhat more skeptical of this research direction now than when I wrote that 2 years ago, more on which in a (hopefully) forthcoming post.
MIRI was trying to formalize things, and nobody knows how to formalize what I’m talking about, involving pattern-matching to fuzzy time-extended learned concepts, per above.
Anyway, I’m reluctant to get in an endless argument about what other people (who are not part of this conversation) believe. But FWIW, The Problem of Fully Updated Deference is IMO a nice illustration of how Eliezer has tended to assume that ASIs will have preferences purely about the state of the world in the future. And also everything else he said and wrote especially in the 2010s, e.g. the one I cited in the post. He doesn’t always say it out loud, but if he’s not making that assumption, almost everything he says in that post is trivially false. Right? You can also read this 2018 comment where (among other things) he argues more generally that “corrigibility is anti-natural”; my read of it is that he has that kind of preference structure at the back of his mind, although he doesn’t state that explicitly and there are other things going on too (e.g. as usual I think Eliezer is over-anchored on the evolution analogy.)