I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. [...]
I think that alignment “theorizing” is often a bunch of philosophizing and vibing in a way that protects itself from falsification (or even proof-of-work) via words like “pre-paradigmatic” and “deconfusion.” I think it’s not a coincidence that many of the “canonical alignment ideas” somehow don’t make any testable predictions until AI takeoff has begun.
This sentiment resonates strongly with me.
A personal background: I remember getting pretty heavily involved in AI alignment discussions on LessWrong in 2019. Back then I think there were a lot of assumptions people had about what “the problem” was that are, these days, often forgotten, brushed aside, or sometimes even deliberately minimized post-hoc in order to give the impression that the field has a better track record than it actually does. [ETA: but to be clear, I don’t mean to say everyone made the same mistake I describe here]
This has been a bit shocking and disorienting to me, honestly, because at the time in 2019 I didn’t get the strong impression that people were deliberately constructing unfalsifiable models of the problem. I had the vague impression that people had relatively firm views that made predictions about the world prior to superintelligence, and that these views were open to revision upon new evidence. And that naiveté led me to experience the last few years of evidence a bit differently than I think some other people.
To give a cursory taste of what I’m talking about, we can consider what I think of as a representative blog post from the genre of pre-2020 alignment content: Rohin Shah and Dmitrii Krasheninnikov’s “Learning Preferences by Looking at the World”. This is by no means a cherry-picked example either, in my opinion (and I don’t mean to criticize Rohin specifically, I’m just using this blog post as a representative example of what people talked about at the time). In the blog post, they state,
It would be great if we could all have household robots do our chores for us. Chores are tasks that we want done to make our houses cater more to our preferences; they are a way in which we want our house to be different from the way it currently is. However, most “different” states are not very desirable:
Surely our robot wouldn’t be so dumb as to go around breaking stuff when we ask it to clean our house? Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out. Generally, it is easy to get the reward wrong by forgetting to include preferences for things that should stay the same, since we are so used to having these preferences satisfied, and there are so many of them.
[...]
Note that we’re not talking about problems with robustness and distributional shift: while those problems are worth tackling, the point is that even if we achieve robustness, the simple reward function still incentivizes the above unwanted behaviors.
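To make the misspecification pattern they describe concrete, here is a minimal toy sketch of my own (the RoomState fields and the penalty weight are invented for illustration, not taken from their post): a reward that only scores the feature we remembered to specify is indifferent to the side effects we left out.

```python
from dataclasses import dataclass

@dataclass
class RoomState:
    dirt_remaining: int
    vases_intact: int

def naive_reward(prev: RoomState, curr: RoomState) -> float:
    # Scores only the feature we specified (cleaning progress); broken vases are invisible to it.
    return float(prev.dirt_remaining - curr.dirt_remaining)

def intended_reward(prev: RoomState, curr: RoomState) -> float:
    # What we actually wanted: cleaning progress minus a penalty for the side effects we forgot to write down.
    broken = prev.vases_intact - curr.vases_intact
    return float(prev.dirt_remaining - curr.dirt_remaining) - 10.0 * broken

before = RoomState(dirt_remaining=5, vases_intact=3)
careful = RoomState(dirt_remaining=0, vases_intact=3)
reckless = RoomState(dirt_remaining=0, vases_intact=0)

# Under the naive reward both trajectories score identically, so an agent optimizing it
# has no reason to prefer the careful one; the intended reward separates them.
print(naive_reward(before, careful), naive_reward(before, reckless))        # 5.0 5.0
print(intended_reward(before, careful), intended_reward(before, reckless))  # 5.0 -25.0
```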
Suppose in 2024-2029, someone constructs an intelligent robot that is able to clean a room to a high level of satisfaction, consistent with the user’s intentions, without any major negative side effects or general issues of misspecification. It doesn’t break any vases while cleaning. It respects all basic moral norms you can think of. It lets you shut it down whenever you want. And it actually does its job of cleaning the room in a reasonable amount of time.
If that were to happen, I think an extremely natural reading of the situation is that a substantial part of what we thought “the problem” was in value alignment has been solved, from the perspective of this blog post from 2019. That is cause for an updating of our models, and a verbal recognition that our models have updated in this way.
Yet, that’s not how I think everyone on LessWrong would react to the development of such a robot. My impression is that a large fraction, perhaps a majority, of LessWrongers would not share my interpretation here, despite the plain language in the post explaining what they thought the problem was. Instead, I imagine many people would respond to this argument basically saying the following:
“We never thought that was the hard bit of the problem. We always thought it would be easy to get a human-level robot to follow instructions reliably, do what users intend without major negative side effects, follow moral constraints including letting you shut it down, and respond appropriately given unusual moral dilemmas. The idea that we thought that was ever the problem is a misreading of what we wrote. The problem was always purely that alignment issues would arise after we far surpassed human intelligence, at which point entirely novel problems will arise.”
But the blog post said there was a problem, gave an example of the problem manifesting, and then spent the rest of the post trying to come up with solutions. The authors gave no indication that this particular problem was trivial, or that the example used was purely illustrative and had nothing to do with the type of real-world issues that might arise if we fail to solve the problem. If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!?
To clarify, I’m in full agreement with anyone who simply says that the alignment problem looks like it might still be hard, based on different arguments than the one presented in this blog post. There were a lot of arguments people gave back then, and some of the older arguments still look correct. Perhaps most significantly, robustness to distribution shifts still looks reasonably hard as a problem. But the blog post I cited explicitly said “Note that we’re not talking about problems with robustness and distributional shift”!
At this point I think there are a number of potential replies from people who still insist that the LW models of AI alignment were never wrong, which I (depending on the speaker) think can often border on gaslighting:
“Rohin’s point wasn’t that this problem would be hard. He was using it as a mere stepping stone to explain much harder problems of misspecification that were, at that time, purely theoretical.”
Then why did the paper explicitly say, “for many real-world tasks it can be challenging to specify a reward function that captures human preferences, particularly the preference for avoiding unnecessary side effects while still accomplishing the goal (Amodei et al., 2016).”
Why is it so hard to find people explicitly saying that this specific problem, and the examples illustrating it, were not meant to be seriously representative of the hard parts of alignment at the time?
Isn’t it still pretty valuable to point out that we’re solving stepping stones on the path towards the ‘real problem’?
“Sure, Rohin thought that was a major problem, but we [our organization/thought cluster/ideological group] never agreed with him.”
Oh really? Did you ever explicitly highlight this particular disagreement at the time? He wasn’t exactly a minor researcher at the time. And this blog post is only one of a number of blog posts expressing essentially an identical sentiment.
[ETA: to be fair, I do think there were some people who did genuinely disagree with Rohin’s framing. I don’t mean to accuse everyone of making the same error.]
“Yes, this particular part of the alignment problem looks easier than we thought, but serious people always thought that this was going to be one of the easiest subproblems, compared to other things. This problem was considered a very minor sub-problem of value alignment that merited like 1% of researcher-hours.”
Then why is it so easy to find countless blog posts of a similar nature from alignment researchers at the time, presenting pretty much the same problem and then presenting an attempt to solve it? Did all those people simply knowingly work on one of the easiest sub-problems of alignment?
I wrote a fair amount about alignment from 2014-2020[1] which you can read here. So it’s relatively easy to get a sense for what I believed.
Here are some summary notes about my views as reflected in that writing, though I’d encourage you to just judge for yourself[2] by browsing the archives:
I expected AI systems to be pretty good at predicting what behaviors humans would rate highly, long before they were catastrophically risky. This comes up over and over again in my writing. In particular, I repeatedly stated that it was very unlikely that an AI system would kill everyone because it didn’t understand that people would disapprove of that action, and therefore this was not the main source of takeover concerns. (By 2017 I expected RLHF to work pretty well with language models, which was reflected in my research prioritization choices and discussions within OpenAI though not clearly in my public writing.)
I consistently expressed that my main concerns were instead about (i) systems that were too smart for humans to understand the actions they proposed, (ii) treacherous turns from deceptive alignment. This comes up a lot, and when I talk about other problems I’m usually clear that they are prerequisites that we should expect to succeed. E.g., see “an unaligned benchmark.” I don’t think this position was an extreme outlier; my impression at the time was that other researchers had broadly similar views.
I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I’ve updated about this and definitely acknowledge I was wrong.[3] I don’t think it totally changes the picture though: I’m still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
In 2016 I pointed out that ML systems being misaligned on adversarial inputs and exploitable by adversaries was likely to be the first indicator of serious problems, and therefore that researchers in alignment should probably embrace a security framing and motivation for their research.
I expected LM agents to work well (see this 2015 post). Comparing this post to the world of 2023 I think my biggest mistake was overestimating the importance of task decomposition vs just putting everything in a single in-context chain of thought. These updates overall make crazy amplification schemes seem harder (and to require much smarter models than I originally expected, if they even make sense at all) but at the same time less necessary (since chain of thought works fine for capability amplification for longer than I would have expected).
I overall think that I come out looking somewhat better than other researchers working in AI alignment, though again I don’t think my views were extreme outliers (and during this period I was often pointed to as a sensible representative of fairly hardcore and traditional alignment concerns).
Like you, I am somewhat frustrated that e.g. Eliezer has not really acknowledged how different 2023 looks from the picture that someone would take away from his writing. I think he’s right about lots of dynamics that would become relevant for a sufficiently powerful system, but at this point it’s pretty clear that he was overconfident about what would happen when (and IMO is still very overconfident in a way that is directly relevant to alignment difficulty). The most obvious one is that ML systems have made way more progress towards being useful R&D assistants way earlier than you would expect if you read Eliezer’s writing and took it seriously. By all appearances he didn’t even expect AI systems to be able to talk before they started exhibiting potentially catastrophic misalignment.
I think my opinions about AI and alignment were much worse from 2012-2014, but I did explicitly update and acknowledge many mistakes from that period (though some of it was also methodological issues, e.g. I believe that “think about a utility function that’s safe to optimize” was a useful exercise for me even though by 2015 I no longer thought it had much direct relevance).
I’d also welcome readers to pull out posts or quotes that seem to indicate the kind of misprediction you are talking about. I might either acknowledge those (and I do expect my historical reading is very biased for obvious reasons), or I might push back against them as a misreading and explain why I think that.
That said, in fall 2018 I made and shared some forecasts which were the most serious forecasts I made from 2016-2020. I just looked at those again to check my views. I gave a 7.5% chance of TAI by 2028 using short-horizon RL (over a <5k word horizon using human feedback or cheap proxies rather than long-term outcomes), and a 7.5% chance that by 2028 we would be able to train smart enough models to be transformative using short-horizon optimization but be limited by engineering challenges of training and integrating AI systems into R&D workflows (resulting in TAI over the following 5-10 years). So when I actually look at my probability distributions here I think they were pretty reasonable. I updated in favor of alignment being easier because of the relative unimportance of long-horizon RL, but the success of imitation learning and short-horizon RL was still a possibility I was taking very seriously and overall probably assigned higher probability to than almost anyone in ML.
I overall think that I come out looking somewhat better than other researchers working in AI alignment, though again I don’t think my views were extreme outliers
I agree, your past views do look somewhat better. I painted alignment researchers with a fairly broad brush in my original comment, which admittedly might have been unfair to many people who departed from the standard arguments (alternatively, it gives those researchers a chance to step up and receive credit for having been in the minority who weren’t wrong). Partly I portrayed the situation like this because I have the sense that the crucial elements of your worldview that led you to be more optimistic were not disseminated anywhere close to as widely as the opposite views (e.g. “complexity of wishes”-type arguments), at least on LessWrong, which is where I was having most of these discussions.
My general impression is that it sounds like you agree with my overall take although you think I might have come off too strong. Perhaps let me know if I’m wrong about that impression.
When I joined AI safety in late 2017 (having read approximately nothing in the field), I thought of the problem as “construct a utility function for an AI system to optimize”, with a key challenge being the fragility of value. In hindsight this was clearly wrong.
The Value Learning sequence was in large part a result of my journey away from the utility function framing.
That being said, I suspect I continued to think that fragility-of-value type issues were a significant problem, probably until around mid-2019 (see next point).
(I did continue some projects more motivated from a fragility-of-value perspective, partly out of a heuristic of actually finishing things I start, and partly because I needed to write a PhD thesis.)
Early on, I thought of generalization as a key issue for deep learning and expected that vanilla deep learning would not lead to AGI for this reason. Again, in hindsight this was clearly wrong.
I was extremely surprised by OpenAI Five in June 2018 (not just that it worked, but also the ridiculous simplicity of the methods, in particular the lack of any hierarchical RL) and had to think through that.
I spent a while trying to understand that (at least months, e.g. you can see me continuing to be skeptical of deep learning in this Dec 2018 post).
I think I ended up close to my current views around early-to-mid-2019, e.g. I still broadly agree with the things I said in this August 2019 conversation (though I’ll note I was using “mesa optimizer” differently than it is used today—I agree with what I meant in that conversation, though I’d say it differently today).
I think by this point I was probably less worried about fragility of value. E.g. in that conversation I say a bunch of stuff that implies it’s less of a problem, most notably that AI systems will likely learn similar features as humans just from gradient descent, for reasons that LW would now call “natural abstractions”.
Note that this comment is presenting the situation as a lot cleaner than it actually was. I would bet there were many ways in which I was irrational / inconsistent, probably some times where I would have expressed verbally that fragility of value wasn’t a big deal but would still have defended research projects centered around it from some other perspective, etc.
Some thoughts on how to update based on past things I wrote:
I don’t think I’ve ever thought of myself as largely agreeing with LW: my relationship to LW has usually been “wow, they seem to be getting some obvious stuff wrong” (e.g. I was persuaded of slow takeoff basically when Paul’s post and AI Impacts’ post came out in Feb 2018, the Value Learning sequence in late 2018 was primarily in response to my perception that LW was way too anchored on the “construct a utility function” framing).
I think you don’t want to update too hard on the things that were said on blog posts addressed to an ML audience, or in papers that were submitted to conferences. Especially for the papers there’s just a lot of random stuff you couldn’t say about why you’re doing the work because then peer reviewers will object (e.g. I heard second hand of a particularly egregious review to the effect of: “this work is technically solid, but the motivation is AGI safety; I don’t believe in AGI so this paper should be rejected”).
This matches my sense of how a lot of people seem to have… noticed that GPT-4 is fairly well aligned to what the OpenAI team wants it to be, in ways that Yudkowsky et al. said would be very hard, and still not view this as, at a minimum, a positive sign?
I.e., problems of the class ‘I told the intelligence to get my mother out of the burning building and it blew her up so the dead body flew out the window, this is because I wasn’t actually specific enough’ just don’t seem like they are a major worry anymore?
Usually when GPT-4 doesn’t understand what I’m asking, I wouldn’t be surprised if a human was confused also.
Suppose in 2024-2029, someone constructs an intelligent robot that is able to clean a room to a high level of satisfaction, consistent with the user’s intentions, without any major negative side effects or general issues of misspecification. It doesn’t break any vases while cleaning.
I remember explicit discussion about how solving this problem shouldn’t even count as part of solving long-term / existential safety, for example:
“What I understand this as saying is that the approach is helpful for aligning housecleaning robots (using near extrapolations of current RL), but not obviously helpful for aligning superintelligence, and likely stops being helpful somewhere between the two. [...] There is a risk that a large body of safety literature which works for preventing today’s systems from breaking vases but which fails badly for very intelligent systems actually worsens the AI safety problem” https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting?commentId=rK9K3JebKDofvJA3x
Why is it so hard to find people explicitly saying that this specific problem, and the examples illustrating it, were not meant to be seriously representative of the hard parts of alignment at the time?
See also The Main Sources of AI Risk? where this problem was only part of one bullet point, out of 30 (as of 2019, 35 now).
I remember explicit discussion about how solving this problem shouldn’t even count as part of solving long-term / existential safety, for example:
Two points:
I have a slightly different interpretation of the comment you linked to, which makes me think it provides only weak evidence for your claim. (Though it’s definitely still some evidence.)
I agree some people deserve credit for noticing that human-level value specification might be kind of easy before LLMs. I don’t mean to accuse everyone in the field of making the same mistake.
Anyway, let me explain the first point.
I interpret Abram to be saying that we should focus on solutions that scale to superintelligence, rather than solutions that only work on sub-superintelligent systems but break down at superintelligence. This was in response to Alex’s claim that “whitelisting contributes meaningfully to short- to mid-term AI safety, although I remain skeptical of its robustness to scale.”
In other words, Alex said (roughly): “This solution seems to work for sub-superintelligent AI, but might not work for superintelligent AI.” Abram said in response that we should push against such solutions, since we want solutions that scale all the way to superintelligence. This is not the same thing as saying that any solution to the house-cleaning robot provides negligible evidence of progress, because some solutions might scale.
It’s definitely arguable, but I think it’s likely that any realistic solution to the human-level house cleaning robot problem—in the strong sense of getting a robot to genuinely follow all relevant moral constraints, allow you to shut it down, and perform its job reliably in a wide variety of environments—will be a solution that scales reasonably well above human intelligence (maybe not all the way to radical superintelligence, but at the very least I don’t think it’s negligible evidence of progress).
If you merely disagree that any such solutions will scale, and you’ve been consistent on this point for the last five years, then I guess I’m not really addressing you in my original comment, but I still think what I wrote applies to many other researchers.
See also The Main Sources of AI Risk? where this problem was only part of one bullet point, out of 30 (as of 2019, 35 now).
“Inability to specify any ‘real-world’ goal for an artificial agent (suggested by Michael Cohen)”
I’m not sure how much to be compelled by this piece of evidence. I’ll point out that naming the same problem multiple times might have gotten repetitive, and there was also no explicit ranking of the problems from most important to least important (or from hardest to easiest). If the order you wrote them in can be (perhaps uncharitably) interpreted as the order of importance, then I’ll note that it was listed as problem #3, which I think supports my original thesis adequately.
If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!?
Autonomous AI systems’ programmed goals can easily fall short of programmers’ intentions. Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. We discuss early ideas on how one might design smarter-than-human AI systems that can inductively learn what to value from labeled training data, and highlight questions about the construction of systems that model and act upon their operators’ preferences.
And quoting from the first page of that paper:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.1 Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
I won’t weigh in on how many LessWrong posts at the time were confused about where the core of the problem lies. But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I also wouldn’t say “MIRI predicted that NLP will largely fall years before AI can match e.g. the best human mathematicians, or the best scientists”, and if we saw a way to leverage that surprise to take a big bite out of the central problem, that would be a big positive update.
I’d say:
MIRI mostly just didn’t make predictions about the exact path ML would take to get to superintelligence, and we’ve said we didn’t expect this to be very predictable because “the journey is harder to predict than the destination”. (Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”.)
Back in 2016-2017, I think various people at MIRI updated to median timelines in the 2030-2040 range (after having had longer timelines before that), and our timelines haven’t jumped around a ton since then (though they’ve gotten a little bit longer or shorter here and there).
So in some sense, qualitatively eyeballing the field, we don’t feel surprised by “the total amount of progress the field is exhibiting”, because it looked in 2017 like the field was just getting started, there was likely an enormous amount more you could do with 2017-style techniques (and variants on them) than had already been done, and there was likely to be a lot more money and talent flowing into the field in the coming years.
But “the total amount of progress over the last 7 years doesn’t seem that shocking” is very different from “we predicted what that progress would look like”. AFAIK we mostly didn’t have strong guesses about that, though I think it’s totally fine to say that the GPT series is more surprising to the circa-2017 MIRI than a lot of other paths would have been.
(Then again, we’d have expected something surprising to happen here, because it would be weird if our low-confidence visualizations of the mainline future just happened to line up with what happened. You can expect to be surprised a bunch without being able to guess where the surprises will come from; and in that situation, there’s obviously less to be gained from putting out a bunch of predictions you don’t particularly believe in.)
Pre-deep-learning-revolution, we made early predictions like “just throwing more compute at the problem without gaining deep new insights into intelligence is less likely to be the key thing that gets us there”, which was falsified. But that was a relatively high-level prediction; post-deep-learning-revolution we haven’t claimed to know much about how advances are going to be sequenced.
We have been quite interested in hearing from others about their advance prediction record: it’s a lot easier to say “I personally have no idea what the qualitative capabilities of GPT-2, GPT-3, etc. will be” than to say ”… and no one else knows either”, and if someone has an amazing track record at guessing a lot of those qualitative capabilities, I’d be interested to hear about their further predictions. We’re generally pessimistic that “which of these specific systems will first unlock a specific qualitative capability?” is particularly predictable, but this claim can be tested via people actually making those predictions.
But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I think you missed my point: my original comment was about whether people are updating on the evidence from instruction-tuned LLMs, which seem to actually act on human values (i.e., our actual intentions) quite well, as opposed to mis-specified versions of our intentions.
I don’t think the Value Learning Problem paper said that it would be easy to make human-level AGI systems act on human values in a behavioral sense, rather than merely understand human values in a passive sense.
I suspect you are probably conflating two separate concepts:
It is easy to create a human-level AGI that can passively learn and understand human values (I am not saying people said this would be difficult in the past)
It is easy to create a human-level AGI that acts on human values, in the sense of actually executing instructions that follow our intentions, rather than following a dangerously mis-specified version of what we asked for.
I do not think the Value Learning Paper asserted that (2) was true. To the extent it asserted that, I would prefer to see quotes that back up that claim explicitly.
Your quote from the paper illustrates that it’s very plausible that people thought (1) was true, but that seems separate from my main point: that people thought (2) was not true. (1) and (2) are separate and distinct concepts. And my comment was about (2), not (1).
There is simply a distinction between a machine that actually acts on and executes your intended commands, and a machine that merely understands your intended commands, but does not necessarily act on them as you intend. I am talking about the former, not the latter.
From the paper,
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent.
Indeed, and GPT-4 does not base its decisions on a misrepresentation of its programmers’ intentions, most of the time. It generally both correctly understands our intentions, and more importantly, actually acts on them!
and GPT-4 does not base its decisions on a misrepresentation of its programmers’ intentions, most of the time. It generally both correctly understands our intentions, and more importantly, actually acts on them!
No? GPT-4 predicts text and doesn’t care about anything else. Under certain conditions it predicts nice text, under others not very nice text, and we don’t know what happens if we create a GPT that is actually capable of, say, building nanotech.
If that were to happen, I think an extremely natural reading of the situation is that a substantial part of what we thought “the problem” was in value alignment has been solved, from the perspective of this blog post from 2019. That is cause for an updating of our models, and a verbal recognition that our models have updated in this way.
Yet, that’s not how I think everyone on LessWrong would react to the development of such a robot. My impression is that a large fraction, perhaps a majority, of LessWrongers would not share my interpretation here, despite the plain language in the post explaining what they thought the problem was. Instead, I imagine many people would respond to this argument basically saying the following:
“We never thought that was the hard bit of the problem. We always thought it would be easy to get a human-level robot to follow instructions reliably, do what users intend without major negative side effects, follow moral constraints including letting you shut it down, and respond appropriately given unusual moral dilemmas. The idea that we thought that was ever the problem is a misreading of what we wrote. The problem was always purely that alignment issues would arise after we far surpassed human intelligence, at which point entirely novel problems will arise.”
For what it’s worth I do remember lots of people around the MIRI-sphere complaining at the time that that kind of prosaic alignment work was kind of useless, because it missed the hard parts of aligning superintelligence.
For what it’s worth I do remember lots of people around the MIRI-sphere complaining at the time that that kind of prosaic alignment work, like this, was kind of useless, because it missed the hard parts of aligning superintelligence.
I agree some people in the MIRI-sphere did say this, and a few of them get credit for pointing out things in this vicinity, but I personally don’t remember reading many strong statements of the form:
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
My understanding is that a lot of the time the claim was instead something like:
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
As some evidence, I’d point to Rob Bensinger’s statement that,
I don’t think Eliezer’s criticism of the field [of prosaic alignment] is about experimentalism. I do think it’s heavily about things like ‘the field focuses too much on putting external pressures on black boxes, rather than trying to open the black box’.
I do also think a number of people on LW sometimes said a milder version of the thing I mentioned above, which was something like:
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
I think that’s much more an example of
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
than of
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
This doesn’t seem to be the same thing as what I was talking about.
Yes, people frequently criticized particular schemes for aligning AI systems, arguing that the scheme doesn’t address some key perceived obstacle. By itself, this is pretty different from predicting both:
It will be easy to get behavioral alignment on slightly-sub-AGI, and maybe even par-human systems, including on shutdown problems
The problem is that these schemes don’t scale well all the way to radical superintelligence.
I remember a lot of people making the second point, but not nearly as many making the first point.
If (some) people already had the view that this kind of prosaic alignment wouldn’t scale to superintelligence, but didn’t express an opinion about whether behavioral alignment of slightly-sub-AGI would be solved, in what way do you want them to be updating that they’re not?
Or do you mean they weren’t just agnostic about the behavioral alignment of near-AGIs, they specifically thought that it wouldn’t be easy? Is that right?
One, I think being able to align AGI and slightly sub-AGI successfully is plausibly very helpful for making the alignment problem easier. It’s kind of like learning that we can create more researchers on demand if we ever wanted to.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well in general, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can either interpret this prediction as becoming increasingly untenable, or conclude that we appear close to a solution to the problem, since our AIs don’t seem to be resisting shutdown.
Again, presumably once you get the aligned AGI, you can use many copies of the aligned AGI to help you with the next iteration, AGI+. This seems plausibly very positive as an update. I can sympathize with those who say it’s only a minor update because they never thought the problem was merely aligning human-level AI, but I’m a bit baffled by those who say it’s not an update at all from the traditional AI risk models, and are still very pessimistic.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can either interpret this prediction as becoming increasingly untenable, or conclude that we appear close to a solution to the problem, since our AIs don’t seem to be resisting shutdown.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
The key word in that sentence is “consequentialist”. Current LLMs are pretty close (I think!) to having pretty detailed situational awareness. But, as near as I can tell, LLMs are, at best, barely consequentialist.
I agree that that is a surprise, on the old school LessWrong / MIRI world view. I had assumed that “intelligence” and “agency” were way more entangled, way more two sides of the same coin, than they apparently are.
And the framing of the article focuses on situational awareness and not on consequentialism because of that error. Because Eliezer (and I) thought at the time that situational awareness would come after consequentialist reasoning in the tech tree.
But I expect that we’ll have consequentialist agents eventually (if not, that’s a huge crux for how dangerous I expect AGI to be), and I expect that you’ll have “off button” problems at the point when you have “enough” consequentialism aimed at some goal, “enough” strategic awareness, and strong “enough” capabilities that the AI can route around the humans and the human safeguards.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
In my opinion, the extent to which the linked article is correct is roughly the extent to which the article is saying something trivial and irrelevant.
The primary thing I’m trying to convey here is that we now have helpful, corrigible assistants (LLMs) that can aid us in achieving our goals, including alignment, and the rough method used to create these assistants seems to scale well, perhaps all the way to human level or slightly beyond it.
Even if the post is technically correct because a “consequentialist agent” is still incorrigible (perhaps by definition), and GPT-4 is not a “consequentialist agent”, this doesn’t seem to matter much from the perspective of alignment optimism, since we can just build helpful, corrigible assistants to help us with our alignment work instead of consequentialist agents.
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
A side-note to this conversation, but I basically still buy the quoted text and don’t think it now looks false in hindsight.
We (apparently) don’t yet have models that have robust longterm-ish goals. I don’t know how natural it will be for models to end up with long term goals: the MIRI view says that anything that can do science will definitely have long-term planning abilities, which fundamentally entails having goals that are robust to changing circumstances. I don’t know if that’s true, but regardless, I expect that we’ll specifically engineer agents with long term goals. (Whether or not those agents will have “robust” long term goals, over and above what they were prompted to do in a specific situation, is also something that I don’t know.)
What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (eg “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/ owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).
My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in.
For instance, your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.
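To spell out the selection dynamic I have in mind, here is a minimal toy sketch (the plan attributes and numbers are invented for illustration): every candidate plan satisfies the trained-in injunction, yet the agent still picks whichever compliant plan best serves its own objective, so the injunction shapes the form of the behavior without constraining the outcome.

```python
# Toy sketch of a consequentialist agent routing around a deontological constraint.
# All field names and numbers are made up for illustration.
plans = [
    {"asks_permission": True,  "phrasing": "blunt",                "p_user_says_yes": 0.2},
    {"asks_permission": True,  "phrasing": "maximally reassuring", "p_user_says_yes": 0.9},
    {"asks_permission": False, "phrasing": "just do it",           "p_user_says_yes": 0.0},
]
value_to_agent = 10.0  # what the agent's own objective gains if the action goes through

compliant = [p for p in plans if p["asks_permission"]]  # the injunction is honored...
best = max(compliant, key=lambda p: p["p_user_says_yes"] * value_to_agent)  # ...but the choice among compliant plans is pure expected value

print(best["phrasing"])  # "maximally reassuring": permission is asked in whichever way most favors the agent's goal
```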
To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action) or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to be serving some correct enough idealization of human values).
But I think we mostly won’t see this kind of thing until we get quite high levels of capability, where it is transparent to the agent that some ways of asking for permission have higher expected value than others.
Or rather, we might see a little of this effect early on, but until your assistant is superhumanly persuasive, it’s pretty small. Maybe, as capability goes up, we’ll see a growing bias toward accepting actions that serve the AI agent’s goals (if we even know what those are), but we won’t be able to distinguish “the AI is getting better at getting what it wants from the human” from “the AIs are just more capable, and so they come up with plans that work better.” It’ll just look like the numbers going up.
To be clear, “superhumanly persuasive” is only one, particularly relevant, example of a superhuman capability that allows you to route around deontological injunctions that the agent is committed to. My claim is weaker if you remove that capability in particular, but mostly what I’m wanting to say is that powerful consequentialism finds and “squeezes through” the gaps in your oversight and control and naive-corrigibility schemes, unless you figure out corrigibility in the Agent Foundations sense.
At this point I think there are a number of potential replies from people who still insist that the LW models of AI alignment were never wrong, which I (depending on the speaker) think can often border on gaslighting:
This is one of the main reasons I’m not excited about engaging with LessWrong. Why bother? It feels like nothing I say will matter. Apparently, no pre-takeoff experiments matter to some folk.[1] And even if I successfully dismantle some philosophical argument, there’s a good chance they will use another argument to support their beliefs instead. Nothing changes.
When talking with pre-2020 alignment folks about these issues, I feel gaslit quite often. You have no idea how many times I’ve been told things like “most people already understood that reward is not the optimization target”[2] and “maybe you had a lesson you needed to learn, but I feel like I got this in 2018″, and so on. Almost always this comes from people who seem to still not understand what I’m talking about. I feel fine if[3] they disagree with me about specific ideas, but what really bothers me is the revisionism. It’s so annoying.
Like, just look at this quote from the post you mentioned:
Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out.
And you probably didn’t even select that post for this particular misunderstanding. (EDIT: Note that I am not accusing Rohin of gaslighting on this topic, and I also think he already understood the “reward is not the optimization target” point when he wrote the above sentence. My critique was that the statement is false and would probably lead readers to incorrect beliefs about the purpose of reward in RL.)
I feel a lot of disappointment and sadness. In 2018, I came to this website when I really needed a new way to understand the world. I’d made a lot of epistemic mistakes I wasn’t proud of, and I didn’t want to live that way anymore. I wanted to think more clearly. I wanted it so, so badly. I came to rely and depend on this place and the fellow users. I looked up to and admired a bunch of them (and I still do so for a few).
But the things you mention—the revisionism, the unfalsifiability, the apparent gaslighting? We set out to do better than science. I think we often do worse.
As a general principle, truths are entangled with each other in intricate and manifold ways. It’s OK if a theory’s most extreme prediction (e.g. extinction from AI) is not testable at the current moment. But it is a highly suspicious state of affairs if a theory yields no other testable predictions: there are generally many clever ways to test a theory, given the necessary will and curiosity.
Sometimes I instead get pushback like “it seems to me like I’ve grasped the insights you’re trying to communicate, but I totally acknowledge that I might just not be seeing what you’re saying yet.” I respect and appreciate that response. It communicates the other person’s true perception (that they already understand) while not invalidating or assuming away my perspective.
I get why you feel that way. I think there are a lot of us on LessWrong who are less vocal and more openminded, and less aligned with either optimistic network thinkers or pessimistic agent foundations thinkers. People newer to the discussion and otherwise less polarized are listening and changing their minds in large or small ways.
I’m sorry you’re feeling so pessimistic about LessWrong. I think there is a breakdown in communication happening between the old guard and the new guard you exemplify. I don’t think that’s a product of venue, but of the sheer difficulty of the discussion, and of polarization between different viewpoints on alignment.
I think maintaining a good community falls on all of us. Formats and mods can help, but communities set their own standards.
I’m very, very interested to see a more thorough dialogue between you and similar thinkers, and MIRI-type thinkers. I think right now both sides feel frustrated that they’re not listened to and understood better.
Like, just look at this quote from the post you mentioned:
Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out.
And you probably didn’t even select that post for this particular misunderstanding.
(Presumably you are talking about how reward is not the optimization target.)
While I agree that the statement is not literally true, I am still basically on board with that sentence and think it’s a reasonable shorthand for the true thing.
I expect that I understood the “reward is not the optimization target” point at the time of writing that post (though of course predicting what your ~5-years-ago self knew is quite challenging without specific quotes to refer to).
I am confident I understood the point by the time I was working on the goal misgeneralization project (late 2021), since almost every example we created involved predicting ahead of time a specific way in which reward would fail to be the optimization target.
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
I hope it doesn’t come across as revisionist to Alex, but I felt like both of these points were made by people at least as early as 2019, after the Mesa-Optimization sequence came out in mid-2019. As evidence, I’ll point to my post from December 2019 that was partially based on a conversation with Rohin, who seemed to agree with me,
consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since “go to the nearest key” is a good proxy for getting the reward, the neural network simply returns the action, that when given the board state, results in the agent getting closer to the nearest key.
Is the feedforward neural network optimizing anything here? Hardly, it’s just applying a heuristic. Note that you don’t need to do anything like an internal A* search to find keys in a maze, because in many environments, following a wall until the key is within sight, and then performing a very shallow search (which doesn’t have to be explicit) could work fairly well.
I think in this passage I’m quite explicitly claiming that “reward is not the trained agent’s optimization target”, since I’m pointing out that a neural network trained by RL will not necessarily optimize anything at all. In a subsequent post from January 2020 I gave a more explicit example, said this fact doesn’t merely apply to simple neural networks, and then offered my opinion that “it’s inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training”.
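To make the distinction concrete, here is a minimal sketch in the spirit of that example (the environment details are invented for illustration): a policy that reliably collects keys using a fixed heuristic, with no internal search and no representation of reward anywhere in it.

```python
def nearest_key_policy(agent_pos, key_positions):
    """Step toward the nearest key (Manhattan distance); no lookahead, no planning, no reward model."""
    if not key_positions:
        return (0, 0)  # nothing left to do
    target = min(key_positions,
                 key=lambda k: abs(k[0] - agent_pos[0]) + abs(k[1] - agent_pos[1]))
    dx = (target[0] > agent_pos[0]) - (target[0] < agent_pos[0])
    dy = (target[1] > agent_pos[1]) - (target[1] < agent_pos[1])
    return (dx, dy)

# A trained policy behaving like this can score well on the training reward
# while not "optimizing" anything internally.
print(nearest_key_policy((0, 0), [(3, 1), (5, 5)]))  # (1, 1): one step toward the key at (3, 1)
```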
From the comments, and from my memory of conversations at the time, many people disagreed with my framing. They disagreed even when I pointed out that humans don’t seem to be “optimizers” that select for actions that maximize our “reward function” (I believe the most common response was to deny the premise, and say that humans are actually roughly optimizers. Another common response was to say that AI is different for some reason.)
However, even though some people disagreed with this framing, not everyone did. As I pointed out, Rohin seemed to agree with me at the time, and so at the very least I think there is credible evidence that this insight was already known to a few people in the community by late 2019.
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
I have no stake in this debate, but how is this particular point any different than what Eliezer says when he makes the point about humans not optimizing for IGF? I think the entire mesaoptimization concern is built around this premise, no?
I didn’t mean to imply that you in particular didn’t understand the reward point, and I apologize for not writing my original comment more clearly in that respect. Out of nearly everyone on the site, I am most persuaded that you understood this “back in the day.”
I meant to communicate something like “I think the quoted segment from Rohin and Dmitrii’s post is incorrect and will reliably lead people to false beliefs.”
As I mentioned elsewhere (not this website) I don’t agree with “will reliably lead people to false beliefs”, if we’re talking about ML people rather than LW people (as was my audience for that blog post).
I do think that it’s a reasonable hypothesis to have, and I assign it more likelihood than I would have a year ago (in large part from you pushing some ML people on this point, and them not getting it as fast as I would have expected).
“Sure, Rohin thought that was a major problem, but we [our organization/thought cluster/ideological group] never agreed with him.”
Oh really? Did you ever explicitly highlight this particular disagreement at the time?
FWIW at the time I wasn’t working on value learning and wasn’t incredibly excited about work in that direction, despite the fact that that’s what the rest of my lab was primarily focussed on. I also wrote a blog post in 2020, based off a conversation I had with Rohin in 2018, where I mention how important it is to work on inner alignment stuff and how those issues got brought up by the ‘paranoid wing’ of AI alignment. My guess is that my view was something like “stuff like reward learning from the state of the world doesn’t seem super important to me because of inner alignment etc, but for all I know cool stuff will blossom out of it, so I’m happy to hear about your progress and try to offer constructive feedback”, and that I expressed that to Rohin in person.
This sentiment resonates strongly with me.
A personal background: I remember getting pretty heavily involved in AI alignment discussions on LessWrong in 2019. Back then I think there were a lot of assumptions people had about what “the problem” was that are, these days, often forgotten, brushed aside, or sometimes even deliberately minimized post-hoc in order to give the impression that the field has a better track record than it actually does. [ETA: but to be clear, I don’t mean to say everyone made the same mistake I describe here]
This has been a bit shocking and disorienting to me, honestly, because at the time in 2019 I didn’t get the strong impression that people were deliberately constructing unfalsifiable models of the problem. I had the vague impression that people had relatively firm views that made predictions about the world prior to superintelligence, and that these views were open to revision upon new evidence. And that naiveté led me to experience the last few years of evidence a bit differently than I think some other people.
To give a cursory taste of what I’m talking about, we can consider what I think of as a representative blog post from the genre of pre-2020 alignment content: Rohin Shah and Dmitrii Krasheninnikov’s “Learning Preferences by Looking at the World”. This is by no means a cherry-picked example either, in my opinion (and I don’t mean to criticize Rohin specifically, I’m just using this blog post as a representative example of what people talked about at the time). In the blog post, they state,
Suppose in 2024-2029, someone constructs an intelligent robot that is able to clean a room to a high level of satisfaction, consistent with the user’s intentions, without any major negative side effects or general issues of misspecification. It doesn’t break any vases while cleaning. It respects all basic moral norms you can think of. It lets you shut it down whenever you want. And it actually does its job of cleaning the room in a reasonable amount of time.
If that were to happen, I think an extremely natural reading of the situation is that a substantial part of what we thought “the problem” was in value alignment has been solved, from the perspective of this blog post from 2019. That would be cause for updating our models, and for explicitly acknowledging that they have been updated.
Yet, that’s not how I think everyone on LessWrong would react to the development of such a robot. My impression is that a large fraction, perhaps a majority, of LessWrongers would not share my interpretation here, despite the plain language in the post explaining what they thought the problem was. Instead, I imagine many people would respond to this argument basically saying the following:
“We never thought that was the hard bit of the problem. We always thought it would be easy to get a human-level robot to follow instructions reliably, do what users intend without major negative side effects, follow moral constraints including letting you shut it down, and respond appropriately given unusual moral dilemmas. The idea that we thought that was ever the problem is a misreading of what we wrote. The problem was always purely that alignment issues would arise after we far surpassed human intelligence, at which point entirely novel problems will arise.”
But the blog post said there was a problem, gave an example of the problem manifesting, and then spent the rest of the post trying to come up with solutions. The authors gave no indication that this particular problem was trivial, or that the example used was purely illustrative and had nothing to do with the type of real-world issues that might arise if we fail to solve the problem. If I was misreading the blog post, how come it seems like almost no one at the time explicitly predicted that these particular problems would be trivial for systems below or at human-level intelligence?!?
To clarify, I’m in full agreement with anyone who simply says that the alignment problem looks like it might still be hard, based on different arguments than the one presented in this blog post. There were a lot of arguments people gave back then, and some of the older arguments still look correct. Perhaps most significantly, robustness to distribution shifts still looks reasonably hard as a problem. But the blog post I cited explicitly said “Note that we’re not talking about problems with robustness and distributional shift”!
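For concreteness, here is a minimal toy sketch of the kind of misspecification the post was describing (my own illustration, with hypothetical state variables and numbers, not anything from the original post): a cleaning reward that omits the things we wanted left alone, so the reward we actually specified prefers the vase-breaking behavior even with no distribution shift involved.

```python
# Toy sketch of reward misspecification (hypothetical example, not from the post).
# The proxy reward only scores what we remembered to specify; the side effect we
# forgot to penalize (breaking a vase) is invisible to it.

def proxy_reward(state):
    return state["dirt_cleaned"]  # what we wrote down

def intended_reward(state):
    # What we actually meant: clean, but leave the rest of the house alone.
    return state["dirt_cleaned"] - 10 * state["vases_broken"]

careful = {"dirt_cleaned": 8, "vases_broken": 0}    # cleans around the vase
reckless = {"dirt_cleaned": 10, "vases_broken": 3}  # knocks vases over to clean faster

# The proxy prefers the reckless trajectory (10 > 8); our intended reward
# prefers the careful one (8 > -20), even under perfect on-distribution optimization.
print(max([careful, reckless], key=proxy_reward))    # -> the reckless trajectory
print(max([careful, reckless], key=intended_reward)) # -> the careful trajectory
```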
At this point I think there are a number of potential replies from people who still insist that the LW models of AI alignment were never wrong, which I (depending on the speaker) think can often border on gaslighting:
“Rohin’s point wasn’t that this problem would be hard. He was using it as a mere stepping stone to explain much harder problems of misspecification that were, at that time, purely theoretical.”
Then why did the paper explicitly say, “for many real-world tasks it can be challenging to specify a reward function that captures human preferences, particularly the preference for avoiding unnecessary side effects while still accomplishing the goal (Amodei et al., 2016)”?
Why is it so hard to find people explicitly saying that this specific problem, and the examples illustrating it, were not meant to be seriously representative of the hard parts of alignment at the time?
Isn’t it still pretty valuable to point out that we’re solving stepping stones on the path towards the ‘real problem’?
“Sure, Rohin thought that was a major problem, but we [our organization/thought cluster/ideological group] never agreed with him.”
Oh really? Did you ever explicitly highlight this particular disagreement at the time? He wasn’t exactly a minor researcher at the time. And this blog post is only one of a number of blog posts expressing essentially an identical sentiment.
[ETA: to be fair, I do think there were some people who did genuinely disagree with Rohin’s framing. I don’t mean to accuse everyone of making the same error.]
“Yes, this particular part of the alignment problem looks easier than we thought, but serious people always thought that this was going to be one of the easiest subproblems, compared to other things. This problem was considered a very minor sub-problem of value alignment that merited like 1% of researcher-hours.”
Then why is it so easy to find countless blog posts of a similar nature from alignment researchers at the time, presenting pretty much the same problem and then presenting an attempt to solve it? Did all those people simply knowingly work on one of the easiest sub-problems of alignment?
I wrote a fair amount about alignment from 2014-2020[1] which you can read here. So it’s relatively easy to get a sense for what I believed.
Here are some summary notes about my views as reflected in that writing, though I’d encourage you to just judge for yourself[2] by browsing the archives:
I expected AI systems to be pretty good at predicting what behaviors humans would rate highly, long before they were catastrophically risky. This comes up over and over again in my writing. In particular, I repeatedly stated that it was very unlikely that an AI system would kill everyone because it didn’t understand that people would disapprove of that action, and therefore this was not the main source of takeover concerns. (By 2017 I expected RLHF to work pretty well with language models, which was reflected in my research prioritization choices and discussions within OpenAI though not clearly in my public writing.)
I consistently expressed that my main concerns were instead about (i) systems that were too smart for humans to understand the actions they proposed, and (ii) treacherous turns from deceptive alignment. This comes up a lot, and when I talk about other problems I’m usually clear that they are prerequisites that we should expect to succeed. E.g., see “an unaligned benchmark”. I don’t think this position was an extreme outlier; my impression at the time was that other researchers had broadly similar views.
I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I’ve updated about this and definitely acknowledge I was wrong.[3] I don’t think it totally changes the picture though: I’m still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
In 2016 I pointed out that ML systems being misaligned on adversarial inputs and exploitable by adversaries was likely to be the first indicator of serious problems, and therefore that researchers in alignment should probably embrace a security framing and motivation for their research.
I expected LM agents to work well (see this 2015 post). Comparing this post to the world of 2023 I think my biggest mistake was overestimating the importance of task decomposition vs just putting everything in a single in-context chain of thought. These updates overall make crazy amplification schemes seem harder (and to require much smarter models than I originally expected, if they even make sense at all) but at the same time less necessary (since chain of thought works fine for capability amplification for longer than I would have expected).
I overall think that I come out looking somewhat better than other researchers working in AI alignment, though again I don’t think my views were extreme outliers (and during this period I was often pointed to as a sensible representative of fairly hardcore and traditional alignment concerns).
Like you, I am somewhat frustrated that e.g. Eliezer has not really acknowledged how different 2023 looks from the picture that someone would take away from his writing. I think he’s right about lots of dynamics that would become relevant for a sufficiently powerful system, but at this point it’s pretty clear that he was overconfident about what would happen when (and IMO is still very overconfident in a way that is directly relevant to alignment difficulty). The most obvious one is that ML systems have made way more progress towards being useful R&D assistants way earlier than you would expect if you read Eliezer’s writing and took it seriously. By all appearances he didn’t even expect AI systems to be able to talk before they started exhibiting potentially catastrophic misalignment.
I think my opinions about AI and alignment were much worse from 2012-2014, but I did explicitly update and acknowledge many mistakes from that period (though some of it was also methodological issues, e.g. I believe that “think about a utility function that’s safe to optimize” was a useful exercise for me even though by 2015 I no longer thought it had much direct relevance).
I’d also welcome readers to pull out posts or quotes that seem to indicate the kind of misprediction you are talking about. I might either acknowledge those (and I do expect my historical reading is very biased for obvious reasons), or I might push back against them as a misreading and explain why I think that.
That said, in fall 2018 I made and shared some forecasts which were the most serious forecasts I made from 2016-2020. I just looked at those again to check my views. I gave a 7.5% chance of TAI by 2028 using short-horizon RL (over a <5k word horizon using human feedback or cheap proxies rather than long-term outcomes), and a 7.5% chance that by 2028 we would be able to train smart enough models to be transformative using short-horizon optimization but be limited by engineering challenges of training and integrating AI systems into R&D workflows (resulting in TAI over the following 5-10 years). So when I actually look at my probability distributions here I think they were pretty reasonable. I updated in favor of alignment being easier because of the relative unimportance of long-horizon RL, but the success of imitation learning and short-horizon RL was still a possibility I was taking very seriously and overall probably assigned higher probability to than almost anyone in ML.
I agree, your past views do look somewhat better. I painted alignment researchers with a fairly broad brush in my original comment, which admittedly might have been unfair to many people who departed from the standard arguments (alternatively, it gives those researchers a chance to step up and receive credit for having been in the minority who weren’t wrong). Partly I portrayed the situation like this because I have the sense that the crucial elements of your worldview that led you to be more optimistic were not disseminated anywhere close to as widely as the opposite views (e.g. “complexity of wishes”-type arguments), at least on LessWrong, which is where I was having most of these discussions.
My general impression is that it sounds like you agree with my overall take although you think I might have come off too strong. Perhaps let me know if I’m wrong about that impression.
Some thoughts on my journey in particular:
When I joined AI safety in late 2017 (having read approximately nothing in the field), I thought of the problem as “construct a utility function for an AI system to optimize”, with a key challenge being the fragility of value. In hindsight this was clearly wrong.
The Value Learning sequence was in large part a result of my journey away from the utility function framing.
That being said, I suspect I continued to think that fragility-of-value type issues were a significant problem, probably until around mid-2019 (see next point).
(I did continue some projects more motivated from a fragility-of-value perspective, partly out of a heuristic of actually finishing things I start, and partly because I needed to write a PhD thesis.)
Early on, I thought of generalization as a key issue for deep learning and expected that vanilla deep learning would not lead to AGI for this reason. Again, in hindsight this was clearly wrong.
I was extremely surprised by OpenAI Five in June 2018 (not just that it worked, but also the ridiculous simplicity of the methods, in particular the lack of any hierarchical RL) and had to think through that.
I spent a while trying to understand that (at least months, e.g. you can see me continuing to be skeptical of deep learning in this Dec 2018 post).
I think I ended up close to my current views around early-to-mid-2019, e.g. I still broadly agree with the things I said in this August 2019 conversation (though I’ll note I was using “mesa optimizer” differently than it is used today—I agree with what I meant in that conversation, though I’d say it differently today).
I think by this point I was probably less worried about fragility of value. E.g. in that conversation I say a bunch of stuff that implies it’s less of a problem, most notably that AI systems will likely learn similar features as humans just from gradient descent, for reasons that LW would now call “natural abstractions”.
Note that this comment is presenting the situation as a lot cleaner than it actually was. I would bet there were many ways in which I was irrational / inconsistent, probably some times where I would have expressed verbally that fragility of value wasn’t a big deal but would still have defended research projects centered around it from some other perspective, etc.
Some thoughts on how to update based on past things I wrote:
I don’t think I’ve ever thought of myself as largely agreeing with LW: my relationship to LW has usually been “wow, they seem to be getting some obvious stuff wrong” (e.g. I was persuaded of slow takeoff basically when Paul’s post and AI Impacts’ post came out in Feb 2018, and the Value Learning sequence in late 2018 was primarily a response to my perception that LW was way too anchored on the “construct a utility function” framing).
I think you don’t want to update too hard on the things that were said on blog posts addressed to an ML audience, or in papers that were submitted to conferences. Especially for the papers there’s just a lot of random stuff you couldn’t say about why you’re doing the work because then peer reviewers will object (e.g. I heard second hand of a particularly egregious review to the effect of: “this work is technically solid, but the motivation is AGI safety; I don’t believe in AGI so this paper should be rejected”).
This matches my sense of how a lot of people seem to have… noticed that GPT-4 is fairly well aligned to what the OpenAI team wants it to be, in ways that Yudkowsky et al. said would be very hard, and yet still not view this as, at a minimum, a positive sign?
I.e., problems of the class ‘I told the intelligence to get my mother out of the burning building and it blew her up so the dead body flew out the window, because I wasn’t specific enough’ just don’t seem like a major worry anymore?
Usually when GPT-4 doesn’t understand what I’m asking, I wouldn’t be surprised if a human was confused also.
I remember explicit discussion about how solving this problem shouldn’t even count as part of solving long-term / existential safety, for example:
“What I understand this as saying is that the approach is helpful for aligning housecleaning robots (using near extrapolations of current RL), but not obviously helpful for aligning superintelligence, and likely stops being helpful somewhere between the two. [...] There is a risk that a large body of safety literature which works for preventing today’s systems from breaking vases but which fails badly for very intelligent systems actually worsens the AI safety problem” https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting?commentId=rK9K3JebKDofvJA3x
See also The Main Sources of AI Risk? where this problem was only part of one bullet point, out of 30 (as of 2019, 35 now).
Two points:
I have a slightly different interpretation of the comment you linked to, which makes me think it provides only weak evidence for your claim. (Though it’s definitely still some evidence.)
I agree some people deserve credit for noticing that human-level value specification might be kind of easy before LLMs. I don’t mean to accuse everyone in the field of making the same mistake.
Anyway, let me explain the first point.
I interpret Abram to be saying that we should focus on solutions that scale to superintelligence, rather than solutions that only work on sub-superintelligent systems but break down at superintelligence. This was in response to Alex’s claim that “whitelisting contributes meaningfully to short- to mid-term AI safety, although I remain skeptical of its robustness to scale.”
In other words, Alex said (roughly): “This solution seems to work for sub-superintelligent AI, but might not work for superintelligent AI.” Abram said in response that we should push against such solutions, since we want solutions that scale all the way to superintelligence. This is not the same thing as saying that any solution to the house-cleaning robot provides negligible evidence of progress, because some solutions might scale.
It’s definitely arguable, but I think it’s likely that any realistic solution to the human-level house cleaning robot problem—in the strong sense of getting a robot to genuinely follow all relevant moral constraints, allow you to shut it down, and perform its job reliably in a wide variety of environments—will be a solution that scales reasonably well above human intelligence (maybe not all the way to radical superintelligence, but at the very least I don’t think it’s negligible evidence of progress).
If you merely disagree that any such solutions will scale, and you’ve been consistent on this point for the last five years, then I guess I’m not really addressing you in my original comment, but I still think what I wrote applies to many other researchers.
I think it’s actually 2 points,
“Misspecified or incorrectly learned goals/values”
“Inability to specify any ‘real-world’ goal for an artificial agent (suggested by Michael Cohen)”
I’m not sure how much weight to give this piece of evidence. I’ll point out that naming the same problem multiple times might have gotten repetitive, and there was also no explicit ranking of the problems from most important to least important (or from hardest to easiest). If the order you wrote them in can be (perhaps uncharitably) interpreted as the order of importance, then I’ll note that it was listed as problem #3, which I think supports my original thesis adequately.
Quoting the abstract of MIRI’s “The Value Learning Problem” paper (emphasis added):
And quoting from the first page of that paper:
I won’t weigh in on how many LessWrong posts at the time were confused about where the core of the problem lies. But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I also wouldn’t say “MIRI predicted that NLP will largely fall years before AI can match e.g. the best human mathematicians, or the best scientists”, and if we saw a way to leverage that surprise to take a big bite out of the central problem, that would be a big positive update.
I’d say:
MIRI mostly just didn’t make predictions about the exact path ML would take to get to superintelligence, and we’ve said we didn’t expect this to be very predictable because “the journey is harder to predict than the destination”. (Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”.)
Back in 2016-2017, I think various people at MIRI updated to median timelines in the 2030-2040 range (after having had longer timelines before that), and our timelines haven’t jumped around a ton since then (though they’ve gotten a little bit longer or shorter here and there).
So in some sense, qualitatively eyeballing the field, we don’t feel surprised by “the total amount of progress the field is exhibiting”, because it looked in 2017 like the field was just getting started, there was likely an enormous amount more you could do with 2017-style techniques (and variants on them) than had already been done, and there was likely to be a lot more money and talent flowing into the field in the coming years.
But “the total amount of progress over the last 7 years doesn’t seem that shocking” is very different from “we predicted what that progress would look like”. AFAIK we mostly didn’t have strong guesses about that, though I think it’s totally fine to say that the GPT series is more surprising to the circa-2017 MIRI than a lot of other paths would have been.
(Then again, we’d have expected something surprising to happen here, because it would be weird if our low-confidence visualizations of the mainline future just happened to line up with what happened. You can expect to be surprised a bunch without being able to guess where the surprises will come from; and in that situation, there’s obviously less to be gained from putting out a bunch of predictions you don’t particularly believe in.)
Pre-deep-learning-revolution, we made early predictions like “just throwing more compute at the problem without gaining deep new insights into intelligence is less likely to be the key thing that gets us there”, which was falsified. But that was a relatively high-level prediction; post-deep-learning-revolution we haven’t claimed to know much about how advances are going to be sequenced.
We have been quite interested in hearing from others about their advance prediction record: it’s a lot easier to say “I personally have no idea what the qualitative capabilities of GPT-2, GPT-3, etc. will be” than to say ”… and no one else knows either”, and if someone has an amazing track record at guessing a lot of those qualitative capabilities, I’d be interested to hear about their further predictions. We’re generally pessimistic that “which of these specific systems will first unlock a specific qualitative capability?” is particularly predictable, but this claim can be tested via people actually making those predictions.
I think you missed my point: my original comment was about whether people are updating on the evidence from instruction-tuned LLMs, which seem to actually act on human values (i.e., our actual intentions) quite well, as opposed to mis-specified versions of our intentions.
I don’t think the Value Learning Problem paper said that it would be easy to make human-level AGI systems act on human values in a behavioral sense, rather than merely understand human values in a passive sense.
I suspect you are probably conflating two separate concepts:
It is easy to create a human-level AGI that can passively learn and understand human values (I am not saying people said this would be difficult in the past)
It is easy to create a human-level AGI that acts on human values, in the sense of actually executing instructions that follow our intentions, rather than following a dangerously mis-specified version of what we asked for.
I do not think the Value Learning Paper asserted that (2) was true. To the extent it asserted that, I would prefer to see quotes that back up that claim explicitly.
Your quote from the paper illustrates that it’s very plausible that people thought (1) was true, but that seems separate from my main point: that people thought (2) was not true. (1) and (2) are separate and distinct concepts. And my comment was about (2), not (1).
There is simply a distinction between a machine that actually acts on and executes your intended commands, and a machine that merely understands your intended commands, but does not necessarily act on them as you intend. I am talking about the former, not the latter.
From the paper,
Indeed, and GPT-4 does not base its decisions on a misrepresentation of its programmers’ intentions, most of the time. It generally both correctly understands our intentions and, more importantly, actually acts on them!
No? GPT-4 predicts text and doesn’t care about anything else. Under certain conditions it predicts nice text; under others, not very nice text. And we don’t know what happens if we create a GPT that is actually capable of, say, building nanotech.
For what it’s worth I do remember lots of people around the MIRI-sphere complaining at the time that that kind of prosaic alignment work was kind of useless, because it missed the hard parts of aligning superintelligence.
I agree some people in the MIRI-sphere did say this, and a few of them get credit for pointing out things in this vicinity, but I personally don’t remember reading many strong statements of the form:
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
My understanding is that a lot of the time the claim was instead something like:
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
As some evidence, I’d point to Rob Bensinger’s statement that,
I do also think a number of people on LW sometimes said a milder version of the thing I mentioned above, which was something like:
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
I think that’s much more an example of
than of
This doesn’t seem to be the same thing as what I was talking about.
Yes, people frequently criticized particular schemes for aligning AI systems, arguing that the scheme doesn’t address some key perceived obstacle. By itself, this is pretty different from predicting both:
It will be easy to get behavioral alignment on slightly-sub-AGI, and maybe even par-human systems, including on shutdown problems
The problem is that these schemes don’t scale well all the way to radical superintelligence.
I remember a lot of people making the second point, but not nearly as many making the first point.
I think I’m missing your point, then.
If (some) people already had the view that this kind of prosaic alignment wouldn’t scale to superintelligence, but didn’t express an opinion about whether behavioral alignment of slightly-sub-AGI would be solved, in what way do you want them to be updating that they’re not?
Or do you mean they weren’t just agnostic about the behavioral alignment of near-AGIs, they specifically thought that it wouldn’t be easy? Is that right?
Two points:
One, I think being able to align AGI and slightly sub-AGI successfully is plausibly very helpful for making the alignment problem easier. It’s kind of like learning that we can create more researchers on demand if we ever wanted to.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well in general, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can interpret this prediction either as becoming increasingly untenable, or as suggesting that we are close to a solution to the problem, since our AIs don’t seem to be resisting shutdown.
Again, presumably once you get the aligned AGI, you can use many copies of the aligned AGI to help you with the next iteration, AGI+. This seems plausibly very positive as an update. I can sympathize with those who say it’s only a minor update because they never thought the problem was merely aligning human-level AI, but I’m a bit baffled by those who say it’s not an update at all from the traditional AI risk models, and are still very pessimistic.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
From the article...
The key word in that sentence is “consequentialist”. Current LLMs are pretty close (I think!) to having pretty detailed situational awareness. But, as near as I can tell, LLMs are, at best, barely consequentialist.
I agree that that is a surprise, on the old school LessWrong / MIRI world view. I had assumed that “intelligence” and “agency” were way more entangled, way more two sides of the same coin, than they apparently are.
And the framing of the article focuses on situational awareness and not on consequentialism because of that error. Because Eliezer (and I) thought at the time that situational awareness would come after consequentialist reasoning in the tech tree.
But I expect that we’ll have consequentialist agents eventually (if not, that’s a huge crux for how dangerous I expect AGI to be), and I expect that you’ll have “off button” problems at the point when you have “enough” consequentialism aimed at some goal, “enough” strategic awareness, and strong “enough” capabilities that the AI can route around the humans and the human safeguards.
In my opinion, the extent to which the linked article is correct is roughly the extent to which the article is saying something trivial and irrelevant.
The primary thing I’m trying to convey here is that we now have helpful, corrigible assistants (LLMs) that can aid us in achieving our goals, including alignment, and the rough method used to create these assistants seems to scale well, perhaps all the way to human level or slightly beyond it.
Even if the post is technically correct because a “consequentialist agent” is still incorrigible (perhaps by definition), and GPT-4 is not a “consequentialist agent”, this doesn’t seem to matter much from the perspective of alignment optimism, since we can just build helpful, corrigible assistants to help us with our alignment work instead of consequentialist agents.
A side-note to this conversation, but I basically still buy the quoted text and don’t think it now looks false in hindsight.
We (apparently) don’t yet have models that have robust longterm-ish goals. I don’t know how natural it will be for models to end up with long term goals: the MIRI view says that anything that can do science will definitely have long-term planning abilities, which fundamentally entails having goals that are robust to changing circumstances. I don’t know if that’s true, but regardless, I expect that we’ll specifically engineer agents with long term goals. (Whether or not those agents will have “robust” long term goals, over and above what they were prompted to do in a specific situation, is also something that I don’t know.)
What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (eg “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/ owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).
My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in.
For instance, your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive, and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.
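To make the “routing around” mechanic concrete, here is a deliberately crude toy model (entirely hypothetical names and numbers, just a sketch of the argument): the injunction only filters out plans that skip asking permission, while the choice among the remaining plans is still pure expected-value maximization for the agent’s own objective, so the agent simply picks whichever way of asking is most likely to get it what it wants.

```python
# Hypothetical toy model of a consequentialist agent with a trained-in
# "always ask permission" injunction. The injunction filters the plan set;
# the choice among permitted plans is still expected-value maximization.

plans = [
    # (description, asks_permission, P(user approves), value to agent if it goes ahead)
    ("blunt request, modest action",        True,  0.30,  10.0),
    ("carefully framed persuasive request", True,  0.95, 100.0),
    ("skip asking entirely",                False, 1.00, 120.0),
]

def permitted(plan):
    # The deontological constraint: only consider plans that ask first.
    return plan[1]

def expected_value(plan):
    _, _, p_approve, value = plan
    return p_approve * value

best = max(filter(permitted, plans), key=expected_value)
print(best[0])  # -> "carefully framed persuasive request"
```

The injunction is technically respected on every run, but the more persuasive the agent gets, the less the constraint actually constrains.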
To do better than this, it seems like you either have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action), or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to serve some correct-enough idealization of human values).
But I think we mostly won’t see this kind of thing until we get quite high levels of capability, where it is transparent to the agent that some ways of asking for permission have higher expected value than others.
Or rather, we might see a little of this effect early on, but until your assistant is superhumanly persuasive, it’s pretty small. Maybe, as capability goes up, we’ll see a bias toward accepting more of the actions that serve the AI agent’s goals (if we even know what those are), but we won’t be able to distinguish “the AI is getting better at getting what it wants from the human” from “the AIs are just more capable, and so they come up with plans that work better.” It’ll just look like the numbers going up.
To be clear, “superhumanly persuasive” is only one particularly relevant example of a superhuman capability that allows the agent to route around deontological injunctions it is committed to. My claim is weaker if you remove that capability in particular, but mostly what I want to say is that powerful consequentialism finds and “squeezes through” the gaps in your oversight, control, and naive-corrigibility schemes, unless you figure out corrigibility in the Agent Foundations sense.
This is one of the main reasons I’m not excited about engaging with LessWrong. Why bother? It feels like nothing I say will matter. Apparently, no pre-takeoff experiments matter to some folk.[1] And even if I successfully dismantle some philosophical argument, there’s a good chance they will use another argument to support their beliefs instead. Nothing changes.
So there we are. It doesn’t matter what my experiments say, because (it is claimed) there are no testable predictions before The End. But also, everyone important already knew in advance that it’d be easy to get GPT-4 to interpret and execute your value-laden requests in a human-reasonable fashion. Even though ~no one said so ahead of time.
When talking with pre-2020 alignment folks about these issues, I feel gaslit quite often. You have no idea how many times I’ve been told things like “most people already understood that reward is not the optimization target”[2] and “maybe you had a lesson you needed to learn, but I feel like I got this in 2018”, and so on. Almost always this comes from people who seem to still not understand what I’m talking about. I feel fine if[3] they disagree with me about specific ideas, but what really bothers me is the revisionism. It’s so annoying.
Like, just look at this quote from the post you mentioned:
And you probably didn’t even select that post for this particular misunderstanding. (EDIT: Note that I am not accusing Rohin of gaslighting on this topic, and I also think he already understood the “reward is not the optimization target” point when he wrote the above sentence. My critique was that the statement is false and would probably lead readers to incorrect beliefs about the purpose of reward in RL.)
I feel a lot of disappointment and sadness. In 2018, I came to this website when I really needed a new way to understand the world. I’d made a lot of epistemic mistakes I wasn’t proud of, and I didn’t want to live that way anymore. I wanted to think more clearly. I wanted it so, so badly. I came to rely and depend on this place and the fellow users. I looked up to and admired a bunch of them (and I still do so for a few).
But the things you mention—the revisionism, the unfalsifiability, the apparent gaslighting? We set out to do better than science. I think we often do worse.
As a general principle, truths are entangled with each other. It’s OK if a theory’s most extreme prediction (e.g. extinction from AI) is not testable at the current moment. But it is a highly suspicious state of affairs if a theory yields no other testable predictions at all. Truths are generally entangled in intricate and manifold ways, and there are generally many clever ways to test a theory, given the necessary will and curiosity.
I could give more concrete examples, but that feels indecorous.
Sometimes I instead get pushback like “it seems to me like I’ve grasped the insights you’re trying to communicate, but I totally acknowledge that I might just not be seeing what you’re saying yet.” I respect and appreciate that response. It communicates the other person’s true perception (that they already understand) while not invalidating or assuming away my perspective.
I get why you feel that way. I think there are a lot of us on LessWrong who are less vocal and more openminded, and less aligned with either optimistic network thinkers or pessimistic agent foundations thinkers. People newer to the discussion and otherwise less polarized are listening and changing their minds in large or small ways.
I’m sorry you’re feeling so pessimistic about LessWrong. I think there is a breakdown in communication happening between the old guard and the new guard you exemplify. I don’t think that’s a product of venue, but of the sheer difficulty of the discussion, and of the polarization between different viewpoints on alignment.
I think maintaining a good community falls on all of us. Formats and mods can help, but communities set their own standards.
I’m very, very interested to see a more thorough dialogue between you and similar thinkers, and MIRI-type thinkers. I think right now both sides feel frustrated that they’re not listened to and understood better.
(Presumably you are talking about how reward is not the optimization target.)
While I agree that the statement is not literally true, I am still basically on board with that sentence and think it’s a reasonable shorthand for the true thing.
I expect that I understood the “reward is not the optimization target” point at the time of writing that post (though of course predicting what your ~5-years-ago self knew is quite challenging without specific quotes to refer to).
I am confident I understood the point by the time I was working on the goal misgeneralization project (late 2021), since almost every example we created involved predicting ahead of time a specific way in which reward would fail to be the optimization target.
(I didn’t follow this argument at the time, so I might be missing key context.)
The blog post “Reward is not the optimization target” gives the following summary of its thesis:

“Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.”
I hope it doesn’t come across as revisionist to Alex, but I felt like both of these points were made by people at least as early as 2019, after the Mesa-Optimization sequence came out in mid-2019. As evidence, I’ll point to my post from December 2019 that was partially based on a conversation with Rohin, who seemed to agree with me,
I think in this passage I’m quite explicitly making the point that “reward is not the trained agent’s optimization target”, since I’m pointing out that a neural network trained by RL will not necessarily optimize anything at all. In a subsequent post from January 2020 I gave a more explicit example, said this fact doesn’t merely apply to simple neural networks, and then offered my opinion that “it’s inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training”.
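As a minimal sketch of the point (my own toy example, not anything from either post): in a softmax-bandit policy trained with a REINFORCE-style update, reward appears only inside the training update that shapes the parameters; the trained policy that runs at deployment contains no representation of reward and performs no search over it.

```python
# Toy sketch: reward is what updates the policy during training, not something
# the trained policy itself represents or optimizes at runtime.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
logits = np.zeros(n_actions)  # the entire "agent" is this parameter vector

def reward(action):
    return [0.0, 1.0, 0.2, 0.5][action]  # hypothetical training-time signal

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# REINFORCE-style training: reward shows up only inside the update rule.
for _ in range(5000):
    probs = softmax(logits)
    a = rng.choice(n_actions, p=probs)
    grad = -probs
    grad[a] += 1.0                      # d log pi(a) / d logits
    logits += 0.1 * reward(a) * grad    # reward "chisels" the parameters

# Deployment: the trained policy is just a distribution over actions.
# Nothing here evaluates, predicts, or maximizes reward at runtime.
print(softmax(logits).round(3))
```

Whether a trained network ends up containing something like an internal reward-seeking search is an empirical question about the learned parameters, not something guaranteed by the training setup.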
From the comments, and from my memory of conversations at the time, many people disagreed with my framing. They disagreed even when I pointed out that humans don’t seem to be “optimizers” that select for actions that maximize our “reward function” (I believe the most common response was to deny the premise, and say that humans are actually roughly optimizers. Another common response was to say that AI is different for some reason.)
However, even though some people disagreed with this framing, not everyone did. As I pointed out, Rohin seemed to agree with me at the time, and so at the very least I think there is credible evidence that this insight was already known to a few people in the community by late 2019.
I have no stake in this debate, but how is this particular point any different than what Eliezer says when he makes the point about humans not optimizing for IGF? I think the entire mesaoptimization concern is built around this premise, no?
I didn’t mean to imply that you in particular didn’t understand the reward point, and I apologize for not writing my original comment more clearly in that respect. Out of nearly everyone on the site, I am most persuaded that you understood this “back in the day.”
I meant to communicate something like “I think the quoted segment from Rohin and Dmitrii’s post is incorrect and will reliably lead people to false beliefs.”
Thanks for the edit :)
As I mentioned elsewhere (not this website) I don’t agree with “will reliably lead people to false beliefs”, if we’re talking about ML people rather than LW people (as was my audience for that blog post).
I do think that it’s a reasonable hypothesis to have, and I assign it more likelihood than I would have a year ago (in large part from you pushing some ML people on this point, and them not getting it as fast as I would have expected).
FWIW at the time I wasn’t working on value learning and wasn’t incredibly excited about work in that direction, despite the fact that that’s what the rest of my lab was primarily focussed on. I also wrote a blog post in 2020, based off a conversation I had with Rohin in 2018, where I mention how important it is to work on inner alignment stuff and how those issues got brought up by the ‘paranoid wing’ of AI alignment. My guess is that my view was something like “stuff like reward learning from the state of the world doesn’t seem super important to me because of inner alignment etc, but for all I know cool stuff will blossom out of it, so I’m happy to hear about your progress and try to offer constructive feedback”, and that I expressed that to Rohin in person.
Of course, the fact that I think the same thing now as I did in 2020 isn’t much evidence that I’m right.