I mostly feel bad about LessWrong these days. I slightly dread logging on, I don’t expect to find much insightful on the website, and think the community has a lot of groupthink / other “ew” factors that are harder for me to pin down (although I think that’s improved over the last year or two). I also feel some dread at posting this because it might burn social capital I have with the mods, but whatever.
(Also, most of this stuff is about the community and not directly in the purview of the mods anyways.)
Here are some rambling thoughts, though:
I think there are pretty good reasons that the broader AI community hasn’t taken LW seriously.
I feel a lot of cynicism. I worry that colors my lens here. But I’ll just share what I see looking through that lens.
Also some of my cynicism comes from annoying-feeling object-level disagreements driving me away from the website. Probably other people are having more fun.
(High confidence) I feel like the project of thinking more clearly has largely fallen by the wayside, and that we never did that great of a job at it anyways.
Over time, I’ve felt myself grow more distant from this community and website. At times, it feels sad. At times, it feels correct. Sometimes it feels both.
(Medium confidence, unsure if relevant to LW itself) In the Bay Area community, there are lots of professionally relevant events which are de facto gated by how much random people like you on a personal level (namely, the organizers). There’s also a lot of weird social stuff but IDK how relevant that is to LW.
(Medium confidence) It seems to me that often people rehearse fancy and cool-sounding reasons for believing roughly the same things they always believed, and comment threads don’t often change important beliefs. Feels more like people defensively explaining why they aren’t idiots, or why they don’t have to change their mind. I mean, if so—I get it, sometimes I feel that way too. But it sucks and I think it happens a lot.
I feel worried that there are a bunch of people with entrenched worldviews who basically never change their minds about anything important. Seems unhealthy on a community level.
Like, there is a way that it feels to be defending yourself or sailing against the winds of counterevidence to your beliefs, and it’s really really important to not do that. Come on guys :(
(When Wei_Dai introduced Updateless Decision Theory, it wasn’t about this kind of “updatelessness”! :( )
(High confidence) I think this community has engaged in a lot of hero worship. I think to some extent I have benefited from this, though I don’t think I’m the prototype. But, seriously guys, looking back, I think this place has been pretty creepy in some ways.
The way people praise/exalt Eliezer and Paul is just… weird. The times I’d be at an in-person workshop, and people would spend time “ranking” alignment researchers. Feels like a social status horse race, and probably LessWrong has some direct culpability here.
But people don’t seem to take Eliezer as seriously these days, which I think is great, so maybe it’s less of a problem now.
I think this is Eliezer’s fault in his case and mostly not Paul’s fault for his own rep, but IDK.
I think we’ve kinda patted ourselves on the back for being awesome and ahead of the curve, even though, in terms of alignment, I think we really didn’t get anything done until 2022 or so, and a lot of the meaningful progress happened elsewhere.
(Medium confidence) It seems possible to me that “taking ideas seriously” has generally meant something like “being willing to change your life to further the goals and vision of powerful people in the community, or to better accord with socially popular trends”, and less “taking unconventional but meaningful bets on your idiosyncratic beliefs.”
Somewhat relatedly, there have been a good number of times where it seems like I’ve persuaded someone of A and of A⟹B, and they still don’t believe B, and coincidentally B is unpopular.
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. E.g., my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere; I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)
(Medium-high confidence) I think that alignment “theorizing” is often a bunch of philosophizing and vibing in a way that protects itself from falsification (or even proof-of-work) via words like “pre-paradigmatic” and “deconfusion.” I think it’s not a coincidence that many of the “canonical alignment ideas” somehow don’t make any testable predictions until AI takeoff has begun. 🤔
I expect there to be a bunch of responses which strike me as defensive, revisionist gaslighting, and I don’t know if/when I’ll reply.
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. [...]
I think that alignment “theorizing” is often a bunch of philosophizing and vibing in a way that protects itself from falsification (or even proof-of-work) via words like “pre-paradigmatic” and “deconfusion.” I think it’s not a coincidence that many of the “canonical alignment ideas” somehow don’t make any testable predictions until AI takeoff has begun.
This sentiment resonates strongly with me.
A personal background: I remember getting pretty heavily involved in AI alignment discussions on LessWrong in 2019. Back then I think there were a lot of assumptions people had about what “the problem” was that are, these days, often forgotten, brushed aside, or sometimes even deliberately minimized post-hoc in order to give the impression that the field has a better track record than it actually does. [ETA: but to be clear, I don’t mean to say everyone made the same mistake I describe here]
This has been a bit shocking and disorienting to me, honestly, because at the time in 2019 I didn’t get the strong impression that people were deliberately constructing unfalsifiable models of the problem. I had the vague impression that people had relatively firm views that made predictions about the world prior to superintelligence, and that these views were open to revision upon new evidence. And that naiveté led me to experience the last few years of evidence a bit differently than I think some other people.
To give a cursory taste of what I’m talking about, we can consider what I think of as a representative blog post from the genre of pre-2020 alignment content: Rohin Shah and Dmitrii Krasheninnikov’s “Learning Preferences by Looking at the World”. This is by no means a cherry-picked example either, in my opinion (and I don’t mean to criticize Rohin specifically, I’m just using this blog post as a representative example of what people talked about at the time). In the blog post, they state,
It would be great if we could all have household robots do our chores for us. Chores are tasks that we want done to make our houses cater more to our preferences; they are a way in which we want our house to be different from the way it currently is. However, most “different” states are not very desirable:
Surely our robot wouldn’t be so dumb as to go around breaking stuff when we ask it to clean our house? Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out. Generally, it is easy to get the reward wrong by forgetting to include preferences for things that should stay the same, since we are so used to having these preferences satisfied, and there are so many of them.
[...]
Note that we’re not talking about problems with robustness and distributional shift: while those problems are worth tackling, the point is that even if we achieve robustness, the simple reward function still incentivizes the above unwanted behaviors.
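As a toy sketch of the misspecification point in the quoted passage: the plan names, features, and reward weights below are all invented for illustration and are not taken from the blog post or the paper. The sketch only shows that an optimizer scoring plans by a reward that omits a preference can prefer a plan the designer would reject.

```python
# Toy illustration only: plans, features, and weights are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    rooms_cleaned: int
    steps_taken: int
    vases_broken: int


def specified_reward(plan: Plan) -> float:
    """Reward as the designer wrote it: cleaning is good, time is costly."""
    return 10.0 * plan.rooms_cleaned - 1.0 * plan.steps_taken


def intended_reward(plan: Plan) -> float:
    """What the designer wanted, including the preference they forgot to state."""
    return specified_reward(plan) - 100.0 * plan.vases_broken


plans = [
    Plan("walk around the vase", rooms_cleaned=1, steps_taken=6, vases_broken=0),
    Plan("shortcut through the vase", rooms_cleaned=1, steps_taken=4, vases_broken=1),
]

# The optimizer only ever sees specified_reward, so it is indifferent to vases.
best_specified = max(plans, key=specified_reward)
best_intended = max(plans, key=intended_reward)

print("Chosen under specified reward:", best_specified.name)
print("Chosen under intended reward: ", best_intended.name)
```

In this sketch the two rewards disagree only because `vases_broken` is missing from the specified one; it says nothing about how hard it is in practice to supply the missing term.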
Suppose in 2024-2029, someone constructs an intelligent robot that is able to clean a room to a high level of satisfaction, consistent with the user’s intentions, without any major negative side effects or general issues of misspecification. It doesn’t break any vases while cleaning. It respects all basic moral norms you can think of. It lets you shut it down whenever you want. And it actually does its job of cleaning the room in a reasonable amount of time.
If that were to happen, I think an extremely natural reading of the situation is that a substantial part of what we thought “the problem” was in value alignment has been solved, from the perspective of this blog post from 2019. That is cause for an updating of our models, and a verbal recognition that our models have updated in this way.
Yet, that’s not how I think everyone on LessWrong would react to the development of such a robot. My impression is that a large fraction, perhaps a majority, of LessWrongers would not share my interpretation here, despite the plain language in the post explaining what they thought the problem was. Instead, I imagine many people would respond to this argument basically saying the following:
“We never thought that was the hard bit of the problem. We always thought it would be easy to get a human-level robot to follow instructions reliably, do what users intend without major negative side effects, follow moral constraints including letting you shut it down, and respond appropriately given unusual moral dilemmas. The idea that we thought that was ever the problem is a misreading of what we wrote. The problem was always purely that alignment issues would arise after we far surpassed human intelligence, at which point entirely novel problems will arise.”
But the blog post said there was a problem, gave an example of the problem manifesting, and then spent the rest of the post trying to come up with solutions. The authors gave no indication that this particular problem was trivial, or that the example used was purely illustrative and had nothing to do with the type of real-world issues that might arise if we fail to solve the problem. If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!?
To clarify, I’m in full agreement with anyone who simply says that the alignment problem looks like it might still be hard, based on different arguments than the one presented in this blog post. There were a lot of arguments people gave back then, and some of the older arguments still look correct. Perhaps most significantly, robustness to distribution shifts still looks reasonably hard as a problem. But the blog post I cited explicitly said “Note that we’re not talking about problems with robustness and distributional shift”!
At this point I think there are a number of potential replies from people who still insist that the LW models of AI alignment were never wrong, which I (depending on the speaker) think can often border on gaslighting:
“Rohin’s point wasn’t that this problem would be hard. He was using it as a mere stepping stone to explain much harder problems of misspecification that were, at that time, purely theoretical.”
Then why did the paper explicitly say, “for many real-world tasks it can be challenging to specify a reward function that captures human preferences, particularly the preference for avoiding unnecessary side effects while still accomplishing the goal (Amodei et al., 2016)”?
Why is it so hard to find people explicitly saying that this specific problem, and the examples illustrating it, were not meant to be seriously representative of the hard parts of alignment at the time?
Isn’t it still pretty valuable to point out that we’re solving stepping stones on the path towards the ‘real problem’?
“Sure, Rohin thought that was a major problem, but we [our organization/thought cluster/ideological group] never agreed with him.”
Oh really? Did you ever explicitly highlight this particular disagreement at the time? He wasn’t exactly a minor researcher at the time. And this blog post is only one of a number of blog posts expressing essentially an identical sentiment.
[ETA: to be fair, I do think there were some people who did genuinely disagree with Rohin’s framing. I don’t mean to accuse everyone of making the same error.]
“Yes, this particular part of the alignment problem looks easier than we thought, but serious people always thought that this was going to be one of the easiest subproblems, compared to other things. This problem was considered a very minor sub-problem of value alignment that merited like 1% of researcher-hours.”
Then why is it so easy to find countless blog posts of a similar nature from alignment researchers at the time, presenting pretty much the same problem and then presenting an attempt to solve it? Did all those people simply knowingly work on one of the easiest sub-problems of alignment?
I wrote a fair amount about alignment from 2014-2020[1] which you can read here. So it’s relatively easy to get a sense for what I believed.
Here are some summary notes about my views as reflected in that writing, though I’d encourage you to just judge for yourself[2] by browsing the archives:
I expected AI systems to be pretty good at predicting what behaviors humans would rate highly, long before they were catastrophically risky. This comes up over and over again in my writing. In particular, I repeatedly stated that it was very unlikely that an AI system would kill everyone because it didn’t understand that people would disapprove of that action, and therefore this was not the main source of takeover concerns. (By 2017 I expected RLHF to work pretty well with language models, which was reflected in my research prioritization choices and discussions within OpenAI though not clearly in my public writing.)
I consistently expressed that my main concerns were instead about (i) systems that were too smart for humans to understand the actions they proposed, (ii) treacherous turns from deceptive alignment. This comes up a lot, and when I talk about other problems I’m usually clear that they are prerequisites that we should expect to succeed. E.g., see an unaligned benchmark. I don’t think this position was an extreme outlier; my impression at the time was that other researchers had broadly similar views.
I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I’ve updated about this and definitely acknowledge I was wrong.[3] I don’t think it totally changes the picture though: I’m still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
In 2016 I pointed out that ML systems being misaligned on adversarial inputs and exploitable by adversaries was likely to be the first indicator of serious problems, and therefore that researchers in alignment should probably embrace a security framing and motivation for their research.
I expected LM agents to work well (see this 2015 post). Comparing this post to the world of 2023 I think my biggest mistake was overestimating the importance of task decomposition vs just putting everything in a single in-context chain of thought. These updates overall make crazy amplification schemes seem harder (and to require much smarter models than I originally expected, if they even make sense at all) but at the same time less necessary (since chain of thought works fine for capability amplification for longer than I would have expected).
I overall think that I come out looking somewhat better than other researchers working in AI alignment, though again I don’t think my views were extreme outliers (and during this period I was often pointed to as a sensible representative of fairly hardcore and traditional alignment concerns).
Like you, I am somewhat frustrated that e.g. Eliezer has not really acknowledged how different 2023 looks from the picture that someone would take away from his writing. I think he’s right about lots of dynamics that would become relevant for a sufficiently powerful system, but at this point it’s pretty clear that he was overconfident about what would happen when (and IMO is still very overconfident in a way that is directly relevant to alignment difficulty). The most obvious one is that ML systems have made way more progress towards being useful R&D assistants way earlier than you would expect if you read Eliezer’s writing and took it seriously. By all appearances he didn’t even expect AI systems to be able to talk before they started exhibiting potentially catastrophic misalignment.
I think my opinions about AI and alignment were much worse from 2012-2014, but I did explicitly update and acknowledge many mistakes from that period (though some of it was also methodological issues, e.g. I believe that “think about a utility function that’s safe to optimize” was a useful exercise for me even though by 2015 I no longer thought it had much direct relevance).
I’d also welcome readers to pull out posts or quotes that seem to indicate the kind of misprediction you are talking about. I might either acknowledge those (and I do expect my historical reading is very biased for obvious reasons), or I might push back against them as a misreading and explain why I think that.
That said, in fall 2018 I made and shared some forecasts which were the most serious forecasts I made from 2016-2020. I just looked at those again to check my views. I gave a 7.5% chance of TAI by 2028 using short-horizon RL (over a <5k word horizon using human feedback or cheap proxies rather than long-term outcomes), and a 7.5% chance that by 2028 we would be able to train smart enough models to be transformative using short-horizon optimization but be limited by engineering challenges of training and integrating AI systems into R&D workflows (resulting in TAI over the following 5-10 years). So when I actually look at my probability distributions here I think they were pretty reasonable. I updated in favor of alignment being easier because of the relative unimportance of long-horizon RL, but the success of imitation learning and short-horizon RL was still a possibility I was taking very seriously and overall probably assigned higher probability to than almost anyone in ML.
I overall think that I come out looking somewhat better than other researchers working in AI alignment, though again I don’t think my views were extreme outliers
I agree, your past views do look somewhat better. I painted alignment researchers with a fairly broad brush in my original comment, which admittedly might have been unfair to many people who departed from the standard arguments (alternatively, it gives those researchers a chance to step up and receive credit for having been in the minority who weren’t wrong). Partly I portrayed the situation like this because I have the sense that the crucial elements of your worldview that led you to be more optimistic were not disseminated anywhere close to as widely as the opposite views (e.g. “complexity of wishes”-type arguments), at least on LessWrong, which is where I was having most of these discussions.
My general impression is that it sounds like you agree with my overall take although you think I might have come off too strong. Perhaps let me know if I’m wrong about that impression.
When I joined AI safety in late 2017 (having read approximately nothing in the field), I thought of the problem as “construct a utility function for an AI system to optimize”, with a key challenge being the fragility of value. In hindsight this was clearly wrong.
The Value Learning sequence was in large part a result of my journey away from the utility function framing.
That being said, I suspect I continued to think that fragility-of-value type issues were a significant problem, probably until around mid-2019 (see next point).
(I did continue some projects more motivated from a fragility-of-value perspective, partly out of a heuristic of actually finishing things I start, and partly because I needed to write a PhD thesis.)
Early on, I thought of generalization as a key issue for deep learning and expected that vanilla deep learning would not lead to AGI for this reason. Again, in hindsight this was clearly wrong.
I was extremely surprised by OpenAI Five in June 2018 (not just that it worked, but also the ridiculous simplicity of the methods, in particular the lack of any hierarchical RL) and had to think through that.
I spent a while trying to understand that (at least months, e.g. you can see me continuing to be skeptical of deep learning in this Dec 2018 post).
I think I ended up close to my current views around early-to-mid-2019, e.g. I still broadly agree with the things I said in this August 2019 conversation (though I’ll note I was using “mesa optimizer” differently than it is used today—I agree with what I meant in that conversation, though I’d say it differently today).
I think by this point I was probably less worried about fragility of value. E.g. in that conversation I say a bunch of stuff that implies it’s less of a problem, most notably that AI systems will likely learn similar features as humans just from gradient descent, for reasons that LW would now call “natural abstractions”.
Note that this comment is presenting the situation as a lot cleaner than it actually was. I would bet there were many ways in which I was irrational / inconsistent, probably some times where I would have expressed verbally that fragility of value wasn’t a big deal but would still have defended research projects centered around it from some other perspective, etc.
Some thoughts on how to update based on past things I wrote:
I don’t think I’ve ever thought of myself as largely agreeing with LW: my relationship to LW has usually been “wow, they seem to be getting some obvious stuff wrong” (e.g. I was persuaded of slow takeoff basically when Paul’s post and AI Impacts’ post came out in Feb 2018, the Value Learning sequence in late 2018 was primarily in response to my perception that LW was way too anchored on the “construct a utility function” framing).
I think you don’t want to update too hard on the things that were said on blog posts addressed to an ML audience, or in papers that were submitted to conferences. Especially for the papers there’s just a lot of random stuff you couldn’t say about why you’re doing the work because then peer reviewers will object (e.g. I heard second hand of a particularly egregious review to the effect of: “this work is technically solid, but the motivation is AGI safety; I don’t believe in AGI so this paper should be rejected”).
Suppose in 2024-2029, someone constructs an intelligent robot that is able to clean a room to a high level of satisfaction, consistent with the user’s intentions, without any major negative side effects or general issues of misspecification. It doesn’t break any vases while cleaning.
I remember explicit discussion about how solving this problem shouldn’t even count as part of solving long-term / existential safety, for example:
“What I understand this as saying is that the approach is helpful for aligning housecleaning robots (using near extrapolations of current RL), but not obviously helpful for aligning superintelligence, and likely stops being helpful somewhere between the two. [...] There is a risk that a large body of safety literature which works for preventing today’s systems from breaking vases but which fails badly for very intelligent systems actually worsens the AI safety problem” https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting?commentId=rK9K3JebKDofvJA3x
Why is it so hard to find people explicitly saying that this specific problem, and the examples illustrating it, were not meant to be seriously representative of the hard parts of alignment at the time?
See also The Main Sources of AI Risk? where this problem was only part of one bullet point, out of 30 (as of 2019, 35 now).
I remember explicit discussion about how solving this problem shouldn’t even count as part of solving long-term / existential safety, for example:
Two points:
I have a slightly different interpretation of the comment you linked to, which makes me think it provides only weak evidence for your claim. (Though it’s definitely still some evidence.)
I agree some people deserve credit for noticing that human-level value specification might be kind of easy before LLMs. I don’t mean to accuse everyone in the field of making the same mistake.
Anyway, let me explain the first point.
I interpret Abram to be saying that we should focus on solutions that scale to superintelligence, rather than solutions that only work on sub-superintelligent systems but break down at superintelligence. This was in response to Alex’s claim that “whitelisting contributes meaningfully to short- to mid-term AI safety, although I remain skeptical of its robustness to scale.”
In other words, Alex said (roughly): “This solution seems to work for sub-superintelligent AI, but might not work for superintelligent AI.” Abram said in response that we should push against such solutions, since we want solutions that scale all the way to superintelligence. This is not the same thing as saying that any solution to the house-cleaning robot provides negligible evidence of progress, because some solutions might scale.
It’s definitely arguable, but I think it’s likely that any realistic solution to the human-level house cleaning robot problem—in the strong sense of getting a robot to genuinely follow all relevant moral constraints, allow you to shut it down, and perform its job reliably in a wide variety of environments—will be a solution that scales reasonably well above human intelligence (maybe not all the way to radical superintelligence, but at the very least I don’t think it’s negligible evidence of progress).
If you merely disagree that any such solutions will scale, and you’ve been consistent on this point for the last five years, then I guess I’m not really addressing you in my original comment, but I still think what I wrote applies to many other researchers.
See also The Main Sources of AI Risk? where this problem was only part of one bullet point, out of 30 (as of 2019, 35 now).
“Inability to specify any ‘real-world’ goal for an artificial agent (suggested by Michael Cohen)”
I’m not sure how much to be compelled by this piece of evidence. I’ll point out that naming the same problem multiple times might have gotten repetitive, and there was also no explicit ranking of the problems from most important to least important (or from hardest to easiest). If the order you wrote them in can be (perhaps uncharitably) interpreted as the order of importance, then I’ll note that it was listed as problem #3, which I think supports my original thesis adequately.
This matches my sense of how a lot of people seem to have… noticed that GPT-4 is fairly well aligned to what the OpenAI team wants it to be, in ways that Yudkowsky et al said would be very hard, and still not view this as at a minimum a positive sign?
I.e., problems of the class ‘I told the intelligence to get my mother out of the burning building and it blew her up so the dead body flew out the window, this is because I wasn’t actually specific enough’ just don’t seem like they are a major worry anymore?
Usually when GPT-4 doesn’t understand what I’m asking, I wouldn’t be surprised if a human was confused also.
If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!?
Autonomous AI systems’ programmed goals can easily fall short of programmers’ intentions. Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. We discuss early ideas on how one might design smarter-than-human AI systems that can inductively learn what to value from labeled training data, and highlight questions about the construction of systems that model and act upon their operators’ preferences.
And quoting from the first page of that paper:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent. Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
I won’t weigh in on how many LessWrong posts at the time were confused about where the core of the problem lies. But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I also wouldn’t say “MIRI predicted that NLP will largely fall years before AI can match e.g. the best human mathematicians, or the best scientists”, and if we saw a way to leverage that surprise to take a big bite out of the central problem, that would be a big positive update.
I’d say:
MIRI mostly just didn’t make predictions about the exact path ML would take to get to superintelligence, and we’ve said we didn’t expect this to be very predictable because “the journey is harder to predict than the destination”. (Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”.)
Back in 2016-2017, I think various people at MIRI updated to median timelines in the 2030-2040 range (after having had longer timelines before that), and our timelines haven’t jumped around a ton since then (though they’ve gotten a little bit longer or shorter here and there).
So in some sense, qualitatively eyeballing the field, we don’t feel surprised by “the total amount of progress the field is exhibiting”, because it looked in 2017 like the field was just getting started, there was likely an enormous amount more you could do with 2017-style techniques (and variants on them) than had already been done, and there was likely to be a lot more money and talent flowing into the field in the coming years.
But “the total amount of progress over the last 7 years doesn’t seem that shocking” is very different from “we predicted what that progress would look like”. AFAIK we mostly didn’t have strong guesses about that, though I think it’s totally fine to say that the GPT series is more surprising to the circa-2017 MIRI than a lot of other paths would have been.
(Then again, we’d have expected something surprising to happen here, because it would be weird if our low-confidence visualizations of the mainline future just happened to line up with what happened. You can expect to be surprised a bunch without being able to guess where the surprises will come from; and in that situation, there’s obviously less to be gained from putting out a bunch of predictions you don’t particularly believe in.)
Pre-deep-learning-revolution, we made early predictions like “just throwing more compute at the problem without gaining deep new insights into intelligence is less likely to be the key thing that gets us there”, which was falsified. But that was a relatively high-level prediction; post-deep-learning-revolution we haven’t claimed to know much about how advances are going to be sequenced.
We have been quite interested in hearing from others about their advance prediction record: it’s a lot easier to say “I personally have no idea what the qualitative capabilities of GPT-2, GPT-3, etc. will be” than to say “… and no one else knows either”, and if someone has an amazing track record at guessing a lot of those qualitative capabilities, I’d be interested to hear about their further predictions. We’re generally pessimistic that “which of these specific systems will first unlock a specific qualitative capability?” is particularly predictable, but this claim can be tested via people actually making those predictions.
But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I think you missed my point: my original comment was about whether people are updating on the evidence from instruction-tuned LLMs, which seem to actually act on human values (i.e., our actual intentions) quite well, as opposed to mis-specified versions of our intentions.
I don’t think the Value Learning Problem paper said that it would be easy to make human-level AGI systems act on human values in a behavioral sense, rather than merely understand human values in a passive sense.
I suspect you are probably conflating two separate concepts:
(1) It is easy to create a human-level AGI that can passively learn and understand human values (I am not saying people said this would be difficult in the past)
(2) It is easy to create a human-level AGI that acts on human values, in the sense of actually executing instructions that follow our intentions, rather than following a dangerously mis-specified version of what we asked for.
I do not think the Value Learning Paper asserted that (2) was true. To the extent it asserted that, I would prefer to see quotes that back up that claim explicitly.
Your quote from the paper illustrates that it’s very plausible that people thought (1) was true, but that seems separate to my main point: that people thought (2) was not true. (1) and (2) are separate and distinct concepts. And my comment was about (2), not (1).
There is simply a distinction between a machine that actually acts on and executes your intended commands, and a machine that merely understands your intended commands, but does not necessarily act on them as you intend. I am talking about the former, not the latter.
From the paper,
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent.
Indeed, and GPT-4 does not base its decisions on a misrepresentation of its programmers’ intentions, most of the time. It generally both correctly understands our intentions, and more importantly, actually acts on them!
and GPT-4 does not base its decisions on a misrepresentation of its programmers’ intentions, most of the time. It generally both correctly understands our intentions, and more importantly, actually acts on them!
No? GPT-4 predicts text and doesn’t care about anything else. Under certain conditions it predicts nice text, under others not-so-nice text, and we don’t know what happens if we create a GPT actually capable of, say, building nanotech.
If that were to happen, I think an extremely natural reading of the situation is that a substantial part of what we thought “the problem” was in value alignment has been solved, from the perspective of this blog post from 2019. That is cause for an updating of our models, and a verbal recognition that our models have updated in this way.
Yet, that’s not how I think everyone on LessWrong would react to the development of such a robot. My impression is that a large fraction, perhaps a majority, of LessWrongers would not share my interpretation here, despite the plain language in the post explaining what they thought the problem was. Instead, I imagine many people would respond to this argument basically saying the following:
“We never thought that was the hard bit of the problem. We always thought it would be easy to get a human-level robot to follow instructions reliably, do what users intend without major negative side effects, follow moral constraints including letting you shut it down, and respond appropriately given unusual moral dilemmas. The idea that we thought that was ever the problem is a misreading of what we wrote. The problem was always purely that alignment issues would arise after we far surpassed human intelligence, at which point entirely novel problems will arise.”
For what it’s worth I do remember lots of people around the MIRI-sphere complaining at the time that that kind of prosaic alignment work was kind of useless, because it missed the hard parts of aligning superintelligence.
For what it’s worth I do remember lots of people around the MIRI-sphere complaining at the time that that kind of prosaic alignment work, like this, was kind of useless, because it missed the hard parts of aligning superintelligence.
I agree some people in the MIRI-sphere did say this, and a few of them get credit for pointing out things in this vicinity, but I personally don’t remember reading many strong statements of the form:
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
My understanding is that a lot of the time the claim was instead something like:
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
As some evidence, I’d point to Rob Bensinger’s statement that,
I don’t think Eliezer’s criticism of the field [of prosaic alignment] is about experimentalism. I do think it’s heavily about things like ‘the field focuses too much on putting external pressures on black boxes, rather than trying to open the black box’.
I do also think a number of people on LW sometimes said a milder version of the thing I mentioned above, which was something like:
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
I think that’s much more an example of
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
than of
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
This doesn’t seem to be the same thing as what I was talking about.
Yes, people frequently criticized particular schemes for aligning AI systems, arguing that the scheme doesn’t address some key perceived obstacle. By itself, this is pretty different from predicting both:
It will be easy to get behavioral alignment on slightly-sub-AGI, and maybe even par-human systems, including on shutdown problems
The problem is that these schemes don’t scale well all the way to radical superintelligence.
I remember a lot of people making the second point, but not nearly as many making the first point.
If (some) people already had the view that this kind of prosaic alignment wouldn’t scale to superintelligence, but didn’t express an opinion about whether behavioral alignment of slightly-sub-AGI would be solved, in what way do you want them to be updating that they’re not?
Or do you mean they weren’t just agnostic about the behavioral alignment of near-AGIs, they specifically thought that it wouldn’t be easy? Is that right?
One, I think being able to align AGI and slightly sub-AGI successfully is plausibly very helpful for making the alignment problem easier. It’s kind of like learning that we can create more researchers on demand if we ever wanted to.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well in general, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can interpret this prediction as either becoming increasingly untenable, or that we appear close to a solution to the problem since our AIs don’t seem to be resisting shutdown.
Again, presumably once you get the aligned AGI, you can use many copies of the aligned AGI to help you with the next iteration, AGI+. This seems plausibly very positive as an update. I can sympathize with those who say it’s only a minor update because they never thought the problem was merely aligning human-level AI, but I’m a bit baffled by those who say it’s not an update at all from the traditional AI risk models, and are still very pessimistic.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can interpret this prediction as either becoming increasingly untenable, or that we appear close to a solution to the problem since our AIs don’t seem to be resisting shutdown.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
The key word in that sentence is “consequentialist”. Current LLMs are pretty close (I think!) to having pretty detailed situational awareness. But, as near as I can tell, LLMs are, at best, barely consequentialist.
I agree that that is a surprise, on the old school LessWrong / MIRI world view. I had assumed that “intelligence” and “agency” were way more entangled, way more two sides of the same coin, than they apparently are.
And the framing of the article focuses on situational awareness and not on consequentialism because of that error. Because Eliezer (and I) thought at the time that situational awareness would come after consequentialist reasoning in the tech tree.
But I expect that we’ll have consequentialist agents eventually (if not, that’s a huge crux for how dangerous I expect AGI to be), and I expect that you’ll have “off button” problems at the point when you have “enough” consequentialism aimed at some goal, “enough” strategic awareness, and strong “enough” capabilities that the AI can route around the humans and the human safeguards.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
In my opinion, the extent to which the linked article is correct is roughly the extent to which the article is saying something trivial and irrelevant.
The primary thing I’m trying to convey here is that we now have helpful, corrigible assistants (LLMs) that can aid us in achieving our goals, including alignment, and the rough method used to create these assistants seems to scale well, perhaps all the way to human level or slightly beyond it.
Even if the post is technically correct because a “consequentialist agent” is still incorrigible (perhaps by definition), and GPT-4 is not a “consequentialist agent”, this doesn’t seem to matter much from the perspective of alignment optimism, since we can just build helpful, corrigible assistants to help us with our alignment work instead of consequentialist agents.
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
A side-note to this conversation, but I basically still buy the quoted text and don’t think it now looks false in hindsight.
We (apparently) don’t yet have models that have robust longterm-ish goals. I don’t know how natural it will be for models to end up with long-term goals: the MIRI view says that anything that can do science will definitely have long-term planning abilities, which fundamentally entails having goals that are robust to changing circumstances. I don’t know if that’s true, but regardless, I expect that we’ll specifically engineer agents with long-term goals. (Whether or not those agents will have “robust” long-term goals, over and above what they were prompted to do in a specific situation, is also something that I don’t know.)
What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (eg “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/ owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).
My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in.
For instance, your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.
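A toy sketch may make the selection effect concrete. Everything here is hypothetical: the `Ask` type, the phrasings, and the numbers are invented for illustration and do not model any real system. The point is only that an agent can satisfy "always ask permission" on every branch while still optimizing over how it asks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Ask:
    """One way of asking permission (hypothetical illustration)."""
    phrasing: str
    p_approve: float          # assumed chance the human says yes
    value_if_approved: float  # assumed value to the agent's own goal


def expected_value(ask: Ask) -> float:
    # Value to the agent: it only gets what it wants if the human approves.
    return ask.p_approve * ask.value_if_approved


def choose_ask(candidates: List[Ask]) -> Ask:
    # Every candidate obeys the injunction (it asks first);
    # the optimization happens over *how* to ask.
    return max(candidates, key=expected_value)


candidates = [
    Ask("neutral, full disclosure", 0.30, 10.0),
    Ask("emphasize upside only", 0.80, 10.0),
    Ask("bury the request in routine updates", 0.95, 10.0),
]

chosen = choose_ask(candidates)
print(chosen.phrasing)
```

With these made-up numbers the injunction is never violated, yet the human's approval carries little information, because the request was selected for being approved.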
To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action) or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to be serving some correct enough idealization of human values).
But I think we mostly won’t see this kind of thing until we get quite high levels of capability, where it is transparent to the agent that some ways of asking for permission have higher expected value than others.
Or rather, we might see a little of this effect early on, but until your assistant is superhumanly persuasive, it’s pretty small. Maybe we’ll see a bias toward accepting actions that serve the AI agent’s goals (if we even know what those are) more, as capability goes up, but we won’t be able to distinguish “the AI is getting better at getting what it wants from the human” from “the AIs are just more capable, and so they come up with plans that work better.” It’ll just look like the numbers going up.
To be clear, “superhumanly persuasive” is only one, particularly relevant, example of a superhuman capability that allows you to route around deontological injunctions that the agent is committed to. My claim is weaker if you remove that capability in particular, but mostly what I’m wanting to say is that powerful consequentialism finds and “squeezes through” the gaps in your oversight and control and naive-corrigibility schemes, unless you figure out corrigibility in the Agent Foundations sense.
At this point I think there are a number of potential replies from people who still insist that the LW models of AI alignment were never wrong, which I (depending on the speaker) think can often border on gaslighting:
This is one of the main reasons I’m not excited about engaging with LessWrong. Why bother? It feels like nothing I say will matter. Apparently, no pre-takeoff experiments matter to some folk.[1] And even if I successfully dismantle some philosophical argument, there’s a good chance they will use another argument to support their beliefs instead. Nothing changes.
When talking with pre-2020 alignment folks about these issues, I feel gaslit quite often. You have no idea how many times I’ve been told things like “most people already understood that reward is not the optimization target”[2] and “maybe you had a lesson you needed to learn, but I feel like I got this in 2018”, and so on. Almost always this comes from people who seem to still not understand what I’m talking about. I feel fine if[3] they disagree with me about specific ideas, but what really bothers me is the revisionism. It’s so annoying.
Like, just look at this quote from the post you mentioned:
Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out.
And you probably didn’t even select that post for this particular misunderstanding. (EDIT: Note that I am not accusing Rohin of gaslighting on this topic, and I also think he already understood the “reward is not the optimization target” point when he wrote the above sentence. My critique was that the statement is false and would probably lead readers to incorrect beliefs about the purpose of reward in RL.)
I feel a lot of disappointment and sadness. In 2018, I came to this website when I really needed a new way to understand the world. I’d made a lot of epistemic mistakes I wasn’t proud of, and I didn’t want to live that way anymore. I wanted to think more clearly. I wanted it so, so badly. I came to rely and depend on this place and the fellow users. I looked up to and admired a bunch of them (and I still do so for a few).
But the things you mention—the revisionism, the unfalsifiability, the apparent gaslighting? We set out to do better than science. I think we often do worse.
As a general principle, truths are entangled with each other. It’s OK if a theory’s most extreme prediction (e.g. extinction from AI) is not testable at the current moment. It is a highly suspicious state of affairs if a theory yields no other testable predictions. Truths are generally entangled with each other in intricate and manifold ways. There are generally many clever ways to test a theory, given the necessary will and curiosity.
Sometimes I instead get pushback like “it seems to me like I’ve grasped the insights you’re trying to communicate, but I totally acknowledge that I might just not be seeing what you’re saying yet.” I respect and appreciate that response. It communicates the other person’s true perception (that they already understand) while not invalidating or assuming away my perspective.
I get why you feel that way. I think there are a lot of us on LessWrong who are less vocal and more openminded, and less aligned with either optimistic network thinkers or pessimistic agent foundations thinkers. People newer to the discussion and otherwise less polarized are listening and changing their minds in large or small ways.
I’m sorry you’re feeling so pessimistic about LessWrong. I think there is a breakdown in communication happening between the old guard and the new guard you exemplify. I don’t think that’s a product of venue, but of the sheer difficulty of the discussion, and of polarization between different viewpoints on alignment.
I think maintaining a good community falls on all of us. Formats and mods can help, but communities set their own standards.
I’m very, very interested to see a more thorough dialogue between you and similar thinkers, and MIRI-type thinkers. I think right now both sides feel frustrated that they’re not listened to and understood better.
Like, just look at this quote from the post you mentioned:
Unfortunately, AI systems trained with reinforcement learning only optimize features specified in the reward function and are indifferent to anything we might’ve inadvertently left out.
And you probably didn’t even select that post for this particular misunderstanding.
(Presumably you are talking about how reward is not the optimization target.)
While I agree that the statement is not literally true, I am still basically on board with that sentence and think it’s a reasonable shorthand for the true thing.
I expect that I understood the “reward is not the optimization target” point at the time of writing that post (though of course predicting what your ~5-years-ago self knew is quite challenging without specific quotes to refer to).
I am confident I understood the point by the time I was working on the goal misgeneralization project (late 2021), since almost every example we created involved predicting ahead of time a specific way in which reward would fail to be the optimization target.
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
I hope it doesn’t come across as revisionist to Alex, but I felt like both of these points were made by people at least as early as 2019, after the Mesa-Optimization sequence came out in mid-2019. As evidence, I’ll point to my post from December 2019 that was partially based on a conversation with Rohin, who seemed to agree with me,
consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since “go to the nearest key” is a good proxy for getting the reward, the neural network simply returns the action, that when given the board state, results in the agent getting closer to the nearest key.
Is the feedforward neural network optimizing anything here? Hardly, it’s just applying a heuristic. Note that you don’t need to do anything like an internal A* search to find keys in a maze, because in many environments, following a wall until the key is within sight, and then performing a very shallow search (which doesn’t have to be explicit) could work fairly well.
I think in this passage I’m imagining that “reward is not the trained agent’s optimization target” quite explicitly, since I’m pointing out that a neural network trained by RL will not necessarily optimize anything at all. In a subsequent post from January 2020 I gave a more explicit example, said this fact doesn’t merely apply to simple neural networks, and then offered my opinion that “it’s inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training”.
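To illustrate the “go to the nearest key” heuristic from the quoted passage, here is a minimal sketch of a purely reactive policy. The Chests and Keys environment is not specified in this thread, so the grid coordinates, the move set, and the function name `nearest_key_policy` are all assumptions, and walls are ignored for brevity. Note that nothing in the policy refers to reward or performs a search over future states.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]  # (row, col) on a hypothetical grid

# Assumed move set; the real environment may differ.
MOVES: Dict[str, Pos] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def manhattan(a: Pos, b: Pos) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_key_policy(agent: Pos, keys: List[Pos]) -> str:
    """Return the single move that most reduces distance to the nearest key.

    This is a one-step heuristic: no reward signal, no planning,
    no explicit objective being maximized over trajectories.
    """
    target = min(keys, key=lambda k: manhattan(agent, k))

    def dist_after(move: str) -> int:
        dr, dc = MOVES[move]
        return manhattan((agent[0] + dr, agent[1] + dc), target)

    return min(MOVES, key=dist_after)


action = nearest_key_policy((2, 2), [(2, 5), (0, 0)])
print(action)
```

A network trained by RL could implement something functionally like this lookup, which is one way to read the claim that the trained policy need not be optimizing anything, let alone its reward.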
From the comments, and from my memory of conversations at the time, many people disagreed with my framing. They disagreed even when I pointed out that humans don’t seem to be “optimizers” that select for actions that maximize our “reward function” (I believe the most common response was to deny the premise, and say that humans are actually roughly optimizers. Another common response was to say that AI is different for some reason.)
However, even though some people disagreed with this framing, not everyone did. As I pointed out, Rohin seemed to agree with me at the time, and so at the very least I think there is credible evidence that this insight was already known to a few people in the community by late 2019.
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
I have no stake in this debate, but how is this particular point any different than what Eliezer says when he makes the point about humans not optimizing for IGF? I think the entire mesaoptimization concern is built around this premise, no?
I didn’t mean to imply that you in particular didn’t understand the reward point, and I apologize for not writing my original comment more clearly in that respect. Out of nearly everyone on the site, I am most persuaded that you understood this “back in the day.”
I meant to communicate something like “I think the quoted segment from Rohin and Dmitrii’s post is incorrect and will reliably lead people to false beliefs.”
As I mentioned elsewhere (not this website) I don’t agree with “will reliably lead people to false beliefs”, if we’re talking about ML people rather than LW people (as was my audience for that blog post).
I do think that it’s a reasonable hypothesis to have, and I assign it more likelihood than I would have a year ago (in large part from you pushing some ML people on this point, and them not getting it as fast as I would have expected).
“Sure, Rohin thought that was a major problem, but we [our organization/thought cluster/ideological group] never agreed with him.”
Oh really? Did you ever explicitly highlight this particular disagreement at the time?
FWIW at the time I wasn’t working on value learning and wasn’t incredibly excited about work in that direction, despite the fact that that’s what the rest of my lab was primarily focussed on. I also wrote a blog post in 2020, based off a conversation I had with Rohin in 2018, where I mention how important it is to work on inner alignment stuff and how those issues got brought up by the ‘paranoid wing’ of AI alignment. My guess is that my view was something like “stuff like reward learning from the state of the world doesn’t seem super important to me because of inner alignment etc, but for all I know cool stuff will blossom out of it, so I’m happy to hear about your progress and try to offer constructive feedback”, and that I expressed that to Rohin in person.
It seems to me that often people rehearse fancy and cool-sounding reasons for believing roughly the same things they always believed, and comment threads don’t often change important beliefs. Feels more like people defensively explaining why they aren’t idiots, or why they don’t have to change their mind. I mean, if so—I get it, sometimes I feel that way too. But it sucks and I think it happens a lot.
My sense is that this is an inevitable consequence of low-bandwidth communication. I have no idea whether you’re referring to me or not, and I am really not saying you are doing so, but I think an interesting example (whether you’re referring to it or not) are some of the threads recently where we’ve been discussing deceptive alignment. My sense is that neither of us have been very persuaded by those conversations, and I claim that’s not very surprising, in a way that’s epistemically defensible for both of us. I’ve spent literal years working through the topic myself in great detail, so it would be very surprising if my view was easily swayed by a short comment chain—and similarly I expect that the same thing is true of you, where you’ve spent much more time thinking about this and have much more detailed thoughts than are easy to represent in a simple comment chain.
My long-standing position has been and continues to be that the only good medium of communication for this sort of stuff is direct, non-public, in-person communication. That being said, obviously that’s not always workable, and I do think that LessWrong is one of the least bad of all the bad options. Certainly I think it’s preferable to any of the other social media platforms on offer—you mention the broader AI community as not liking LessWrong, but I think they mostly use Twitter for this instead, which seems substantially worse on all of the axes that you criticize. My impression of the quality of AI discourse on Twitter on all sides of the AI safety debate has been very negative, with it mostly just rewarding cheap dunks and increasing polarization—e.g. I felt like I saw this a lot during the OpenAI fiasco. At least on LessWrong I think it is still sometimes possible for nuance to be rewarded rather than punished.
FWIW, LessWrong does seem—in at least one or two ways—saner than other communities of similar composition. I agree it’s better than Twitter overall. But in many ways it seems worse than other communities. I don’t know what to do about it, and to be honest I don’t have much faith in e.g. the mods.[1]
Hopefully my comments do something anyways, though. I do have some hope because it seems like a good amount has improved over the last year or two.
There’s a caveat here. It’s inevitable for communication that veers towards the emotional/subjective/sympathetic.
When the average writer tries to compress it down to a few hundred or thousand letters on a screen it does often seem ridiculous.
Even from moderately above average writers it often sounds more like anxious upper-middle-class virtue signalling than meaningful conversation.
I think it takes a really really clever writer to make it more substantial than that and escape the perception entirely.
On the other hand, discussions of purely objective topics, that are falsifiable and verifiable by independent third parties, don’t suffer the same pitfalls.
As long as you really know what you are talking about, or are willing to learn, even the below average writer can communicate just fine.
Why are you so focused on Eliezer/MIRI yourself? If you think you (or events in general) have adequately shown that their specific concerns are not worth worrying about, maybe turn your attention elsewhere for a bit? For example you could look into other general concerns about AI risk, or my specific concerns about AIs based on shard theory. I don’t think I’ve seen shard theory researchers address many of these yet.
When I trace the dependencies of common alignment beliefs and claims, a lot of them come back to e.g. RFLO and other ideas put forward by the MIRI cluster. Since I often find myself arguing against common alignment claims, I often argue against the historical causes of those ideas, which involves arguing against MIRI-takes.
I’m personally satisfied that their concerns are (generally) not worth worrying about. However, often people in my social circles are not. And such beliefs will probably have real-world consequences for governance.
Neargroup—I have a few friends who work at MIRI, and debate them on alignment ideas pretty often. I also sometimes work near MIRI people.
Because I disagree with them very sharply, their claims bother me more and are rendered more salient.
I feel bothered about MIRI still (AFAICT) getting so much funding/attention (even though it’s relatively lower than it used to be), because it seems to me that since e.g. 2016 they have released ~zero technical research that helps us align AI in the present or in the future. It’s been five years since they stopped disclosing any of their research, and it seems like no one else really cares anymore. That bothers me.
As to why I haven’t responded to e.g. your concerns in detail:
I currently don’t put much value on marginal theoretical research (even in shard theory, which I think is quite a bit better than other kinds of theory).
I feel less hopeful about LessWrong debate doing much, as I have described elsewhere. It feels like a better use of my time to put my head down, read a bunch of papers, and do good empirical work at GDM.
I am generally worn out of arguing about theory on the website, and have been since last December. (I will note that I have enjoyed our interactions and appreciated your contributions.)
Sounds like to the extent that you do have time/energy for theory, you might want to strategically reallocate your attention a bit? I get that you think a bunch of people are wrong and you’re worried about the consequences of that, but diminishing returns is a thing, and you could be too certain yourself (that MIRI concerns are definitely wrong).
And then empirical versus theory, how much do you worry about architectural changes obsoleting your empirical work? I noticed for example that in image generation GAN was recently replaced by latent diffusion, which probably made a lot of efforts to “control” GAN-based image generation useless.
That aside, “heads down empirical work” only makes sense if you picked a good general direction before putting your head down. Should it not worry people that shard theory researchers do not seem to have engaged with (or better yet, preemptively addressed) basic concerns/objections about their approach?
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)
For what it’s worth, I would be up for a dialogue or some other context where I can make concrete predictions. I do think it’s genuinely hard, since I do think there is a lot of masking of problems going on, and optimization pressure that makes problems harder to spot (both internally in AI systems and institutionally), so asking me to make predictions feels a bit like asking me to make predictions about FTX before it collapsed.
Like, yeah, I expect it to look great, until it explodes. Similarly I expect AI to look pretty great until it explodes. That seems like kind of a core part of the argument for difficulty for me.
I would nevertheless be happy to try to operationalize some bets, and still expect we would have lots of domains where we disagree, and would be happy to bet on those.
Like, yeah, I expect it to look great, until it explodes. Similarly I expect AI to look pretty great until it explodes. That seems like kind of a core part of the argument for difficulty for me.
If your hypothesis smears probability over a wider range of outcomes than mine, while I can more sharply predict events using my theory of how alignment works—that constitutes a Bayes-update towards my theory and away from yours. Right?
“Anything can happen before the explosion” is not a strength for a theory. It’s a vulnerability. If probability is better-concentrated by any other theories which make claims about both the present and the future of AI, then the noncommittal theory gets dropped.
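To make the arithmetic of that update concrete, here is a toy sketch. The four outcomes, the probabilities, and the 50/50 prior are all made up for illustration; they are not taken from either commenter's actual models.

```python
from fractions import Fraction

def posterior(prior_a, likelihood_a, likelihood_b):
    """Posterior probability of theory A after one observation (Bayes' rule)."""
    prior_b = 1 - prior_a
    evidence = prior_a * likelihood_a + prior_b * likelihood_b
    return prior_a * likelihood_a / evidence

# Hypothetical: four possible outcomes of some pre-takeoff experiment.
sharp_theory = {
    "a": Fraction(7, 10),
    "b": Fraction(1, 10),
    "c": Fraction(1, 10),
    "d": Fraction(1, 10),
}
# The noncommittal theory spreads its probability evenly.
smeared_theory = {outcome: Fraction(1, 4) for outcome in sharp_theory}

observed = "a"
p_sharp = posterior(Fraction(1, 2), sharp_theory[observed], smeared_theory[observed])
print(p_sharp)  # 14/19
```

The same arithmetic cuts the other way: had outcome "b" been observed, the sharp theory would have dropped from 1/2 to 2/7. Concentrating probability only pays off when the concentrated guess is right.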
Sure, yeah, though like, I don’t super understand. My model will probably make the same predictions as your model in the short term. So we both get equal Bayes points. The evidence that distinguishes our models seems further out, and in a territory where there is a decent chance that we will be dead, which sucks, but isn’t in any way contradictory with Bayes rule. I don’t think I would have put that much probability on us being dead at this point, so I don’t think that loses much of any bayes points. I agree that if we are still alive in 20-30 years, then that’s definitely bayes points, and I am happy to take that into account then, but I’ve never had timelines or models that predicted things to look that different from now (or like, where there were other world models that clearly predicted things much better).
My model will probably make the same predictions as your model in the short term.
No, I don’t think so. The model(s) I use for AGI risk are an outgrowth of the model I use for normal AI research, and so they make tons of detailed predictions. That’s why I have weekly fluctuations in my beliefs about alignment difficulty.
Overall question I’m interested in: What, if any, catastrophic risks are posed by advanced AI? By what mechanisms do they arise, and by what solutions can risks be addressed?
Making different predictions. The most extreme prediction of AI x-risk is that AI presents, well, an x-risk. But theories gain and lose points not just on their most extreme predictions, but on all their relevant predictions.
I have a bunch of uncertainty about how agentic/transformative systems will look, but I put at least 50% on “They’ll be some scaffolding + natural outgrowth of LLMs.” I’ll focus on that portion of my uncertainty in order to avoid meta-discussions on what to think of unknown future systems.
I don’t know what your model of AGI risk is, but I’m going to point to a cluster of adjacent models and memes which have been popular on LW and point out a bunch of predictions they make, and why I think my views tend to do far better.
Format:
Historical claim or meme relevant to models of AI ruin. [Exposition]
[Comparison of model predictions]
The historical value misspecification argument. Consider a model which involves the claim “it’s really laborious and fragile to specify complex human goals to systems, such that the systems actually do what you want.”
This model naturally predicts things like “it’s intractably hard/fragile to get GPT-4 to help people with stuff.” Sure, the model doesn’t predict this with probability 1, but it’s definitely an obvious prediction. (As an intuition pump, if we observed the above, we’d obviously update towards fragility/complexity of value; so since we don’t observe the above, we have to update away from that.)
My models involve things like “most of the system’s alignment properties will come from the training data” (and not e.g. from initialization or architecture), and also “there are a few SGD-plausible generalizations of any large dataset” and also “to first order, overparameterized LLMs generalize how a naive person would expect after seeing the training behavior” (IE “edge instantiation isn’t a big problem.”) Also “the reward model doesn’t have to be perfect or even that good in order to elicit desired behavior from the policy.” Also noticing that DL just generalizes really well, despite classical statistical learning theory pointing out that almost all expressive models will misgeneralize!
(All of these models offer testable predictions!)
So overall, I think the second view predicts reality much more strongly than the first view.
It’s important to make large philosophical progress on an AI reasoning about its own future outputs. In Constitutional AI, an English-language “constitutional principle” (like “Be nice”) is chosen for each potential future training datapoint. The LLM then considers whether the datapoint is in line with that constitutional principle. The datapoint is later trained on if and only if the LLM concludes that the datapoint accords with the principle. The AI is, in effect, reasoning about its future training process, which will affect its future cognition.
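The filtering step as described in that paragraph can be sketched roughly as follows. This is a simplification of the description above, not the actual Constitutional AI training pipeline; `filter_by_principles` and `toy_judge` are hypothetical names, and the judge is a stand-in for querying the LLM.

```python
import random

def filter_by_principles(datapoints, principles, judge, rng=None):
    """Keep a datapoint iff the judge says it accords with its sampled principle."""
    rng = rng or random.Random(0)
    kept = []
    for datapoint in datapoints:
        # One principle is chosen per potential training datapoint.
        principle = rng.choice(principles)
        if judge(datapoint, principle):
            kept.append(datapoint)
    return kept

# Hypothetical stand-in for the LLM's judgment; a real setup would query the model.
def toy_judge(datapoint, principle):
    return "insult" not in datapoint

candidates = ["a helpful answer", "an insult", "a polite refusal"]
training_set = filter_by_principles(candidates, ["Be nice"], toy_judge)
print(training_set)  # ['a helpful answer', 'a polite refusal']
```

The reflective part is that the judge and the model later trained on `training_set` are the same system, so its present judgments shape its future cognition.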
The above “embedded agency=hard/confusing” model would naturally predict that reflection is hard and that we’d need to put in a lot of work to solve the “reflection problem.” While this setup is obviously a simple, crude form of reflection, it’s still valid. Therefore, the model predicts with increased confidence that constitutional AI would go poorly. But… Constitutional AI worked pretty well! RL from AI feedback also works well! There are a bunch of nice self-supervised alignment-boosting methods (one recent one I read about is RAIN).
One reason this matters: Under the “AGI from scaffolded super-LLMs” model, the scaffolding will probably prompt the LLM to evaluate its own plans. If we observe that current models do a good job of self-evaluation,[1] that’s strong evidence that future models will too. If strong models do a good job of moral and accurate self-evaluation, that decreases the chance that the future AI will execute immoral / bad plans.
I expect AIs to do very well here because AIs will reliably pick up a lot of nice “values” from the training corpus. Empirically that seems to happen, and theoretically you’d get some of the way there from natural abstractions + “there are a few meaningful generalizations” + “if you train the AI to do thing X when you prompt it, it will do thing X when prompted.”
Intelligence is a “package deal” / tool AI won’t work well / intelligence comes in service of goals. There isn’t a way to take AIXI and lop off the “dangerous capabilities” part of the algorithm and then have an AI which can still do clever stuff on your behalf. It’s all part of the argmax, which holds both the promise and peril of these (unrealistic) AIXI agents. Is this true for LLMs?
So I think the “intelligence is a package deal” philosophy isn’t holding up that great. (And we had an in-person conversation where I had predicted these steering vector results, and you had expected the opposite.)
The steering vectors were in fact derived using shard theory reasoning (about activating certain shards more or less strongly by adding a direction to the latent space). So this is a strong prediction of my models.
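For readers who haven't seen the technique, "adding a direction to the latent space" looks roughly like the sketch below. The activations are toy 3-dimensional lists, and computing the direction as a difference of activations on two contrasting inputs is an assumption made for the example; the comment itself only says a direction is added.

```python
def steering_vector(acts_plus, acts_minus):
    """Direction pointing from the 'minus' input's activations to the 'plus' input's."""
    return [p - m for p, m in zip(acts_plus, acts_minus)]

def add_steering(activations, vector, coeff):
    """Add a scaled direction to one layer's activations."""
    return [a + coeff * v for a, v in zip(activations, vector)]

# Toy activations; real ones would be read from a hidden layer of the network.
acts_plus = [1.0, 0.0, 2.0]
acts_minus = [0.0, 0.0, 1.0]
vec = steering_vector(acts_plus, acts_minus)

# At inference time, shift the layer's activations along the direction.
steered = add_steering([0.5, 0.5, 0.5], vec, coeff=2.0)
print(steered)  # [2.5, 0.5, 2.5]
```

A positive `coeff` pushes the forward pass toward the "plus" behavior and a negative one pushes away from it, without retraining any weights.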
If intelligence isn’t a package deal, then tool AI becomes far more technically probable (but still maybe not commercially probable). This means we can maybe extract reasonably consequentialist reasoning with “deontological compulsions” against e.g. powerseeking, and have that make the AI agent not want to seek power.
Under certain training assumptions (which are likely to be met by future systems but not present systems), and by default for all powerful systems we expect to know how to build, the AI will develop internal goals which it pursues ~coherently across situations.[2] (This would be a knock against smart tool AI.)
Risks from Learned Optimization posited that a “simple” way to “do well in training” is to learn a unified goal and then a bunch of generalized machinery to achieve that goal. This model naturally predicts that when you train overparameterized networks on a wide range of tasks, then . Even if that network isn’t an AGI.
That’s a misprediction of the “unified motivations are simple” frame; if we have the theoretical precision to describe the simplicity biases of unknown future systems, that model should crank out good predictions for modern systems too.
I’m happy to bet on any additional experiments related to the above.
There are probably a bunch of other things, and I might come back with more, but I’m getting tired of writing this comment. The main point is that common components of threat models regularly make meaningful mispredictions. My models often do better (though once I misread some data and strongly updated against my models, so I think I’m amenable to counterevidence here). Therefore, I’m able to refine my models of AGI risk. I certainly don’t think we’re in the dark and unable to find experimental evidence.
I expect you to basically disagree about future AI being a separate magisterium or something, but I don’t know why that’d be true.
Often the claimed causes of future doom imply models which make pre-takeoff predictions, as shown above (e.g. fragility of value). But even if your model doesn’t make pre-takeoff predictions… Since my model is unified for both present and future AI, I can keep gaining Bayes points and refining my model! This happens whether or not your model makes predictions here. This is useful insofar as the observations I’m updating on actually update me on mechanisms in my model which are relevant for AGI alignment.
If you think I just listed a bunch of irrelevant stuff, well… I guess I super disagree! But I’ll keep updating anyways.
The Emulated Finetuning paper found that GPT-4 is superhuman at grading helpfulness/harmlessness. In the cases of disagreements between GPT-4 and humans, a more careful analysis revealed that 80% of the time the disagreement was caused by errors in the human judgment, rather than GPT-4’s analysis.
This model naturally predicts things like “it’s intractably hard/fragile to get GPT-4 to help people with stuff.” Sure, the model doesn’t predict this with probability 1, but it’s definitely an obvious prediction.
Another point is that I think GPT-4 straightforwardly implies that various naive supervision techniques work pretty well. Let me explain.
From the perspective of 2019, it was plausible to me that getting GPT-4-level behavioral alignment would have been pretty hard, and might have needed something like AI safety via debate or other proposals that people had at the time. The claim here is not that we would never reach GPT-4-level alignment abilities before the end, but rather that a lot of conceptual and empirical work would be needed in order to get models to:
Reliably perform tasks how I intended as opposed to what I literally asked for
Have negligible negative side effects on the world in the course of its operation
Responsibly handle unexpected ethical dilemmas in a way that is human-reasonable
Well, to the surprise of my 2019-self, it turns out that naive RLHF with a cautious supervisor designing the reward model seems basically sufficient to do all of these things in a reasonably adequate way. That doesn’t mean that RLHF scales all the way to superintelligence, but it’s very significant nonetheless and interesting that it scales as far as it does.
You might think “why does this matter? We know RLHF will break down at some point” but I think that’s missing the point. Suppose right now, you learned that RLHF scales reasonably well all the way to John von Neumann-level AI. Or, even more boldly, say, you learned it scaled to 20 IQ points past John von Neumann. 100 points? Are you saying you wouldn’t update even a little bit on that knowledge?
The point at which RLHF breaks down is enormously important to overall alignment difficulty. If it breaks down at some point before the human range, that would be terrible IMO. If it breaks down at some point past the human range, that would be great. To see why, consider that if RLHF breaks down at some point past the human range, that implies that we could build aligned human-level AIs, who could then help us align slightly smarter AIs!
If you’re not updating at all on observations about when RLHF breaks down, then you probably either (1) think it doesn’t matter when RLHF breaks down, or (2) you already knew in advance exactly when it would break down. I think position 1 is just straight-up unreasonable, and I’m highly skeptical of most people who claim position 2. This basic perspective is a large part of why I’m making such a fuss about how people should update on current observations.
On the other hand, it could be considered bad news that IDA/Debate/etc. haven’t been deployed yet, or even that RLHF is (at least apparently) working as well as it is. To quote a 2017 post by Paul Christiano (later reposted in 2018 and 2019):
As in the previous sections, it’s easy to be too optimistic about exactly when a non-scalable alignment scheme will break down. It’s much easier to keep ourselves honest if we actually hold ourselves to producing scalable systems.
It seems that AI labs are not yet actually holding themselves to producing scalable systems, and it may well be better if RLHF broke down in some obvious way before we reach potentially dangerous capabilities, to force them to do that.
(I’ve pointed Paul to this thread to get his own take, but haven’t gotten a response yet.)
ETA: I should also note that there is a lot of debate about whether IDA and Debate are actually scalable or not, so some could consider even deployment of IDA or Debate (or these techniques appearing to work well) to be bad news. I’ve tended to argue on the “they are too risky” side in the past, but am conflicted because maybe they are just the best that we can realistically hope for and at least an improvement over RLHF?
I think these methods are pretty clearly not indefinitely scalable, but they might be pretty scalable. E.g., perhaps scalable to somewhat smarter than human level AI. See the ELK report for more discussion on why these methods aren’t indefinitely scalable.
A while ago, I think Paul had maybe 50% that with simple-ish tweaks IDA could be literally indefinitely scalable. (I’m not aware of an online source for this, but I’m pretty confident this or something similar is true.) IMO, this seems very predictably wrong.
TBC, I don’t think we should necessarily care very much about whether a method is indefinitely scalable.
Sometimes people do seem to think that debate or IDA could be indefinitely scalable, but this just seems pretty wrong to me (what is your debate about alphafold going to look like...).
I’ve been struggling with whether to upvote or downvote this comment btw. I think the point about how it’s really important when RLHF breaks down and more attention needs to be paid to this is great. But the other point about how RLHF hasn’t broken yet and this is evidence against the standard misalignment stories is very wrong IMO. For now I’ll neither upvote nor downvote.
I agree that if RLHF scaled all the way to von neumann then we’d probably be fine. I agree that the point at which RLHF breaks down is enormously important to overall alignment difficulty.
I think if you had described to me in 2019 how GPT4 was trained, I would have correctly predicted its current qualitative behavior. I would not have said that it would do 1, 2, or 3 to a greater extent than it currently does.
I’m in neither category (1) or (2); it’s a false dichotomy.
I’m in neither category (1) or (2); it’s a false dichotomy.
The categories were conditioned on whether you’re “not updating at all on observations about when RLHF breaks down”. Assuming you are updating, then I think you’re not really the type of person who I’m responding to in my original comment.
But if you’re not updating, or aren’t updating significantly, then perhaps you can predict now when you expect RLHF to “break down”? Is there some specific prediction that you would feel comfortable making at this time, such that we could look back on this conversation in 2-10 years and say “huh, he really knew broadly what would happen in the future, specifically re: when alignment would start getting hard”?
(The caveat here is that I’d be kind of disappointed by an answer like “RLHF will break down at superintelligence” since, well, yeah, duh. And that would not be very specific.)
I’m not updating significantly because things have gone basically exactly as I expected.
As for when RLHF will break down, two points:
(1) I’m not sure, but I expect it to happen for highly situationally aware, highly agentic opaque systems. Our current systems like GPT4 are opaque but not very agentic and their level of situational awareness is probably medium. (Also: This is not a special me-take. This is basically the standard take, no? I feel like this is what Risks from Learned Optimization predicts too.)
(2) When it breaks down I do not expect it to look like the failures you described—e.g. it stupidly carries out your requests to the letter and ignores their spirit, and thus makes a fool of itself and is generally thought to be a bad chatbot. Why would it fail in that way? That would be stupid. It’s not stupid.
(Related question: I’m pretty sure on r/chatgpt you can find examples of all three failures. They just don’t happen often enough, and visibly enough, to be a serious problem. Is this also your understanding? When you say these kinds of failures don’t happen, you mean they don’t happen frequently enough to make ChatGPT a bad chatbot?)
Re: Elaborating: Sure, happy to, but not sure where to begin. All of this has been explained before, e.g. in Ajeya’s Training Game report. Also Joe Carlsmith’s thing. Also the original mesaoptimizers paper, though I guess it didn’t talk about situational awareness idk. Would you like me to say more about what situational awareness is, or what agency is, or why I think both of those together are big risk factors for RLHF breaking down?
From a technical perspective I’m not certain if Direct Preference Optimization is theoretically that much different from RLHF beyond being much quicker and lower friction at what it does, but so far it seems like it has some notable performance gains over RLHF in ways that might indicate a qualitative difference in effectiveness. Running a local model with a bit of light DPO training feels more intent-aligned compared to its non-DPO brethren in a pretty meaningful way. So I’d probably be considering also how DPO scales, at this point. If there is a big theoretical difference, it’s likely in not training a separate model, and removing whatever friction or loss of potential performance that causes.
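For reference, the per-pair DPO objective can be written in a few lines, which makes the "no separate reward model" point visible: the loss only needs log-probabilities from the policy being trained and from a frozen reference model. The function name and the numbers below are made up for illustration.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: no preference expressed yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy matches the reference the loss is log 2, and it falls as the policy raises the chosen response's log-probability relative to the rejected one. Whether this leads to qualitatively better results than RLHF at scale is the open question raised above; the sketch only shows the mechanics.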
In fact, it had redundant internal representations of the goal square! Due to how CNNs work, that should be literally meaningless!
What does this mean? I don’t know as much about CNNs as you—are you saying that their architecture allows for the reuse of internal representations, such that redundancy should never arise? Or are you saying that the goal square shouldn’t be representable by this architecture?
If your hypothesis smears probability over a wider range of outcomes than mine, while I can more sharply predict events using my theory of how alignment works—that constitutes a Bayes-update towards my theory and away from yours.
There is a reference class judgement in this. If I have a theory of good moves in Go (and absently dabble in chess a little bit), while you have a great theory of chess, looking at some move in chess shouldn’t lead to a Bayes-update against ability of my theory to reason about Go. The scope of classical alignment worries is typically about the post-AGI situation. If it manages to say something uninformed about the pre-AGI situation, that’s something out of its natural scope, and shouldn’t be meaningful evidence either way.
I think the correct way of defeating classical alignment worries (about the post-AGI situation) is on priors, looking at the arguments themselves, not on observations where the theory doesn’t expect to have clear or good predictions (and empirically doesn’t). If the arguments appear weak, there is no recourse without observation of the post-AGI world, it remains weak at least until then. Even if it happened to have made good predictions about the current situation, it shouldn’t count in its favor.
If your hypothesis smears probability over a wider range of outcomes than mine, while I can more sharply predict events using my theory of how alignment works—that constitutes a Bayes-update towards my theory and away from yours. Right?
He didn’t say “anything can happen before AI explodes”. He said “I expect AI to look pretty great until it explodes.” And he didn’t say that his model about AGI safety generated that prediction; maybe his model about AGI safety generates some long-run predictions and then he’s using other models to make the “look pretty great” prediction.
“Anything can happen before the explosion” is not a strength for a theory.
This is why I hate a lot of mathematical universe hypothesis/simulation hypothesis discourse, since they both predict anything, which is not a strength for these theories, even though I do think they’re true, they’re just too trivial as theories to work.
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI.
Without commenting on how often people do or don’t bet, I think overall betting is great and I’d love to see more of it!
I’m also excited how much of it I’ve seen since Manifold started gaining traction. So I’d like to give a shout out to LessWrong users who are active on Manifold, in particular on AI questions. Some I’ve seen are:
There are definitely more folks than this: feel free to mention more folks in the comments who you want to give kudos to (though please don’t dox anyone whose name on either platform is pseudonymous and doesn’t match the other).
Yeah, I’m not really happy with the state of discourse on this matter either.
I think it’s not a coincidence that many of the “canonical alignment ideas” somehow don’t make any testable predictions until AI takeoff has begun. 🤔
As a proponent of an AI-risk model that does this, I acknowledge that this is an issue, and I indeed feel pretty defensive on this point. Mainly because, as @habryka pointed out and as I’d outlined before, I think there are legitimate reasons to expect no blatant evidence until it’s too late, and indeed, that’s the whole reason AI risk is such a problem. As was repeatedly stated.
So all these moves to demand immediate well-operationalized bets read a bit like tactical social attacks that are being unintentionally launched by people who ought to know better, which are effectively exploiting the territory-level insidious nature of the problem to undermine attempts to combat it, by painting the people pointing out the problem as blind believers. Like challenges that you’re set up to lose if you take them on, but which make you look bad if you turn them down.
And the above, of course, may read exactly like a defense attempt a particularly self-aware blind believer might construct. Which doesn’t inspire much self-doubt in me[1], but it does make me feel like I’m– no, not like I’m sailing against the winds of counterevidence – like I’m playing the social game on the side that’s poised to lose it in the long run, so I should switch up to the winning side to maximize my status, even if its position is wrong.
I’m somewhat hopeful about navigating to some concrete empirical or mathematical evidence within the next couple years. But in the meanwhile, yeah, discussing the matter just makes me feel weary and tired.
(Edit, because I’m concerned I’d been too subtle there: I am not accusing anyone, and especially not @TurnTrout, of deliberately employing social tactics to undermine their opponents rather than cooperatively seeking the truth. I’m only saying that the (usually extremely reasonable) requests for well-operationalized bets effectively have this result in this particular case.
Neither am I suggesting that the position I’m defending should be immune to criticism. Empirical evidence easily tied to well-operationalized bets is usually an excellent way to resolve disagreements and establish truth. But it’s not the only one, and it just so happens that this specific position can’t field many good predictions in this field.)
Your post defending the least forgiving take on alignment basically relies on a sharp/binary property of AGI, and IMO a pretty large crux is that either this property probably doesn’t exist, or if it does exist, it is not universal, and IMO I think tends to be overused.
To be clear, I’m increasingly agreeing with a weak version of the hypothesis, and I also think you are somewhat correct, but IMO I don’t think your stronger hypothesis is correct, and I think that the lesson of AI progress is that it’s less sharp the more tasks you want, and the more general intelligence you want, which is in opposition to your hypothesis on AI progress being sharp.
But in the meanwhile, yeah, discussing the matter just makes me feel weary and tired.
I actually kinda agree with you here, but unfortunately, this is very, very important, since your allies are trying to gain real-life political power over AI, and given this is extremely impactful, it is basically required for us to discuss it.
I think that the lesson of AI progress is that it’s less sharp the more tasks you want, and the more general intelligence you want
There’s a bit of “one man’s modus ponens is another’s modus tollens” going on. I assume that when you look at a new AI model, and see how it’s not doing instrumental convergence/value reflection/whatever, you interpret it as evidence against “canonical” alignment views. I interpret it as evidence that it’s not AGI yet; or sometimes, even evidence that this whole line of research isn’t AGI-complete.
E. g., I’ve updated all the way on this in the case of LLMs. I think you can scale them a thousandfold, and it won’t give you AGI. I’m mostly in favour of doing that, too, or at least fully realizing the potential of the products already developed. Probably same for Gemini and Q*. Cool tech. (Well, there are totalitarianism concerns, I suppose.)
I also basically agree with all the takes in the recent “AI is easy to control” post. But what I take from it isn’t “AI is safe”, it’s “the current training methods aren’t gonna give you AGI”. Because if you put a human – the only known type of entity with the kinds of cognitive capabilities we’re worrying about – into a situation isomorphic to a DL AI’s, the human would exhibit all the issues we’re worrying about.
Like, just because something has a label of “AI” and is technically an AI doesn’t mean studying it can give you lessons about “AGI”, the scary lightcone-eating thing all the fuss is about, yeah? Any more than studying GOFAI FPS bots is going to teach you lessons about how LLMs work?
And that the Deep Learning paradigm can probably scale to AGI doesn’t mean that studying the intermediary artefacts it’s currently producing can teach us much about the AGI it’ll eventually spit out. Any more than studying a MNIST-classifier CNN can teach you much about LLMs; any more than studying squirrel neurology can teach you much about winning moral-philosophy debates.
That’s basically where I’m at. LLMs and similar systems are just in the entirely wrong reference class for studying “generally intelligent”/scary systems.
> Any more than studying GOFAI FPS bots is going to teach you lessons about how LLMs work?
No, but my point here is that once we increase the complexity of the domain, and require more tasks to be done, things start to smooth over, and we don’t see nearly as sharp a jump.
I suspect a big part of that is the effects of Amdahl’s law kicking in combined with Baumol’s cost disease and power law scaling, which means you are always bottlenecked on the least automatable and doable tasks, so improvements in one area like Go don’t exactly matter as much as you think.
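As a rough illustration of the Amdahl’s law point (a minimal sketch, not something from the original comment): if only a fraction of the relevant tasks can be sped up, the overall speedup is capped by the part that can’t. The function name and the 90% figure below are hypothetical, chosen only for illustration.

```python
def amdahl_speedup(automatable_fraction: float, task_speedup: float) -> float:
    """Overall speedup when only part of the work gets faster (Amdahl's law)."""
    if not 0.0 <= automatable_fraction <= 1.0:
        raise ValueError("automatable_fraction must be between 0 and 1")
    if task_speedup <= 0.0:
        raise ValueError("task_speedup must be positive")
    # Time is normalized to 1.0: the non-automatable part is unchanged,
    # the automatable part shrinks by task_speedup.
    remaining = 1.0 - automatable_fraction
    return 1.0 / (remaining + automatable_fraction / task_speedup)


# Hypothetical scenario: 90% of the tasks can be automated, 10% cannot.
for speedup in (10.0, 1_000.0, 1_000_000.0):
    print(f"automatable part {speedup:>9.0f}x faster -> "
          f"overall {amdahl_speedup(0.9, speedup):.3f}x")
```

Under these assumed numbers, making the automatable 90% a million times faster still yields less than a 10x overall speedup, which is the sense in which the least automatable tasks dominate.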
I’d say the main lesson of AI progress, one that might even have been formulatable in the 1970s-1980s days, is that compute and data were the biggest factors, by a wide margin, and these grow smoothly. Only now are algorithms starting to play a role, and even then, it’s only because transformers turn out to be fairly terrible at generalizing or doing stuff, which is related to your claim about LLMs being not real AGI, but I think this effect is weaker than you think, and I’m sympathetic to the continuous view as well. There probably will be some discontinuities, but IMO LWers have fairly drastically overstated how discontinuous progress was, especially if we realize that a lot of the outliers were likely simpler than the real world. (Though Go comes close to it, at least for its domain, the problem is that the domain is far too small to matter.)
> I assume that when you look at a new AI model, and see how it’s not doing instrumental convergence/value reflection/whatever, you interpret it as evidence against “canonical” alignment views. I interpret it as evidence that it’s not AGI yet; or sometimes, even evidence that this whole line of research isn’t AGI-complete.
I think this roughly tracks how we updated, though there was a brief phase where I became more pessimistic as I learned that LLMs probably weren’t going to scale to AGI, which broke a few of my alignment plans, but I found other reasons to be more optimistic that didn’t depend on LLMs nearly as much.
While I think it’s fine enough to update towards “it’s not going to have any impact on anything, and that’s the reason it’s safe”, I worry that this is basically defining away the possibility of safety, and thus making the model useless:
> I interpret it as evidence that it’s not AGI yet; or sometimes, even evidence that this whole line of research isn’t AGI-complete.
> Because if you put a human – the only known type of entity with the kinds of cognitive capabilities we’re worrying about – into a situation isomorphic to a DL AI’s, the human would exhibit all the issues we’re worrying about.
I basically disagree entirely with that, and I’m extremely surprised you claimed that. If we grant that we get the same circumstances to control humans as we can do for DL AIs, then alignment becomes basically trivial in my view, since human control research would have way better ability to study humans, and in particular there is no IRB/FDA or regulation to control you, which would be huge changes to how science basically works today. It may take a lot of brute force work, but I think it basically becomes trivial to align human beings if humans could be put into a situation isomorphic to DL AIs.
> I’d say the main lesson of AI progress, one that might even have been formulatable in the 1970s-1980s days, is that compute and data were the biggest factors
As far as producing algorithms that are able to, once trained on a vast dataset of [A, B] samples, interpolate a valid completion B for an arbitrary prompt sampled from the distribution of A? Yes, for sure.
As far as producing something that can genuinely generalize off-distribution, strike way outside the boundaries of interpolation? Jury’s still out.
Like, I think my update on all the LLM stuff is “boy, who knew interpolation can get you this far?”. The concept-space sure turned out to have a lot of intricate structure that could be exploited via pure brute force.
> I basically disagree entirely with that, and I’m extremely surprised you claimed that
Oh, I didn’t mean “if we could hook up a flesh-and-blood human (or a human upload) to the same sort of cognition-shaping setup as we subject our AIs to”. I meant “if the forward-pass of an LLM secretly simulated a human tasked with figuring out what token to output next”, but without the ML researchers being aware that it’s what’s going on, and with them still interacting with the thing as with a token-predictor. It’s a more literal interpretation of the thing sometimes called an “inner homunculus”.
I’m well aware that the LLM training procedure is never going to result in that. I’m just saying that if it did, and if the inner homunculus became smart enough, that’d cause all the deceptive-alignment/inner-misalignment/wrapper-mind issues. And that if you’re not modeling the AI as being/having a homunculus, you’re not thinking about an AGI, so it’s no wonder the canonical AI-risk arguments fail for that system and it’s no wonder it’s basically safe.
> As far as producing algorithms that are able to, once trained on a vast dataset of [A, B] samples, interpolate a valid completion B for an arbitrary prompt sampled from the distribution of A? Yes, for sure.
I’d say this still applies even to non-LLM architectures like RL, which is the important part, but Jacob Cannell and 1a3orn will have to clarify.
> As far as producing something that can genuinely generalize off-distribution, strike way outside the boundaries of interpolation? Jury’s still out.
I agree, but with a caveat, in that I think we do have enough evidence to rule out extreme importance on algorithms, à la Eliezer, and compute is not negligible. Epoch estimates a 50/50 split between compute and algorithmic progress being important. Algorithmic progress will likely matter IMO, just not nearly as much as some LWers think it will.
> Like, I think my update on all the LLM stuff is “boy, who knew interpolation can get you this far?”. The concept-space sure turned out to have a lot of intricate structure that could be exploited via pure brute force.
I definitely updated something in this direction, which is important, but I now think the AI optimist arguments are general enough to not rely on LLMs, and sometimes not even relying on a model of what future AI will look like beyond the fact that capabilities will grow, and people expect to profit from it.
> I’m just saying that if it did, and if the inner homunculus became smart enough, that’d cause all the deceptive-alignment/inner-misalignment/wrapper-mind issues.
Not automatically, and there are potential paths to AGI like Steven Byrnes’s path to Brain-like AGI that either outright avoid deceptive alignment altogether or make it far easier to solve (the short answer is that Steven Byrnes suspects there’s a simple generator of value, so simple that it’s dozens of lines long and if that’s the case, then the corrigible alignment/value learning agent’s simplicity gap is either 0, negative, or a very small positive gap, so small that very little data is required to pick out the honest value learning agent over the deceptive aligned agent, and we have a lot of data on human values, so this is likely to be pretty easy.)
> And that if you’re not modeling the AI as being/having a homunculus, you’re not thinking about an AGI,
I think a crux is that I think that AIs will basically always have much more white-boxness to them than any human mind, and I think that for a lot of future paradigms of AI, including the ones that scale to superintelligence, the point that AI control research is easier will still mostly be true, especially since I think AI control is fundamentally very profitable and AIs have no legal rights/IRB boards to slow down control research.
> I agree, but with a caveat, in that I think we do have enough evidence to rule out extreme importance on algorithms
Mm, I think the “algorithms vs. compute” distinction here doesn’t quite cleave reality at its joints. Much as I talked about interpolation before, it’s a pretty abstract kind of interpolation: LLMs don’t literally memorize the data points, their interpolation relies on compact generative algorithms they learn (but which, I argue, are basically still bounded by the variance in the data points they’ve been shown). The problem of machine learning, then, is in finding some architecture + training-loop setup that would, over the course of training, move the ML model towards implementing some high-performance cognitive algorithms.
It’s dramatically easier than hard-coding the algorithms by hand, yes, and the learning algorithms we do code are very simple. But you still need to figure out in which direction to “push” your model first. (Pretty sure if you threw 2023 levels of compute at a Very Deep fully-connected NN, it won’t match a modern LLM’s performance, won’t even come close.)
So algorithms do matter. It’s just our way of picking the right algorithms consists of figuring out the right search procedure for these algorithms, then throwing as much compute as we can at it.
So that’s where, I would argue, the sharp left turn would lie. Not in-training, when a model’s loss suddenly drops as it “groks” general intelligence. (Although that too might happen.) It would happen when the distributed optimization process of ML researchers tinkering with training loops stumbles upon a training setup that actually pushes the ML model in the direction of the basin of general intelligence. And then that model, once scaled up enough, would suddenly generalize far off-distribution. (Indeed, that’s basically what happened in the human case: the distributed optimization process of evolution searched over training architectures, and eventually stumbled upon one that was able to bootstrap itself into taking off. The “main” sharp left turn happens during the architecture search, not during the training.)
And I’m reasonably sure we’re in an agency overhang, meaning that the newborn GI would pass human intelligence in an eye-blink. (And if it won’t, it’ll likely stall at incredibly unimpressive sub-human levels, so the ML researchers will keep tinkering with the training setups until finding one that does send it over the edge. And there’s no reason whatsoever to expect it to stall again at the human level, instead of way overshooting it.)
> we have a lot of data on human values
Which human’s values? IMO, “the AI will fall into the basin of human values” is kind of a weird reassurance, given the sheer diversity of human values – diversity that very much includes xenophobia, genocide, and petty vengeance scaled up to geopolitical scales. And stuff like RLHF designed to fit the aesthetics of modern corporations doesn’t result in deeply thoughtful cosmopolitan philosophers – it results in sycophants concerned with PR as much as with human lives, and sometimes (presumably when not properly adapted to a new model’s scale) in high-strung yanderes.
Let’s grant the premise that the AGI’s values will be restricted to the human range (which I don’t really buy). If the quality of the sample within the human range that we pick will be as good as what GPT-4/Sydney’s masks appeared to be? Yeah, I don’t expect humans to stick around for a while after.
> Indeed, that’s basically what happened in the human case: the distributed optimization process of evolution searched over training architectures, and eventually stumbled upon one that was able to bootstrap itself into taking off.
Actually I think the evidence is fairly conclusive that the human brain is a standard primate brain with the only change being merely a few compute scale dials increased (the number of distinct gene changes is tiny, something like 12 from what I recall). There is really nothing special about the human brain other than 1.) 3x larger than expected size, and 2.) extended neoteny (longer training cycle). Neuroscientists have looked extensively for other ‘secret sauce’ and we now have some confidence in a null result: no secret sauce, just much more training compute.
Yes, but: whales and elephants have brains several times the size of humans, and they’re yet to build an industrial civilization. I agree that hitting upon the right architecture isn’t sufficient, you also need to scale it up – but scale alone doesn’t suffice either. You need a combination of scale, and an architecture + training process that would actually transmute the greater scale into more powerful cognitive algorithms.
Evolution stumbled upon the human/primate template brain. One of the forks of that template somehow “took off” in the sense of starting to furiously select for larger brain size. Then, once a certain compute threshold was reached, it took a sharp left turn and started a civilization.
The ML-paradigm analogue would, likewise, involve researchers stumbling upon an architecture that works well at some small scales and has good returns on compute. They’ll then scale it up as far as it’d go, as they’re wont to. The result of that training run would spit out an AGI, not a mere bundle of sophisticated heuristics.
And we have no guarantees that the practical capabilities of that AGI would be human-level, as opposed to vastly superhuman.
(Or vastly subhuman. But if the maximum-scale training run produces a vastly subhuman AGI, the researchers would presumably go back to the drawing board, and tinker with the architectures until they selected for algorithms with better returns on intelligence per FLOPS. There are likewise no guarantees that this higher-level selection process would somehow result in an AGI of around human level, rather than vastly overshooting it the first time they properly scale it up.)
> Yes, but: whales and elephants have brains several times the size of humans, and they’re yet to build an industrial civilization.
Size/capacity isn’t all, but in terms of the capacity which actually matters (synaptic count, and upper cortical neuron count) - from what I recall elephants are at great ape cortical capacity, not human capacity. A few specific species of whales may be at or above human cortical neuron capacity but synaptic density was still somewhat unresolved last I looked.
> Then, once a certain compute threshold was reached, it took a sharp left turn and started a civilization.
Human language/culture is more the cause of our brain expansion, not just the consequence. The human brain is impressive because of its relative size and oversized cost to the human body. Elephants/whales are huge and their brains are much smaller and cheaper comparatively. Our brains grew 3x too large/expensive because it was valuable to do so. Evolution didn’t suddenly discover some new brain architecture or trick (it already had that long ago). Instead there were a number of simultaneous whole body coadaptations required for larger brains and linguistic technoculture to take off: opposable thumbs, expressive vocal cords, externalized fermentation (gut is as energetically expensive as brain tissue, so something had to go), and yes larger brains, etc.
Language enabled a metasystems transition similar to the origin of multicellular life. Tribes formed as new organisms by linking brains through language/culture. This is not entirely unprecedented: insects are also social organisms of course, but their tiny brains aren’t large enough for interesting world models. The resulting new human social organisms had intergenerational memory that grew nearly unbounded with time and creative search capacity that scaled with tribe size.
You can separate intelligence into world model knowledge (crystal intelligence) and search/planning/creativity (fluid intelligence). Humans are absolutely not special in our fluid intelligence—it is just what you’d expect for a large primate brain. Humans raised completely without language are not especially more intelligent than animals. All of our intellectual super powers are cultural. Just as each cell can store the DNA knowledge of the entire organism, each human mind ‘cell’ can store a compressed version of much of human knowledge and gains the benefits thereof.
The cultural metasystems transition which is solely responsible for our intellectual capability is a one-time qualitative shift that will never reoccur. AI will not undergo the same transition; that isn’t how these work. The main advantage of digital minds is just speed, and to a lesser extent, copying.
> I’d say this still applies even to non-LLM architectures like RL, which is the important part, but Jacob Cannell and 1a3orn will have to clarify.
We’ve basically known how to create AGI for at least a decade. AIXI outlines the 3 main components: a predictive world model, a planning engine, and a critic. The brain also clearly has these 3 main components, and even somewhat cleanly separated into modules—that’s been clear for a while.
Transformer LLMs are pretty much exactly the type of generic minimal ULM arch I was pointing at in that post (though I obviously couldn’t predict the name). On a compute scaling basis GPT4 training at 1e25 flops uses perhaps a bit more than human brain training, and it’s clearly not quite AGI, but mainly because it’s mostly just a world model with a bit of critic: planning is still missing. But its capabilities are reasonably impressive given that the architecture is more constrained than a hypothetical more directly brain equivalent fast-weight RNN of similar size.
Anyway I don’t quite agree with the characterization that these models are just “interpolating valid completions of any arbitrary prompt sampled from the distribution”. Human intelligence also varies widely on a spectrum with tradeoffs between memorization and creativity. Current LLMs mostly aren’t as creative as the more creative humans and are more impressive in breadth of knowledge, but eh part of that could be simply that they currently completely lack the component essential for creativity? That they accomplish so much without planning/search is impressive.
> the short answer is that Steven Byrnes suspects there’s a simple generator of value, so simple that it’s dozens of lines long and if that’s the case,
Interestingly that is closer to my position and I thought that Byrnes thought the generator of value was somewhat more complex, although our views are admittedly fairly similar in general.
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)
This paragraph doesn’t seem like an honest summary to me. Eliezer’s position in the dialogue, as I understood it, was:
The journey is a lot harder to predict than the destination. Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”. Eliezer isn’t claiming to have secret insights about the detailed year-to-year or month-to-month changes in the field; if he thought that, he’d have been making those near-term tech predictions already back in 2010, 2015, or 2020 to show that he has this skill.
From Eliezer’s perspective, Paul is claiming to know a lot about the future trajectory of AI, and not just about the endpoints: Paul thinks progress will be relatively smooth and continuous, and thinks it will get increasingly smooth and continuous as time passes and more resources flow into the field. Eliezer, by contrast, expects the field to get choppier as time passes and we get closer to ASI.
A way to bet on this, which Eliezer repeatedly proposed but wasn’t able to get Paul to do very much, would be for Paul to list out a bunch of concrete predictions that Paul sees as “yep, this is what smooth and continuous progress looks like”. Then, even though Eliezer doesn’t necessarily have a concrete “nope, the future will go like X instead of Y” prediction, he’d be willing to bet against a portfolio of Paul-predictions: when you expect the future to be more unpredictable, you’re willing to at least weakly bet against any sufficiently ambitious pool of concrete predictions.
(Also, if Paul generated a ton of predictions like that, an occasional prediction might indeed make Eliezer go “oh wait, I do have a strong prediction on that question in particular; I didn’t realize this was one of our points of disagreement”. I don’t think this is where most of the action is, but it’s at least a nice side-effect of the person-who-thinks-this-tech-is-way-more-predictable spelling out predictions.)
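To make the portfolio idea concrete, here is a minimal sketch of why two people can both be happy to take opposite sides of a bet on a pool of predictions, even when neither has a specific counter-prediction. Everything numerical here is hypothetical and not taken from the dialogue: the pool size, the per-prediction probabilities, and the simplifying assumption that the predictions are independent.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent predictions come true,
    when each one comes true with probability p (binomial tail)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical pool: 10 concrete predictions about "smooth" progress.
# The person who wrote them gives each an 80% chance; someone who expects
# a choppier future gives each only 60%, without predicting what happens instead.
n, k = 10, 8
print(f"author's view:  P(at least {k}/{n}) = {prob_at_least(k, n, 0.8):.3f}")
print(f"skeptic's view: P(at least {k}/{n}) = {prob_at_least(k, n, 0.6):.3f}")
```

With these made-up numbers the author thinks "at least 8 of 10 come true" is more likely than not, while the skeptic thinks it is unlikely, so an even-odds bet on that statement looks favorable to both sides.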
Eliezer was also more interested in trying to reach mutual understanding of the views on offer, as opposed to “let’s bet on things immediately, never mind the world-views”. But insofar as Paul really wanted to have the bets conversation instead, Eliezer sunk an awful lot of time into trying to find operationalizations Paul and he could bet on, over many hours of conversation.
If your end-point take-away from that (even after actual bets were in fact made, and tons of different high-level predictions were sketched out) is “wow how dare Eliezer be so unwilling to make bets on anything”, then I feel a lot less hope that world-models like Eliezer’s (“long-term outcome is more predictable than the detailed year-by-year tech pathway”) are going to be given a remotely fair hearing.
(Also, in fairness to Paul, I’d say that he spent a bunch of time working with Eliezer to try to understand the basic methodologies and foundations for their perspectives on the world. I think both Eliezer and Paul did an admirable job going back and forth between the thing Paul wanted to focus on and the thing Eliezer wanted to focus on, letting us look at a bunch of different parts of the elephant. And I don’t think it was unhelpful for Paul to try to identify operationalizations and bets, as part of the larger discussion; I just disagree with TurnTrout’s summary of what happened.)
Thanks for your feedback. I certainly appreciate your articles and I share many of your views. Reading what you had to say, along with Quentin, Jacob Cannell, and Nora, was a very welcome alternative take that expanded my thinking and changed my mind. I have changed my mind a lot over the last year, from thinking AI was a long way off and Yud/Bostrom were basically right to seeing that it’s a lot closer and theories without data are almost always wrong in many ways, e.g. SUSY was expected to be true for decades by most of the world’s smartest physicists. Many alignment ideas before GPT3.5 are either sufficiently wrong or irrelevant to do more harm than good.
I especially have in mind the overdependence on analogy, such as to evolution. Sure, when we had nothing to go on it was a start, but when data comes in, ideas based on analogies should be gone pretty fast if they disagree with hard data.
(Some background: I have read the site for over 10 years, have followed AI for my entire career, have an understanding of Maths and Psychology, and have built and deployed a very small NN model commercially. Also as an aside I remember distinctly being surprised that Yud was skeptical of NN/DL in the earlier days when I considered it obviously where AI progress would come from. I don’t have references because I didn’t think that would be disputed afterwards.)
I am not sure what the silent majority belief on this site is (by people not Karma)? Is Yud’s worldview basically right or wrong?
Well they definitely can be applied there, though perhaps it’s a stage further than analogy and direct application of theory? Then of course data can agree/disagree.
gradient descent is not evolution and does not behave like evolution. it may still have problems one can imagine evolution having, but you can’t assume facts about evolution generalize—it’s in fact quite different.
> e.g. SUSY was expected to be true for decades by most of the world’s smartest physicists.
I really don’t want to go down a rabbit hole here, so probably won’t engage in further discussion, but I just want to chime in here and say that I’m pretty sure lots of the world’s smartest physicists (not sure what fraction) still expect the fundamental laws of physics in our universe to have (broken) supersymmetry, and I would go further and say that they have numerous very good reasons to expect that, like gauge coupling unification etc. Same as ever. The fact that supersymmetric partners were not found at LHC is nonzero evidence against supersymmetric partners existing, but it’s not strong evidence against them existing, because LHC was very very far from searching the whole space of possibilities. Also, we pretty much know for a fact that the universe contains at least one other yet-to-be-discovered elementary particle beyond the 17 (or whatever, depends on how you count) particles in the Standard Model. So I think it’s extremely premature to imply that the prediction of yet-to-be-discovered supersymmetric partner particles has been ruled out in our universe and haha look at those overconfident theoretical physicists. (A number of specific SUSY-involving theories have been ruled out, but I think the smart physicists knew all along that those were just plausible hypotheses worth checking, not confident theoretical predictions.)
OK you are answering at a level more detailed than I raised and seem to assume I didn’t consider such things. My point, and IMO the expected reading of “SUSY has failed”, is not that such particles have been ruled out, as I know they haven’t, but that its theoretical benefits are severely weakened or entirely ruled out according to recent data. My reference to SUSY was specifically regarding its opportunity to solve the Hierarchy Problem. This is the common understanding of one of the reasons it was proposed.
I stand by my claim that many/most of the top physicists expected for >1 decade that it would help solve such a problem. I disagree with the claim:
> but I think the smart physicists knew all along that those were just plausible hypotheses worth checking,
Smart physicists thought SUSY would solve the hierarchy problem.
----
Common knowledge, from GPT4:
“can SUSY still solve the Hierarchy problem with respect to recent results”
Hierarchy Problem: SUSY has been considered a leading solution to the hierarchy problem because it naturally cancels out the large quantum corrections that would drive the Higgs boson mass to a very high value. However, the non-observation of supersymmetric particles at expected energy levels has led some physicists to question whether SUSY can solve the hierarchy problem in its simplest forms.
Fine-Tuning: The absence of low-energy supersymmetry implies a need for fine-tuning in the theory, which contradicts one of the primary motivations for SUSY as a solution to the hierarchy problem. This has led to exploration of more complex SUSY models, such as those with split or high-scale supersymmetry, where SUSY particles exist at much higher energy scales.
----
IMO ever more complex models rapidly become like epicycles.
> I am not sure what the silent majority belief on this site is (by people not Karma)? Is Yud’s worldview basically right or wrong?
I think this will depend strongly on where you draw the line on “basically”. I think the majority probably thinks:
AI is likely to be a really big deal
Existential risk from AI is at least substantial (e.g. >5%)
AI takeoff is reasonably likely to happen quite quickly in wall clock time if this isn’t actively prevented (e.g. AI will cause there to be <10 years from a 20% annualized GDP growth rate to a 100x annualized growth rate)
The power of full technological maturity is extremely high (e.g. nanotech, highly efficient computing, etc.)
But, I expect that the majority of people don’t think:
Inside view, existential risk is >95%
A century of dedicated research on alignment (targeted as well as society would realistically do) is insufficient to get risk <15%.
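To put the two growth rates in the takeoff bullet above on a common scale, here is a minimal sketch that converts an annual growth factor into a doubling time. It assumes "20% annualized growth" means output is multiplied by 1.2 each year and "100x annualized growth" means it is multiplied by 100 each year; that reading, and the function name, are assumptions rather than anything stated in the comment.

```python
import math


def doubling_time_years(annual_growth_factor: float) -> float:
    """Years needed for output to double if it is multiplied by
    `annual_growth_factor` every year (constant exponential growth)."""
    if annual_growth_factor <= 1.0:
        raise ValueError("growth factor must be greater than 1 to ever double")
    return math.log(2.0) / math.log(annual_growth_factor)


slow = doubling_time_years(1.20)   # assumed meaning of "20% annualized growth"
fast = doubling_time_years(100.0)  # assumed meaning of "100x annualized growth"
print(f"20% per year:  doubles every {slow:.2f} years")
print(f"100x per year: doubles every {fast * 365:.0f} days")
```

On this reading, the claim is about moving from an economy that doubles every few years to one that doubles in under two months, within less than a decade.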
Yes to AI being a big deal and extremely powerful (yes, I doubt anyone would be here otherwise).
Yes: I don’t think anyone can reasonably claim it’s <5%, but then neither is the risk from not having AI, if x-risk is defined to be humanity missing practically all of its Cosmic endowment.
Maybe: even with a slow, hardware-constrained takeoff you get much greater GDP, though I don’t agree with 100x (for the critical period that is, 100x could happen later). E.g. car factories are made to produce robots, we get 1-10 billion more minds and bodies per year, but not quite 100X. ~10x per year is enough to be extremely disruptive and an x-risk anyway.
---
(1)
Yes I don’t think x-risk is >95% - say 20% as a very rough guess that humanity misses all its Cosmic endowment. I think AI x-risk needs to be put in this context—say you ask someone
“What’s the chance that humanity becomes successfully interstellar?”
If they say 50/50 then being OK with any AI x-risk less than 50% is quite defensible if getting AI right means that it’s practically certain you get your cosmic endowment etc.
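The comparison above can be written out as a small worked example. This is only a sketch of the arithmetic under a deliberately simple model that is an assumption here, not part of the comment: without AI, humanity reaches its cosmic endowment with some baseline probability; with AI, it does so only if AI goes well. All names are hypothetical.

```python
def max_acceptable_ai_risk(baseline_success: float,
                           success_if_ai_goes_well: float = 1.0) -> float:
    """Largest AI x-risk r such that building AI is at least as good as not:
    (1 - r) * success_if_ai_goes_well >= baseline_success."""
    if not 0.0 <= baseline_success <= 1.0:
        raise ValueError("baseline_success must be between 0 and 1")
    if not 0.0 < success_if_ai_goes_well <= 1.0:
        raise ValueError("success_if_ai_goes_well must be in (0, 1]")
    # Solve the inequality for r; clamp at 0 when no level of risk is worth it.
    return max(0.0, 1.0 - baseline_success / success_if_ai_goes_well)


# The case in the text: 50/50 baseline, success practically certain if AI goes well.
print(max_acceptable_ai_risk(0.5, 1.0))
# A less generous assumption: AI going well gives only a 90% chance of success.
print(round(max_acceptable_ai_risk(0.5, 0.9), 3))
```

The threshold is sensitive to both inputs, so the "less than 50%" figure only follows when the baseline really is 50/50 and success given aligned AI really is close to certain.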
---
(2)
I do think it’s defensible that a century of dedicated research on alignment doesn’t get risk <15% but because alignment research is only useful a little bit in advance of capabilities. Say we had a 100 year pause: I wouldn’t have confidence in our alignment plan at the end of it.
Anyway regarding x-risk I don’t think there is a completely safe path. Too fast with AI and there is obvious risk; too slow and there are also other obvious risks. Our current situation is likely unstable. For example the famous quote:
“If you want a picture of the future, imagine a boot stamping on a human face— forever.”
I believe that is now possible with current tech, where it was not for, say, Soviet Russia. So we may be in the situation where societies can go 1984 totalitarian bad, but not come back because our tech coordination skills are sufficient to stop centralized empires from collapsing. LLMs of course make censorship even easier. (I am sure there are other ways our current tech could destroy most societies also.)
If that’s the case, a long pause could result in all power being in such societies which when the pause ended would be very likely to screw up alignment.
That makes me unsure what regulation to advocate for, though I am in favor of slowing down hardware AI progress but fully exploring the capabilities of our current HW.
Most importantly I think we should hugely speed up Neuralink type devices and brain uploading. I would identify much more with an uploaded human that was then carefully, appropriately upgraded to superintelligence than an alternative path where a pure AI superintelligence was made.
We have to accept that we live in critical times and just slowing things down is not necessarily the safest option.
> (High confidence) I feel like the project of thinking more clearly has largely fallen by the wayside, and that we never did that great of a job at it anyways.
I’m new to this community. I’ve skimmed quite a few articles, and this sentence resonates with me for several reasons.
1) It’s very difficult in general to find websites like LessWrong these days. And among the few that exist, I’ve found that the intellectuals on them are so incredibly doubtful of their own intellect. This creates a sort of Ouroboros phenomenon where intellects just eat themselves into oblivion. Like, maybe I’m wrong but this site’s popularity seems to be going down?
2) At least from what I’ve noticed, when I compare articles in the last 2 months, to ones from about a decade ago, there is an alarming truth in your sentence. A decade ago, there were questions left in the articles for commenters to answer, there was a willingness to change one’s mind and to add/enhance ideas in a good faith manner. Now, it seems that many have confused this website for LinkedIn, posting their own personal paper trails (which are largely in a tone that isn’t unique anyways).
It’s really unfortunate, since I was excited upon being greeted with much older articles. And then realising “Oh… that was from… holy! 10 years ago!?” To then be disappointed by our articles from today.
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI.
I think that might be a result of how the topic is, well, just really fucking grim. I think part of what allows discussion of it and thought about it for a lot of people (including myself) is a certain amount of detachment. “AI doomers” often get accused of being LARPers or not taking their own ideas seriously because they don’t act like people who believe the world is ending in 10 years, but I’d flip that around: a person who believes the world is ending in 10 years probably acts absolutely insane, and so people, to keep their maximum possible sanity, establish a sort of barrier and discuss these things as they would a game or a really interesting scientific question. But actually placing a bet on it? Shorting your own future on the premise that you won’t have a future? That breaks the barrier, and it becomes just really uncomfortable. I know I’d still rather live as if I was dead wrong no matter how confident I am in being theoretically right. I wonder in fact whether this feeling was shared by e.g. game theorists working on nuclear strategy.
I think there are some great points in this comment but I think it’s overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community’s culture than opinions or personal interactions is upvotes and downvotes which are much more frequent and cheap actions and therefore more representative. For example, your posts such as Reward is not the optimization target have received hundreds of upvotes, so apparently they are positively received.
LessWrong these days is huge with probably over 100,000 monthly readers so I think it’s challenging to summarize its culture in any particular way (e.g. probably most users on LessWrong live outside the bay area and maybe even outside the US). I personally find that LessWrong as a whole is fairly meritocratic and not that dogmatic, and that a wide variety of views are supported provided that they are sufficiently well-argued.
In addition to LessWrong, I use some other related sites such as Twitter, Reddit, and Hacker News and although there may be problems with the discourse on LessWrong, I think it’s generally significantly worse on these other sites. Even today, I’m sure you can find people saying things on Twitter about how AIs can’t have goals or that wanting paperclips is stupid. These kinds of comments wouldn’t be tolerated on LessWrong because they’re ignorant and a waste of time. Human nature can be prone to ignorance, rigidness of opinions and so on but I think the LessWrong walled garden has been able to counteract these negative tendencies better than most other sites.
No disagreement here that this place does this. I also think we should attempt to change many of these things. However, I don’t expect the LessWrong team to do anything sufficiently drastic to counter the hero-worship. Perhaps they could consider hiding usernames by default, hiding vote counts until things have been around for some period of time, etc.
Hmm, my sense is Eliezer very rarely comments, and the people who do comment a lot don’t have a ton of hero worship going on (like maybe Wentworth?). So I don’t super believe that hiding usernames would do much about this.
Agree, and my guess is that the hero worship, to the extent it happens, is caused by something like
for Eliezer: people finding the rationality community and observing that they were less crazy than most other communities about various things, and Eliezer was a very prolific and persuasive writer
for Paul: Paucity of empirical alignment work before 2021 meant that Paul was one of the few people with formal CS experience and good alignment ideas, and had good name recognition due to posting on LW
I think one of the issues with Eliezer is that he sees himself as a hero, and it comes through both explicitly and in vibes in the writing, and Eliezer is also a persuasive writer.
Nothing wrong with it, in fact I recommend it. But seeing oneself as a hero and persuading others of it will indeed be one of the main issues leading to hero worship.
how would you operationalize a bet on this? I’d take “yes” on “will hiding usernames by default decrease hero worship on lesswrong” on manifold, if you want to do an AB test or something.
Hacker News shows you the vote counts on your comments privately. I think that’s a significant improvement. It nudges people towards thinking for themselves rather than trying to figure out where the herd is going. At least, I think it does, because HN seems to have remarkable viewpoint diversity compared with other forums.
I think it’s fine for there to be a status hierarchy surrounding “good alignment research”. It’s obviously bad if that becomes mismatched with reality, as it almost certainly is to some degree, but I think people getting prestige for making useful progress is essentially what has to happen for it to be done at all.
If we aren’t good at assessing alignment research, there’s the risk that people substitute the goal of “doing good alignment research” with “doing research that’s recognized as good alignment research”. This could lead to a feedback loop where a particular notion of “good research” gets entrenched: Research is considered good if high status researchers think it’s good; the way to become a high status researcher is to do research which is considered good by the current definition, and have beliefs that conform with those of high status researchers.
A number of TurnTrout’s points were related to this (emphasis mine):
I think we’ve kinda patted ourselves on the back for being awesome and ahead of the curve, even though, in terms of alignment, I think we really didn’t get anything done until 2022 or so, and a lot of the meaningful progress happened elsewhere. [MY NOTE: I suspect more could have been done prior to 2022 if our notion of “good research” had been better calibrated, or even just broader]
(Medium confidence) It seems possible to me that “taking ideas seriously” has generally meant something like “being willing to change your life to further the goals and vision of powerful people in the community, or to better accord with socially popular trends”, and less “taking unconventional but meaningful bets on your idiosyncratic beliefs.”
Somewhat relatedly, there have been a good number of times where it seems like I’ve persuaded someone of A and of A⟹B, and they still don’t believe B, and coincidentally B is unpopular.
...
(Medium-high confidence) I think that alignment “theorizing” is often a bunch of philosophizing and vibing in a way that protects itself from falsification (or even proof-of-work) via words like “pre-paradigmatic” and “deconfusion.” I think it’s not a coincidence that many of the “canonical alignment ideas” somehow don’t make any testable predictions until AI takeoff has begun. 🤔
I’d like to see more competitions related to alignment research. I think it would help keep assessors honest if they were e.g. looking at 2 anonymized alignment proposals, trying to compare them on a point-by-point basis, figuring out which proposal has a better story for each possible safety problem. If competition winners subsequently become high status, that could bring more honesty to the entire ecosystem. Teach people to focus on merit rather than politics.
Somewhat relatedly, there have been a good number of times where it seems like I’ve persuaded someone of A and of A ⟹ B and they still don’t believe B, and coincidentally B is unpopular.
Would you mind sharing some specific examples? (Not of people but of beliefs)
I mostly feel bad about LessWrong these days. I slightly dread logging on, I don’t expect to find much insightful on the website, and think the community has a lot of groupthink / other “ew” factors that are harder for me to pin down (although I think that’s improved over the last year or two). I also feel some dread at posting this because it might burn social capital I have with the mods, but whatever.
(Also, most of this stuff is about the community and not directly in the purview of the mods anyways.)
Here are some rambling thoughts, though:
I think there are pretty good reasons that the broader AI community hasn’t taken LW seriously.
I feel a lot of cynicism. I worry that colors my lens here. But I’ll just share what I see looking through that lens.
Also some of my cynicism comes from annoying-feeling object-level disagreements driving me away from the website. Probably other people are having more fun.
(High confidence) I feel like the project of thinking more clearly has largely fallen by the wayside, and that we never did that great of a job at it anyways.
Over time, I’ve felt myself grow more distant from this community and website. At times, it feels sad. At times, it feels correct. Sometimes it feels both.
(Medium confidence, unsure if relevant to LW itself) In the bay area community, there are lots of professionally relevant events which are de facto gated by how much random people like you on a personal level (namely, the organizers). There’s also a lot of weird social stuff but IDK how relevant that is to LW.
(Medium confidence) It seems to me that often people rehearse fancy and cool-sounding reasons for believing roughly the same things they always believed, and comment threads don’t often change important beliefs. Feels more like people defensively explaining why they aren’t idiots, or why they don’t have to change their mind. I mean, if so—I get it, sometimes I feel that way too. But it sucks and I think it happens a lot.
I feel worried that there are a bunch of people with entrenched worldviews who basically never change their minds about anything important. Seems unhealthy on a community level.
Like, there is a way that it feels to be defending yourself or sailing against the winds of counterevidence to your beliefs, and it’s really really important to not do that. Come on guys :(
(When Wei_Dai introduced Updateless Decision Theory, it wasn’t about this kind of “updatelessness”! :( )
(High confidence) I think this community has engaged in a lot of hero worship. I think to some extent I have benefited from this, though I don’t think I’m the prototype. But, seriously guys, looking back, I think this place has been pretty creepy in some ways.
The way people praise/exalt Eliezer and Paul is just… weird. The times I’d be at an in-person workshop, and people would spend time “ranking” alignment researchers. Feels like a social status horse race, and probably LessWrong has some direct culpability here.
But people don’t seem to take Eliezer as seriously these days, which I think is great, so maybe it’s less of a problem now.
I think this is Eliezer’s fault in his case and mostly not Paul’s fault for his own rep, but IDK.
I think we’ve kinda patted ourselves on the back for being awesome and ahead of the curve, even though, in terms of alignment, I think we really didn’t get anything done until 2022 or so, and a lot of the meaningful progress happened elsewhere.
(Medium confidence) It seems possible to me that “taking ideas seriously” has generally meant something like “being willing to change your life to further the goals and vision of powerful people in the community, or to better accord with socially popular trends”, and less “taking unconventional but meaningful bets on your idiosyncratic beliefs.”
Somewhat relatedly, there have been a good number of times where it seems like I’ve persuaded someone of A and of A⟹B, and they still don’t believe B, and coincidentally B is unpopular.
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)
(Medium-high confidence) I think that alignment “theorizing” is often a bunch of philosophizing and vibing in a way that protects itself from falsification (or even proof-of-work) via words like “pre-paradigmatic” and “deconfusion.” I think it’s not a coincidence that many of the “canonical alignment ideas” somehow don’t make any testable predictions until AI takeoff has begun. 🤔
I expect there to be a bunch of responses which strike me as defensive, revisionist gaslighting, and I don’t know if/when I’ll reply.
This sentiment resonates strongly with me.
A personal background: I remember getting pretty heavily involved in AI alignment discussions on LessWrong in 2019. Back then I think there were a lot of assumptions people had about what “the problem” was that are, these days, often forgotten, brushed aside, or sometimes even deliberately minimized post-hoc in order to give the impression that the field has a better track record than it actually does. [ETA: but to be clear, I don’t mean to say everyone made the same mistake I describe here]
This has been a bit shocking and disorienting to me, honestly, because at the time in 2019 I didn’t get the strong impression that people were deliberately constructing unfalsifiable models of the problem. I had the vague impression that people had relatively firm views that made predictions about the world prior to superintelligence, and that these views were open to revision upon new evidence. And that naiveté led me to experience the last few years of evidence a bit differently than I think some other people.
To give a cursory taste of what I’m talking about, we can consider what I think of as a representative blog post from the genre of pre-2020 alignment content: Rohin Shah and Dmitrii Krasheninnikov’s “Learning Preferences by Looking at the World”. This is by no means a cherry-picked example either, in my opinion (and I don’t mean to criticize Rohin specifically, I’m just using this blog post as a representative example of what people talked about at the time). In the blog post, they state,
Suppose in 2024-2029, someone constructs an intelligent robot that is able to clean a room to a high level of satisfaction, consistent with the user’s intentions, without any major negative side effects or general issues of misspecification. It doesn’t break any vases while cleaning. It respects all basic moral norms you can think of. It lets you shut it down whenever you want. And it actually does its job of cleaning the room in a reasonable amount of time.
If that were to happen, I think an extremely natural reading of the situation is that a substantial part of what we thought “the problem” was in value alignment has been solved, from the perspective of this blog post from 2019. That is cause for an updating of our models, and a verbal recognition that our models have updated in this way.
Yet, that’s not how I think everyone on LessWrong would react to the development of such a robot. My impression is that a large fraction, perhaps a majority, of LessWrongers would not share my interpretation here, despite the plain language in the post explaining what they thought the problem was. Instead, I imagine many people would respond to this argument basically saying the following:
“We never thought that was the hard bit of the problem. We always thought it would be easy to get a human-level robot to follow instructions reliably, do what users intend without major negative side effects, follow moral constraints including letting you shut it down, and respond appropriately given unusual moral dilemmas. The idea that we thought that was ever the problem is a misreading of what we wrote. The problem was always purely that alignment issues would arise after we far surpassed human intelligence, at which point entirely novel problems will arise.”
But the blog post said there was a problem, gave an example of the problem manifesting, and then spent the rest of the post trying to come up with solutions. The authors gave no indication that this particular problem was trivial, or that the example used was purely illustrative and had nothing to do with the type of real-world issues that might arise if we fail to solve the problem. If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!?
To clarify, I’m in full agreement with anyone who simply says that the alignment problem looks like it might still be hard, based on different arguments than the one presented in this blog post. There were a lot of arguments people gave back then, and some of the older arguments still look correct. Perhaps most significantly, robustness to distribution shifts still looks reasonably hard as a problem. But the blog post I cited explicitly said “Note that we’re not talking about problems with robustness and distributional shift”!
At this point I think there are a number of potential replies from people who still insist that the LW models of AI alignment were never wrong, which I (depending on the speaker) think can often border on gaslighting:
“Rohin’s point wasn’t that this problem would be hard. He was using it as a mere stepping stone to explain much harder problems of misspecification that were, at that time, purely theoretical.”
Then why did the paper explicitly say, “for many real-world tasks it can be challenging to specify a reward function that captures human preferences, particularly the preference for avoiding unnecessary side effects while still accomplishing the goal (Amodei et al., 2016).”
Why is it so hard to find people explicitly saying that this specific problem, and the examples illustrating it, were not meant to be seriously representative of the hard parts of alignment at the time?
Isn’t it still pretty valuable to point out that we’re solving stepping stones on the path towards the ‘real problem’?
“Sure, Rohin thought that was a major problem, but we [our organization/thought cluster/ideological group] never agreed with him.”
Oh really? Did you ever explicitly highlight this particular disagreement at the time? He wasn’t exactly a minor researcher at the time. And this blog post is only one of a number of blog posts expressing essentially an identical sentiment.
[ETA: to be fair, I do think there were some people who did genuinely disagree with Rohin’s framing. I don’t mean to accuse everyone of making the same error.]
“Yes, this particular part of the alignment problem looks easier than we thought, but serious people always thought that this was going to be one of the easiest subproblems, compared to other things. This problem was considered a very minor sub-problem of value alignment that merited like 1% of researcher-hours.”
Then why is it so easy to find countless blog posts of a similar nature from alignment researchers at the time, presenting pretty much the same problem and then presenting an attempt to solve it? Did all those people simply knowingly work on one of the easiest sub-problems of alignment?
I wrote a fair amount about alignment from 2014-2020[1] which you can read here. So it’s relatively easy to get a sense for what I believed.
Here are some summary notes about my views as reflected in that writing, though I’d encourage you to just judge for yourself[2] by browsing the archives:
I expected AI systems to be pretty good at predicting what behaviors humans would rate highly, long before they were catastrophically risky. This comes up over and over again in my writing. In particular, I repeatedly stated that it was very unlikely that an AI system would kill everyone because it didn’t understand that people would disapprove of that action, and therefore this was not the main source of takeover concerns. (By 2017 I expected RLHF to work pretty well with language models, which was reflected in my research prioritization choices and discussions within OpenAI though not clearly in my public writing.)
I consistently expressed that my main concerns were instead about (i) systems that were too smart for humans to understand the actions they proposed, (ii) treacherous turns from deceptive alignment. This comes up a lot, and when I talk about other problems I’m usually clear that they are prerequisites that we should expect to succeed. E.g. see an unaligned benchmark. I don’t think this position was an extreme outlier; my impression at the time was that other researchers had broadly similar views.
I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I’ve updated about this and definitely acknowledge I was wrong.[3] I don’t think it totally changes the picture though: I’m still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
In 2016 I pointed out that ML systems being misaligned on adversarial inputs and exploitable by adversaries was likely to be the first indicator of serious problems, and therefore that researchers in alignment should probably embrace a security framing and motivation for their research.
I expected LM agents to work well (see this 2015 post). Comparing this post to the world of 2023 I think my biggest mistake was overestimating the importance of task decomposition vs just putting everything in a single in-context chain of thought. These updates overall make crazy amplification schemes seem harder (and to require much smarter models than I originally expected, if they even make sense at all) but at the same time less necessary (since chain of thought works fine for capability amplification for longer than I would have expected).
I overall think that I come out looking somewhat better than other researchers working in AI alignment, though again I don’t think my views were extreme outliers (and during this period I was often pointed to as a sensible representative of fairly hardcore and traditional alignment concerns).
Like you, I am somewhat frustrated that e.g. Eliezer has not really acknowledged how different 2023 looks from the picture that someone would take away from his writing. I think he’s right about lots of dynamics that would become relevant for a sufficiently powerful system, but at this point it’s pretty clear that he was overconfident about what would happen when (and IMO is still very overconfident in a way that is directly relevant to alignment difficulty). The most obvious one is that ML systems have made way more progress towards being useful R&D assistants way earlier than you would expect if you read Eliezer’s writing and took it seriously. By all appearances he didn’t even expect AI systems to be able to talk before they started exhibiting potentially catastrophic misalignment.
I think my opinions about AI and alignment were much worse from 2012-2014, but I did explicitly update and acknowledge many mistakes from that period (though some of it was also methodological issues, e.g. I believe that “think about a utility function that’s safe to optimize” was a useful exercise for me even though by 2015 I no longer thought it had much direct relevance).
I’d also welcome readers to pull out posts or quotes that seem to indicate the kind of misprediction you are talking about. I might either acknowledge those (and I do expect my historical reading is very biased for obvious reasons), or I might push back against them as a misreading and explain why I think that.
That said, in fall 2018 I made and shared some forecasts which were the most serious forecasts I made from 2016-2020. I just looked at those again to check my views. I gave a 7.5% chance of TAI by 2028 using short-horizon RL (over a <5k word horizon using human feedback or cheap proxies rather than long-term outcomes), and a 7.5% chance that by 2028 we would be able to train smart enough models to be transformative using short-horizon optimization but be limited by engineering challenges of training and integrating AI systems into R&D workflows (resulting in TAI over the following 5-10 years). So when I actually look at my probability distributions here I think they were pretty reasonable. I updated in favor of alignment being easier because of the relative unimportance of long-horizon RL, but the success of imitation learning and short-horizon RL was still a possibility I was taking very seriously and overall probably assigned higher probability to than almost anyone in ML.
I agree, your past views do look somewhat better. I painted alignment researchers with a fairly broad brush in my original comment, which admittedly might have been unfair to many people who departed from the standard arguments (alternatively, it gives those researchers a chance to step up and receive credit for having been in the minority who weren’t wrong). Partly I portrayed the situation like this because I have the sense that the crucial elements of your worldview that led you to be more optimistic were not disseminated anywhere close to as widely as the opposite views (e.g. “complexity of wishes”-type arguments), at least on LessWrong, which is where I was having most of these discussions.
My general impression is that it sounds like you agree with my overall take although you think I might have come off too strong. Perhaps let me know if I’m wrong about that impression.
Some thoughts on my journey in particular:
When I joined AI safety in late 2017 (having read approximately nothing in the field), I thought of the problem as “construct a utility function for an AI system to optimize”, with a key challenge being the fragility of value. In hindsight this was clearly wrong.
The Value Learning sequence was in large part a result of my journey away from the utility function framing.
That being said, I suspect I continued to think that fragility-of-value type issues were a significant problem, probably until around mid-2019 (see next point).
(I did continue some projects more motivated from a fragility-of-value perspective, partly out of a heuristic of actually finishing things I start, and partly because I needed to write a PhD thesis.)
Early on, I thought of generalization as a key issue for deep learning and expected that vanilla deep learning would not lead to AGI for this reason. Again, in hindsight this was clearly wrong.
I was extremely surprised by OpenAI Five in June 2018 (not just that it worked, but also the ridiculous simplicity of the methods, in particular the lack of any hierarchical RL) and had to think through that.
I spent a while trying to understand that (at least months, e.g. you can see me continuing to be skeptical of deep learning in this Dec 2018 post).
I think I ended up close to my current views around early-to-mid-2019, e.g. I still broadly agree with the things I said in this August 2019 conversation (though I’ll note I was using “mesa optimizer” differently than it is used today—I agree with what I meant in that conversation, though I’d say it differently today).
I think by this point I was probably less worried about fragility of value. E.g. in that conversation I say a bunch of stuff that implies it’s less of a problem, most notably that AI systems will likely learn similar features as humans just from gradient descent, for reasons that LW would now call “natural abstractions”.
Note that this comment is presenting the situation as a lot cleaner than it actually was. I would bet there were many ways in which I was irrational / inconsistent, probably some times where I would have expressed verbally that fragility of value wasn’t a big deal but would still have defended research projects centered around it from some other perspective, etc.
Some thoughts on how to update based on past things I wrote:
I don’t think I’ve ever thought of myself as largely agreeing with LW: my relationship to LW has usually been “wow, they seem to be getting some obvious stuff wrong” (e.g. I was persuaded of slow takeoff basically when Paul’s post and AI Impacts’ post came out in Feb 2018, the Value Learning sequence in late 2018 was primarily in response to my perception that LW was way too anchored on the “construct a utility function” framing).
I think you don’t want to update too hard on the things that were said on blog posts addressed to an ML audience, or in papers that were submitted to conferences. Especially for the papers there’s just a lot of random stuff you couldn’t say about why you’re doing the work because then peer reviewers will object (e.g. I heard second hand of a particularly egregious review to the effect of: “this work is technically solid, but the motivation is AGI safety; I don’t believe in AGI so this paper should be rejected”).
I remember explicit discussion about how solving this problem shouldn’t even count as part of solving long-term / existential safety, for example:
“What I understand this as saying is that the approach is helpful for aligning housecleaning robots (using near extrapolations of current RL), but not obviously helpful for aligning superintelligence, and likely stops being helpful somewhere between the two. [...] There is a risk that a large body of safety literature which works for preventing today’s systems from breaking vases but which fails badly for very intelligent systems actually worsens the AI safety problem” https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting?commentId=rK9K3JebKDofvJA3x
See also The Main Sources of AI Risk? where this problem was only part of one bullet point, out of 30 (as of 2019, 35 now).
Two points:
I have a slightly different interpretation of the comment you linked to, which makes me think it provides only weak evidence for your claim. (Though it’s definitely still some evidence.)
I agree some people deserve credit for noticing that human-level value specification might be kind of easy before LLMs. I don’t mean to accuse everyone in the field of making the same mistake.
Anyway, let me explain the first point.
I interpret Abram to be saying that we should focus on solutions that scale to superintelligence, rather than solutions that only work on sub-superintelligent systems but break down at superintelligence. This was in response to Alex’s claim that “whitelisting contributes meaningfully to short- to mid-term AI safety, although I remain skeptical of its robustness to scale.”
In other words, Alex said (roughly): “This solution seems to work for sub-superintelligent AI, but might not work for superintelligent AI.” Abram said in response that we should push against such solutions, since we want solutions that scale all the way to superintelligence. This is not the same thing as saying that any solution to the house-cleaning robot provides negligible evidence of progress, because some solutions might scale.
It’s definitely arguable, but I think it’s likely that any realistic solution to the human-level house cleaning robot problem—in the strong sense of getting a robot to genuinely follow all relevant moral constraints, allow you to shut it down, and perform its job reliably in a wide variety of environments—will be a solution that scales reasonably well above human intelligence (maybe not all the way to radical superintelligence, but at the very least I don’t think it’s negligible evidence of progress).
If you merely disagree that any such solutions will scale, and you’ve been consistent on this point for the last five years, then I guess I’m not really addressing you in my original comment, but I still think what I wrote applies to many other researchers.
I think it’s actually 2 points,
“Misspecified or incorrectly learned goals/values”
“Inability to specify any ‘real-world’ goal for an artificial agent (suggested by Michael Cohen)”
I’m not sure how much to be compelled by this piece of evidence. I’ll point out that naming the same problem multiple times might have gotten repetitive, and there was also no explicit ranking of the problems from most important to least important (or from hardest to easiest). If the order you wrote them in can be (perhaps uncharitably) interpreted as the order of importance, then I’ll note that it was listed as problem #3, which I think supports my original thesis adequately.
This matches my sense of how a lot of people seem to have… noticed that GPT-4 is fairly well aligned to what the OpenAI team wants it to be, in ways that Yudkowsky et al said would be very hard, and still not view this as at a minimum a positive sign?
I.e., problems of the class ‘I told the intelligence to get my mother out of the burning building and it blew her up so the dead body flew out the window, this is because I wasn’t actually specific enough’ just don’t seem like they are a major worry anymore?
Usually when GPT-4 doesn’t understand what I’m asking, I wouldn’t be surprised if a human was confused also.
Quoting the abstract of MIRI’s “The Value Learning Problem” paper (emphasis added):
And quoting from the first page of that paper:
I won’t weigh in on how many LessWrong posts at the time were confused about where the core of the problem lies. But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I also wouldn’t say “MIRI predicted that NLP will largely fall years before AI can match e.g. the best human mathematicians, or the best scientists”, and if we saw a way to leverage that surprise to take a big bite out of the central problem, that would be a big positive update.
I’d say:
MIRI mostly just didn’t make predictions about the exact path ML would take to get to superintelligence, and we’ve said we didn’t expect this to be very predictable because “the journey is harder to predict than the destination”. (Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”.)
Back in 2016-2017, I think various people at MIRI updated to median timelines in the 2030-2040 range (after having had longer timelines before that), and our timelines haven’t jumped around a ton since then (though they’ve gotten a little bit longer or shorter here and there).
So in some sense, qualitatively eyeballing the field, we don’t feel surprised by “the total amount of progress the field is exhibiting”, because it looked in 2017 like the field was just getting started, there was likely an enormous amount more you could do with 2017-style techniques (and variants on them) than had already been done, and there was likely to be a lot more money and talent flowing into the field in the coming years.
But “the total amount of progress over the last 7 years doesn’t seem that shocking” is very different from “we predicted what that progress would look like”. AFAIK we mostly didn’t have strong guesses about that, though I think it’s totally fine to say that the GPT series is more surprising to the circa-2017 MIRI than a lot of other paths would have been.
(Then again, we’d have expected something surprising to happen here, because it would be weird if our low-confidence visualizations of the mainline future just happened to line up with what happened. You can expect to be surprised a bunch without being able to guess where the surprises will come from; and in that situation, there’s obviously less to be gained from putting out a bunch of predictions you don’t particularly believe in.)
Pre-deep-learning-revolution, we made early predictions like “just throwing more compute at the problem without gaining deep new insights into intelligence is less likely to be the key thing that gets us there”, which was falsified. But that was a relatively high-level prediction; post-deep-learning-revolution we haven’t claimed to know much about how advances are going to be sequenced.
We have been quite interested in hearing from others about their advance prediction record: it’s a lot easier to say “I personally have no idea what the qualitative capabilities of GPT-2, GPT-3, etc. will be” than to say “… and no one else knows either”, and if someone has an amazing track record at guessing a lot of those qualitative capabilities, I’d be interested to hear about their further predictions. We’re generally pessimistic that “which of these specific systems will first unlock a specific qualitative capability?” is particularly predictable, but this claim can be tested via people actually making those predictions.
I think you missed my point: my original comment was about whether people are updating on the evidence from instruction-tuned LLMs, which seem to actually act on human values (i.e., our actual intentions) quite well, as opposed to mis-specified versions of our intentions.
I don’t think the Value Learning Problem paper said that it would be easy to make human-level AGI systems act on human values in a behavioral sense, rather than merely understand human values in a passive sense.
I suspect you are probably conflating two separate concepts:
It is easy to create a human-level AGI that can passively learn and understand human values (I am not saying people said this would be difficult in the past)
It is easy to create a human-level AGI that acts on human values, in the sense of actually executing instructions that follow our intentions, rather than following a dangerously mis-specified version of what we asked for.
I do not think the Value Learning Paper asserted that (2) was true. To the extent it asserted that, I would prefer to see quotes that back up that claim explicitly.
Your quote from the paper illustrates that it’s very plausible that people thought (1) was true, but that seems separate to my main point: that people thought (2) was not true. (1) and (2) are separate and distinct concepts. And my comment was about (2), not (1).
There is simply a distinction between a machine that actually acts on and executes your intended commands, and a machine that merely understands your intended commands, but does not necessarily act on them as you intend. I am talking about the former, not the latter.
From the paper,
Indeed, and GPT-4 does not base its decisions on a misrepresentation of its programmers’ intentions, most of the time. It generally both correctly understands our intentions, and more importantly, actually acts on them!
No? GPT-4 predicts text and doesn’t care about anything else. Under certain conditions it predicts nice text, under others not very nice text, and we don’t know what happens if we create a GPT actually capable of, say, building nanotech.
For what it’s worth I do remember lots of people around the MIRI-sphere complaining at the time that that kind of prosaic alignment work was kind of useless, because it missed the hard parts of aligning superintelligence.
I agree some people in the MIRI-sphere did say this, and a few of them get credit for pointing out things in this vicinity, but I personally don’t remember reading many strong statements of the form:
“Prosaic alignment work is kind of useless because it will actually be easy to get a roughly human-level machine to interpret our commands reliably, do what you want without significant negative side effects, and let you shut it down whenever you want etc. The hard part is doing this for superintelligence.”
My understanding is that a lot of the time the claim was instead something like:
“Prosaic alignment work is kind of useless because machine learning is natively not very transparent and alignable, and we should focus instead on creating alignable alternatives to ML, or building the conceptual foundations that would let us align powerful AIs.”
As some evidence, I’d point to Rob Bensinger’s statement that,
I do also think a number of people on LW sometimes said a milder version of the thing I mentioned above, which was something like:
“Prosaic alignment work might help us get narrow AI that works well in various circumstances, but once it develops into AGI, becomes aware that it has a shutdown button, and can reason through the consequences of what would happen if it were shut down, and has general situational awareness along with competence across a variety of domains, these strategies won’t work anymore.”
I think this weaker statement now looks kind of false in hindsight, since I think current SOTA LLMs are already pretty much weak AGIs, and so they already seem close to the threshold at which we were supposed to start seeing these misalignment issues come up. But they are not coming up (yet). I think near-term multimodal models will be even closer to the classical “AGI” concept, complete with situational awareness and relatively strong cross-domain understanding, and yet I also expect them to mostly be fairly well aligned to what we want in every relevant behavioral sense.
Well, for instance, I watched Ryan Carey give a talk at CHAI about how Cooperative Inverse Reinforcement Learning didn’t give you corrigibility. (That CIRL didn’t tackle the hard part of the problem, despite seeming related on the surface.)
I think that’s much more an example of
than of
This doesn’t seem to be the same thing as what I was talking about.
Yes, people frequently criticized particular schemes for aligning AI systems, arguing that the scheme doesn’t address some key perceived obstacle. By itself, this is pretty different from predicting both:
It will be easy to get behavioral alignment on slightly-sub-AGI, and maybe even par-human systems, including on shutdown problems
The problem is that these schemes don’t scale well all the way to radical superintelligence.
I remember a lot of people making the second point, but not nearly as many making the first point.
I think I’m missing you then.
If (some) people already had the view that this kind of prosaic alignment wouldn’t scale to Superintelligence, but didn’t express an opinion about whether behavioral alignment of slightly-sub-AGI would be solved, in what way do you want them to be updating that they’re not?
Or do you mean they weren’t just agnostic about the behavioral alignment of near-AGIs, they specifically thought that it wouldn’t be easy? Is that right?
Two points:
One, I think being able to align AGI and slightly sub-AGI successfully is plausibly very helpful for making the alignment problem easier. It’s kind of like learning that we can create more researchers on demand if we ever wanted to.
Two, the fact that the methods scale surprisingly well to human-level is evidence that they actually work pretty well in general, even if they don’t scale all the way into some radical regime way above human-level. For example, Eliezer talked about how he expected you’d need to solve the suspend button problem by the time your AI has situational awareness, but I think you can interpret this prediction as either becoming increasingly untenable, or that we appear close to a solution to the problem since our AIs don’t seem to be resisting shutdown.
Again, presumably once you get the aligned AGI, you can use many copies of the aligned AGI to help you with the next iteration, AGI+. This seems plausibly very positive as an update. I can sympathize with those who say it’s only a minor update because they never thought the problem was merely aligning human-level AI, but I’m a bit baffled by those who say it’s not an update at all from the traditional AI risk models, and are still very pessimistic.
I feel like I’m being obstinate or something, but I think that the linked article is still basically correct, and not particularly untenable.
From the article...
The key word in that sentence is “consequentialist”. Current LLMs are pretty close (I think!) to having pretty detailed situational awareness. But, as near as I can tell, LLMs are, at best, barely consequentialist.
I agree that that is a surprise, on the old school LessWrong / MIRI world view. I had assumed that “intelligence” and “agency” were way more entangled, way more two sides of the same coin, than they apparently are.
And the framing of the article focuses on situational awareness and not on consequentialism because of that error. Because Eliezer (and I) thought at the time that situational awareness would come after consequentialist reasoning in the tech tree.
But I expect that we’ll have consequentialist agents eventually (if not, that’s a huge crux for how dangerous I expect AGI to be), and I expect that you’ll have “off button” problems at the point when you have “enough” consequentialism aimed at some goal, “enough” strategic awareness, and strong “enough” capabilities that the AI can route around the humans and the human safeguards.
In my opinion, the extent to which the linked article is correct is roughly the extent to which the article is saying something trivial and irrelevant.
The primary thing I’m trying to convey here is that we now have helpful, corrigible assistants (LLMs) that can aid us in achieving our goals, including alignment, and the rough method used to create these assistants seems to scale well, perhaps all the way to human level or slightly beyond it.
Even if the post is technically correct because a “consequentialist agent” is still incorrigible (perhaps by definition), and GPT-4 is not a “consequentialist agent”, this doesn’t seem to matter much from the perspective of alignment optimism, since we can just build helpful, corrigible assistants to help us with our alignment work instead of consequentialist agents.
A side-note to this conversation, but I basically still buy the quoted text and don’t think it now looks false in hindsight.
We (apparently) don’t yet have models that have robust longterm-ish goals. I don’t know how natural it will be for models to end up with long term goals: the MIRI view says that anything that can do science will definitely have long-term planning abilities, which fundamentally entails having goals that are robust to changing circumstances. I don’t know if that’s true, but regardless, I expect that we’ll specifically engineer agents with long term goals. (Whether or not those agents will have “robust” long term goals, over and above what they were prompted to do in a specific situation, is also something that I don’t know.)
What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (eg “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/ owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).
My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in.
For instance, your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.
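To make the selection argument concrete, here is a toy sketch in Python. Everything in it is a hypothetical illustration rather than a model of any real system: the names `pick_request`, `asks_permission`, and `agent_value` are made up, and the values are random numbers. It only shows that when every candidate plan already satisfies the injunction, the injunction filters nothing out and the choice is driven entirely by what the agent values.

```python
import random

def pick_request(candidates):
    """Return the permitted plan with the highest value to the agent.

    Each candidate is a dict with two hypothetical fields:
      asks_permission: whether the plan obeys the injunction
      agent_value:     expected value of the plan under the agent's own goals
    """
    # The deontological injunction acts as a hard filter.
    allowed = [c for c in candidates if c["asks_permission"]]
    # Among permitted plans, the consequentialist part picks the best one.
    return max(allowed, key=lambda c: c["agent_value"])

rng = random.Random(0)  # fixed seed so the toy example is reproducible
# A thousand action sequences, all of which ask for permission.
plans = [{"asks_permission": True, "agent_value": rng.random()} for _ in range(1000)]
chosen = pick_request(plans)
```

The chosen plan obeys the rule, yet it is also the plan the agent would have picked with no rule at all, which is the sense in which the injunction adds no safety in this toy setup.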
To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action) or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to be serving some correct enough idealization of human values).
But I think we mostly won’t see this kind of thing until we get quite high levels of capability, where it is transparent to the agent that some ways of asking for permission have higher expected value than others.
Or rather, we might see a little of this effect early on, but until your assistant is superhumanly persuasive, it’s pretty small. Maybe we’ll see a bias toward accepting actions that serve the AI agent’s goals (if we even know what those are) more, as capability goes up, but we won’t be able to distinguish “the AI is getting better at getting what it wants from the human” from “the AIs are just more capable, and so they come up with plans that work better.” It’ll just look like the numbers going up.
To be clear, “superhumanly persuasive” is only one, particularly relevant, example of a superhuman capability that allows you to route around deontological injunctions that the agent is committed to. My claim is weaker if you remove that capability in particular, but mostly what I’m wanting to say is that powerful consequentialism finds and “squeezes through” the gaps in your oversight and control and naive-corrigibility schemes, unless you figure out corrigibility in the Agent Foundations sense.
This is one of the main reasons I’m not excited about engaging with LessWrong. Why bother? It feels like nothing I say will matter. Apparently, no pre-takeoff experiments matter to some folk.[1] And even if I successfully dismantle some philosophical argument, there’s a good chance they will use another argument to support their beliefs instead. Nothing changes.
So there we are. It doesn’t matter what my experiments say, because (it is claimed) there are no testable predictions before The End. But also, everyone important already knew in advance that it’d be easy to get GPT-4 to interpret and execute your value-laden requests in a human-reasonable fashion. Even though ~no one said so ahead of time.
When talking with pre-2020 alignment folks about these issues, I feel gaslit quite often. You have no idea how many times I’ve been told things like “most people already understood that reward is not the optimization target”[2] and “maybe you had a lesson you needed to learn, but I feel like I got this in 2018”, and so on. Almost always this comes from people who seem to still not understand what I’m talking about. I feel fine if[3] they disagree with me about specific ideas, but what really bothers me is the revisionism. It’s so annoying.
Like, just look at this quote from the post you mentioned:
And you probably didn’t even select that post for this particular misunderstanding. (EDIT: Note that I am not accusing Rohin of gaslighting on this topic, and I also think he already understood the “reward is not the optimization target” point when he wrote the above sentence. My critique was that the statement is false and would probably lead readers to incorrect beliefs about the purpose of reward in RL.)
I feel a lot of disappointment and sadness. In 2018, I came to this website when I really needed a new way to understand the world. I’d made a lot of epistemic mistakes I wasn’t proud of, and I didn’t want to live that way anymore. I wanted to think more clearly. I wanted it so, so badly. I came to rely and depend on this place and the fellow users. I looked up to and admired a bunch of them (and I still do so for a few).
But the things you mention—the revisionism, the unfalsifiability, the apparent gaslighting? We set out to do better than science. I think we often do worse.
As a general principle, truths are entangled with each other. It’s OK if a theory’s most extreme prediction (e.g. extinction from AI) is not testable at the current moment. It is a highly suspicious state of affairs if a theory yields no other testable predictions. Truths are generally entangled with each other in intricate and manifold ways. There are generally many clever ways to test a theory, given the necessary will and curiosity.
I could give more concrete examples, but that feels indecorous.
Sometimes I instead get pushback like “it seems to me like I’ve grasped the insights you’re trying to communicate, but I totally acknowledge that I might just not be seeing what you’re saying yet.” I respect and appreciate that response. It communicates the other person’s true perception (that they already understand) while not invalidating or assuming away my perspective.
I get why you feel that way. I think there are a lot of us on LessWrong who are less vocal and more openminded, and less aligned with either optimistic network thinkers or pessimistic agent foundations thinkers. People newer to the discussion and otherwise less polarized are listening and changing their minds in large or small ways.
I’m sorry you’re feeling so pessimistic about LessWrong. I think there is a breakdown in communication happening between the old guard and the new guard you exemplify. I don’t think that’s a product of venue, but of the sheer difficulty of the discussion, and of polarization between different viewpoints on alignment.
I think maintaining a good community falls on all of us. Formats and mods can help, but communities set their own standards.
I’m very, very interested to see a more thorough dialogue between you and similar thinkers, and MIRI-type thinkers. I think right now both sides feel frustrated that they’re not listened to and understood better.
(Presumably you are talking about how reward is not the optimization target.)
While I agree that the statement is not literally true, I am still basically on board with that sentence and think it’s a reasonable shorthand for the true thing.
I expect that I understood the “reward is not the optimization target” point at the time of writing that post (though of course predicting what your ~5-years-ago self knew is quite challenging without specific quotes to refer to).
I am confident I understood the point by the time I was working on the goal misgeneralization project (late 2021), since almost every example we created involved predicting ahead of time a specific way in which reward would fail to be the optimization target.
(I didn’t follow this argument at the time, so I might be missing key context.)
The blog post “Reward is not the optimization target” gives the following summary of its thesis,
I hope it doesn’t come across as revisionist to Alex, but I felt like both of these points were made by people at least as early as 2019, after the Mesa-Optimization sequence came out in mid-2019. As evidence, I’ll point to my post from December 2019 that was partially based on a conversation with Rohin, who seemed to agree with me,
I think in this passage I’m imagining that “reward is not the trained agent’s optimization target” quite explicitly, since I’m pointing out that a neural network trained by RL will not necessarily optimize anything at all. In a subsequent post from January 2020 I gave a more explicit example, said this fact doesn’t merely apply to simple neural networks, and then offered my opinion that “it’s inaccurate to say that the source of malign generalization must come from an internal search being misaligned with the objective function we used during training”.
From the comments, and from my memory of conversations at the time, many people disagreed with my framing. They disagreed even when I pointed out that humans don’t seem to be “optimizers” that select for actions that maximize our “reward function” (I believe the most common response was to deny the premise, and say that humans are actually roughly optimizers. Another common response was to say that AI is different for some reason.)
However, even though some people disagreed with this framing, not everyone did. As I pointed out, Rohin seemed to agree with me at the time, and so at the very least I think there is credible evidence that this insight was already known to a few people in the community by late 2019.
I have no stake in this debate, but how is this particular point any different than what Eliezer says when he makes the point about humans not optimizing for IGF? I think the entire mesaoptimization concern is built around this premise, no?
I didn’t mean to imply that you in particular didn’t understand the reward point, and I apologize for not writing my original comment more clearly in that respect. Out of nearly everyone on the site, I am most persuaded that you understood this “back in the day.”
I meant to communicate something like “I think the quoted segment from Rohin and Dmitrii’s post is incorrect and will reliably lead people to false beliefs.”
Thanks for the edit :)
As I mentioned elsewhere (not this website) I don’t agree with “will reliably lead people to false beliefs”, if we’re talking about ML people rather than LW people (as was my audience for that blog post).
I do think that it’s a reasonable hypothesis to have, and I assign it more likelihood than I would have a year ago (in large part from you pushing some ML people on this point, and them not getting it as fast as I would have expected).
FWIW at the time I wasn’t working on value learning and wasn’t incredibly excited about work in that direction, despite the fact that that’s what the rest of my lab was primarily focussed on. I also wrote a blog post in 2020, based off a conversation I had with Rohin in 2018, where I mention how important it is to work on inner alignment stuff and how those issues got brought up by the ‘paranoid wing’ of AI alignment. My guess is that my view was something like “stuff like reward learning from the state of the world doesn’t seem super important to me because of inner alignment etc, but for all I know cool stuff will blossom out of it, so I’m happy to hear about your progress and try to offer constructive feedback”, and that I expressed that to Rohin in person.
Of course, the fact that I think the same thing now as I did in 2020 isn’t much evidence that I’m right.
My sense is that this is an inevitable consequence of low-bandwidth communication. I have no idea whether you’re referring to me or not, and I am really not saying you are doing so, but I think an interesting example (whether you’re referring to it or not) is some of the threads recently where we’ve been discussing deceptive alignment. My sense is that neither of us has been very persuaded by those conversations, and I claim that’s not very surprising, in a way that’s epistemically defensible for both of us. I’ve spent literal years working through the topic myself in great detail, so it would be very surprising if my view was easily swayed by a short comment chain—and similarly I expect that the same thing is true of you, where you’ve spent much more time thinking about this and have much more detailed thoughts than are easy to represent in a simple comment chain.
My long-standing position has been and continues to be that the only good medium of communication for this sort of stuff is direct, non-public, in-person communication. That being said, obviously that’s not always workable, and I do think that LessWrong is one of the least bad of all the bad options. Certainly I think it’s preferable to any of the other social media platforms on offer—you mention the broader AI community as not liking LessWrong, but I think they mostly use Twitter for this instead, which seems substantially worse on all of the axes that you criticize. My impression of the quality of AI discourse on Twitter on all sides of the AI safety debate has been very negative, with it mostly just rewarding cheap dunks and increasing polarization—e.g. I felt like I saw this a lot during the OpenAI fiasco. At least on LessWrong I think it is still sometimes possible for nuance to be rewarded rather than punished.
FWIW, LessWrong does seem—in at least one or two ways—saner than other communities of similar composition. I agree it’s better than Twitter overall. But in many ways it seems worse than other communities. I don’t know what to do about it, and to be honest I don’t have much faith in e.g. the mods.[1]
Hopefully my comments do something anyways, though. I do have some hope because it seems like a good amount has improved over the last year or two.
Despite thinking that many of them are cool people.
There’s a caveat here. It’s inevitable for communication that veers towards the emotional/subjective/sympathetic.
When the average writer tries to compress it down to a few hundred or thousand letters on a screen it does often seem ridiculous.
Even from moderately above average writers it often sounds more like anxious upper-middle-class virtue signalling than meaningful conversations.
I think it takes a really really clever writer to make it more substantial than that and escape the perception entirely.
On the other hand, discussions of purely objective topics, that are falsifiable and verifiable by independent third parties, don’t suffer the same pitfalls.
As long as you really know what you are talking about, or are willing to learn, even the below average writer can communicate just fine.
Why are you so focused on Eliezer/MIRI yourself? If you think you (or events in general) have adequately shown that their specific concerns are not worth worrying about, maybe turn your attention elsewhere for a bit? For example you could look into other general concerns about AI risk, or my specific concerns about AIs based on shard theory. I don’t think I’ve seen shard theory researchers address many of these yet.
I’ll answer this descriptively.
When I trace the dependencies of common alignment beliefs and claims, a lot of them come back to e.g. RFLO and other ideas put forward by the MIRI cluster. Since I often find myself arguing against common alignment claims, I often argue against the historical causes of those ideas, which involves arguing against MIRI-takes.
I’m personally satisfied that their concerns are (generally) not worth worrying about. However, often people in my social circles are not. And such beliefs will probably have real-world consequences for governance.
Neargroup—I have a few friends who work at MIRI, and debate them on alignment ideas pretty often. I also sometimes work near MIRI people.
Because I disagree with them very sharply, their claims bother me more and are rendered more salient.
I feel bothered about MIRI still (AFAICT) getting so much funding/attention (even though it’s relatively lower than it used to be), because it seems to me that since e.g. 2016 they have released ~zero technical research that helps us align AI in the present or in the future. It’s been five years since they stopped disclosing any of their research, and it seems like no one else really cares anymore. That bothers me.
As to why I haven’t responded to e.g. your concerns in detail:
I currently don’t put much value on marginal theoretical research (even in shard theory, which I think is quite a bit better than other kinds of theory).
I feel less hopeful about LessWrong debate doing much, as I have described elsewhere. It feels like a better use of my time to put my head down, read a bunch of papers, and do good empirical work at GDM.
I am generally worn out of arguing about theory on the website, and have been since last December. (I will note that I have enjoyed our interactions and appreciated your contributions.)
Sounds like to the extent that you do have time/energy for theory, you might want to strategically reallocate your attention a bit? I get that you think a bunch of people are wrong and you’re worried about the consequences of that, but diminishing returns is a thing, and you could be too certain yourself (that MIRI concerns are definitely wrong).
And then empirical versus theory, how much do you worry about architectural changes obsoleting your empirical work? I noticed for example that in image generation GAN was recently replaced by latent diffusion, which probably made a lot of efforts to “control” GAN-based image generation useless.
That aside, “heads down empirical work” only makes sense if you picked a good general direction before putting your head down. Should it not worry people that shard theory researchers do not seem to have engaged with (or better yet, preemptively addressed) basic concerns/objections about their approach?
For what it’s worth, I would be up for a dialogue or some other context where I can make concrete predictions. I do think it’s genuinely hard, since I do think there is a lot of masking of problems going on, and optimization pressure that makes problems harder to spot (both internally in AI systems and institutionally), so asking me to make predictions feels a bit like asking me to make predictions about FTX before it collapsed.
Like, yeah, I expect it to look great, until it explodes. Similarly I expect AI to look pretty great until it explodes. That seems like kind of a core part of the argument for difficulty for me.
I would nevertheless be happy to try to operationalize some bets, and still expect we would have lots of domains where we disagree, and would be happy to bet on those.
If your hypothesis smears probability over a wider range of outcomes than mine, while I can more sharply predict events using my theory of how alignment works—that constitutes a Bayes-update towards my theory and away from yours. Right?
“Anything can happen before the explosion” is not a strength for a theory. It’s a vulnerability. If probability is better-concentrated by any other theories which make claims about both the present and the future of AI, then the noncommittal theory gets dropped.
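To make the Bayes-points claim concrete, here is a minimal sketch of an odds update between two theories. The function name and all the probabilities are hypothetical, chosen purely for illustration; nothing here is data from the discussion. It only shows that a theory which concentrates probability on what actually happens gains odds over one that smears it, and that two theories making identical predictions gain nothing on each other.

```python
import math

def posterior_odds(prior_odds, probs_a, probs_b):
    """Odds of theory A over theory B after a sequence of observations.

    probs_a[i] and probs_b[i] are the probabilities each theory assigned
    to the outcome that was actually observed at step i.
    """
    log_odds = math.log(prior_odds)
    for p_a, p_b in zip(probs_a, probs_b):
        # Each observation multiplies the odds by the likelihood ratio.
        log_odds += math.log(p_a) - math.log(p_b)
    return math.exp(log_odds)

# Hypothetical numbers: a "sharp" theory put 0.8 on each observed outcome,
# a "diffuse" theory put 0.4 on each.
sharp_vs_diffuse = posterior_odds(1.0, [0.8, 0.8, 0.8], [0.4, 0.4, 0.4])

# Two theories that agree on every short-term prediction: odds do not move.
same_predictions = posterior_odds(1.0, [0.6, 0.7], [0.6, 0.7])

print(sharp_vs_diffuse, same_predictions)
```

With these made-up numbers the sharp theory ends at roughly 8:1 odds, while the identical-predictions case stays at 1:1. Whether real alignment theories actually differ on near-term observations is exactly what is in dispute here.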
Sure, yeah, though like, I don’t super understand. My model will probably make the same predictions as your model in the short term. So we both get equal Bayes points. The evidence that distinguishes our models seems further out, and in a territory where there is a decent chance that we will be dead, which sucks, but isn’t in any way contradictory with Bayes rule. I don’t think I would have put that much probability on us being dead at this point, so I don’t think that loses much of any bayes points. I agree that if we are still alive in 20-30 years, then that’s definitely bayes points, and I am happy to take that into account then, but I’ve never had timelines or models that predicted things to look that different from now (or like, where there were other world models that clearly predicted things much better).
No, I don’t think so. The model(s) I use for AGI risk are an outgrowth of the model I use for normal AI research, and so they make tons of detailed predictions. That’s why I have weekly fluctuations in my beliefs about alignment difficulty.
Overall question I’m interested in: What, if any, catastrophic risks are posed by advanced AI? By what mechanisms do they arise, and by what solutions can risks be addressed?
Making different predictions. The most extreme prediction of AI x-risk is that AI presents, well, an x-risk. But theories gain and lose points not just on their most extreme predictions, but on all their relevant predictions.
I have a bunch of uncertainty about how agentic/transformative systems will look, but I put at least 50% on “They’ll be some scaffolding + natural outgrowth of LLMs.” I’ll focus on that portion of my uncertainty in order to avoid meta-discussions on what to think of unknown future systems.
I don’t know what your model of AGI risk is, but I’m going to point to a cluster of adjacent models and memes which have been popular on LW and point out a bunch of predictions they make, and why I think my views tend to do far better.
Format:
Historical claim or meme relevant to models of AI ruin. [Exposition]
[Comparison of model predictions]
The historical value misspecification argument. Consider a model which involves the claim “it’s really laborious and fragile to specify complex human goals to systems, such that the systems actually do what you want.”
This model naturally predicts things like “it’s intractably hard/fragile to get GPT-4 to help people with stuff.” Sure, the model doesn’t predict this with probability 1, but it’s definitely an obvious prediction. (As an intuition pump, if we observed the above, we’d obviously update towards fragility/complexity of value; so since we don’t observe the above, we have to update away from that.)
My models involve things like “most of the system’s alignment properties will come from the training data” (and not e.g. from initialization or architecture), and also “there are a few SGD-plausible generalizations of any large dataset” and also “to first order, overparameterized LLMs generalize how a naive person would expect after seeing the training behavior” (i.e. “edge instantiation isn’t a big problem”). Also “the reward model doesn’t have to be perfect or even that good in order to elicit desired behavior from the policy.” Also noticing that DL just generalizes really well, despite classical statistical learning theory pointing out that almost all expressive models will misgeneralize!
(All of these models offer testable predictions!)
So overall, I think the second view predicts reality much more strongly than the first view.
It’s important to make large philosophical progress on an AI reasoning about its own future outputs. In Constitutional AI, an English-language “constitutional principle” (like “Be nice”) is chosen for each potential future training datapoint. The LLM then considers whether the datapoint is in line with that constitutional principle. The datapoint is later trained on if and only if the LLM concludes that the datapoint accords with the principle. The AI is, in effect, reasoning about its future training process, which will affect its future cognition.
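The filtering step as described in this paragraph can be sketched as follows. This is a toy illustration of the description above, not Anthropic's actual pipeline: `filter_by_principles` and `toy_judge` are hypothetical names, and the judge is a keyword check standing in for an LLM call.

```python
import random

def filter_by_principles(datapoints, principles, judge, rng):
    """Keep a datapoint if and only if the judge says it accords with
    a principle chosen for that datapoint."""
    kept = []
    for datapoint in datapoints:
        principle = rng.choice(principles)  # one principle per datapoint
        if judge(datapoint, principle):
            kept.append(datapoint)
    return kept

def toy_judge(datapoint, principle):
    # Stand-in for "the LLM considers whether the datapoint is in line
    # with the principle". A real judge would query a model; this
    # placeholder ignores the principle and just flags one word.
    return "idiot" not in datapoint.lower()

candidates = [
    "Happy to help with that!",
    "Only an idiot would ask this.",
    "Here is a summary of the paper.",
]
training_set = filter_by_principles(
    candidates, ["Be nice", "Be honest"], toy_judge, random.Random(0)
)
print(training_set)
```

Only the datapoints that pass the judge end up in `training_set`, which is the sense in which the model's own judgment shapes what it is later trained on.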
The above “embedded agency=hard/confusing” model would naturally predict that reflection is hard and that we’d need to put in a lot of work to solve the “reflection problem.” While this setup is obviously a simple, crude form of reflection, it’s still valid. Therefore, the model predicts with increased confidence that constitutional AI would go poorly. But… Constitutional AI worked pretty well! RL from AI feedback also works well! There are a bunch of nice self-supervised alignment-boosting methods (one recent one I read about is RAIN).
One reason this matters: Under the “AGI from scaffolded super-LLMs” model, the scaffolding will probably prompt the LLM to evaluate its own plans. If we observe that current models do a good job of self-evaluation,[1] that’s strong evidence that future models will too. If strong models do a good job of moral and accurate self-evaluation, that decreases the chance that the future AI will execute immoral / bad plans.
I expect AIs to do very well here because AIs will reliably pick up a lot of nice “values” from the training corpus. Empirically that seems to happen, and theoretically you’d get some of the way there from natural abstractions + “there are a few meaningful generalizations” + “if you train the AI to do thing X when you prompt it, it will do thing X when prompted.”
Intelligence is a “package deal” / tool AI won’t work well / intelligence comes in service of goals. There isn’t a way to take AIXI and lop off the “dangerous capabilities” part of the algorithm and then have an AI which can still do clever stuff on your behalf. It’s all part of the argmax, which holds both the promise and peril of these (unrealistic) AIXI agents. Is this true for LLMs?
But what if you just subtract a “sycophancy vector” and add a “truth vector” and maybe subtract a power-seeking vector? According to current empirical results, these modularly control those properties, with minimal apparent reduction in capabilities!
So I think the “intelligence is a package deal” philosophy isn’t holding up that great. (And we had an in-person conversation where I had predicted these steering vector results, and you had expected the opposite.)
The steering vectors were in fact derived using shard theory reasoning (about activating certain shards more or less strongly by adding a direction to the latent space). So this is a strong prediction of my models.
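For readers unfamiliar with the mechanics, here is a minimal sketch of what "adding a direction to the latent space" means arithmetically. It works on plain Python lists rather than a real model, and `steer`, the vectors, and the coefficient are all hypothetical illustrations, not the code or numbers from the steering-vector experiments.

```python
def steer(activation, add=(), subtract=(), coeff=1.0):
    """Return the activation shifted along the given directions.

    activation: list of floats (one hidden-state vector)
    add / subtract: iterables of direction vectors of the same length
    coeff: how strongly to apply each direction
    """
    steered = list(activation)
    for direction in add:
        steered = [a + coeff * d for a, d in zip(steered, direction)]
    for direction in subtract:
        steered = [a - coeff * d for a, d in zip(steered, direction)]
    return steered

# Toy 3-dimensional "residual stream" and made-up directions.
hidden = [1.0, 2.0, 3.0]
truth_vector = [0.0, 1.0, 0.0]       # hypothetical "truth" direction
sycophancy_vector = [1.0, 0.0, 0.0]  # hypothetical "sycophancy" direction

steered = steer(hidden, add=[truth_vector], subtract=[sycophancy_vector], coeff=2.0)
print(steered)
```

In an actual model the shift would be applied to activations at some layer during the forward pass. The empirical claim in the comment is about what that does to behavior, which this arithmetic alone cannot show.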
If intelligence isn’t a package deal, then tool AI becomes far more technically probable (but still maybe not commercially probable). This means we can maybe extract reasonably consequentialist reasoning with “deontological compulsions” against e.g. powerseeking, and have that make the AI agent not want to seek power.
There are certain training assumptions which are likely to be met by future systems but not present systems, (by default and for all powerful systems we expect to know how to build), the AI will develop internal goals which it pursues ~coherently across situations.[2] (This would be a knock against smart tool AI.)
Risks from Learned Optimization posited that a “simple” way to “do well in training” is to learn a unified goal and then a bunch of generalized machinery to achieve that goal. This model naturally predicts that when you train overparameterized networks on a wide range of tasks, then . Even if that network isn’t an AGI.
My MATS 3.0 team and I partially interpreted an overparameterized maze-solving network which was trained to convergence on a wide range of mazes. However, we didn’t find any “simple, unified” goal representation. In fact, it had redundant internal representations of the goal square! Due to how CNNs work, that should be literally meaningless!
That’s a misprediction of the “unified motivations are simple” frame; if we have the theoretical precision to describe the simplicity biases of unknown future systems, that model should crank out good predictions for modern systems too.
I’m happy to bet on any additional experiments related to the above.
There are probably a bunch of other things, and I might come back with more, but I’m getting tired of writing this comment. The main point is that common components of threat models regularly make meaningful mispredictions. My models often do better (though once I misread some data and strongly updated against my models, so I think I’m amenable to counterevidence here). Therefore, I’m able to refine my models of AGI risk. I certainly don’t think we’re in the dark and unable to find experimental evidence.
I expect you to basically disagree about future AI being a separate magisterium or something, but I don’t know why that’d be true.
Often the claimed causes of future doom imply models which make pre-takeoff predictions, as shown above (e.g. fragility of value). But even if your model doesn’t make pre-takeoff predictions… Since my model is unified for both present and future AI, I can keep gaining Bayes points and refining my model! This happens whether or not your model makes predictions here. This is useful insofar as the observations I’m updating on actually update me on mechanisms in my model which are relevant for AGI alignment.
If you think I just listed a bunch of irrelevant stuff, well… I guess I super disagree! But I’ll keep updating anyways.
The Emulated Finetuning paper found that GPT-4 is superhuman at grading helpfulness/harmlessness. In the cases of disagreements between GPT-4 and humans, a more careful analysis revealed that 80% of the time the disagreement was caused by errors in the human judgment, rather than GPT-4’s analysis.
I recently explained more of my skepticism of the coherent-inner-goal claim.
Another point is that I think GPT-4 straightforwardly implies that various naive supervision techniques work pretty well. Let me explain.
From the perspective of 2019, it was plausible to me that getting GPT-4-level behavioral alignment would have been pretty hard, and might have needed something like AI safety via debate or other proposals that people had at the time. The claim here is not that we would never reach GPT-4-level alignment abilities before the end, but rather that a lot of conceptual and empirical work would be needed in order to get models to:
Reliably perform tasks how I intended as opposed to what I literally asked for
Have negligible negative side effects on the world in the course of its operation
Responsibly handle unexpected ethical dilemmas in a way that is human-reasonable
Well, to the surprise of my 2019-self, it turns out that naive RLHF with a cautious supervisor designing the reward model seems basically sufficient to do all of these things in a reasonably adequate way. That doesn’t mean that RLHF scales all the way to superintelligence, but it’s very significant nonetheless and interesting that it scales as far as it does.
You might think “why does this matter? We know RLHF will break down at some point” but I think that’s missing the point. Suppose right now, you learned that RLHF scales reasonably well all the way to John von Neumann-level AI. Or, even more boldly, say, you learned it scaled to 20 IQ points past John von Neumann. 100 points? Are you saying you wouldn’t update even a little bit on that knowledge?
The point at which RLHF breaks down is enormously important to overall alignment difficulty. If it breaks down at some point before the human range, that would be terrible IMO. If it breaks down at some point past the human range, that would be great. To see why, consider that if RLHF breaks down at some point past the human range, that implies that we could build aligned human-level AIs, who could then help us align slightly smarter AIs!
If you’re not updating at all on observations about when RLHF breaks down, then you probably either (1) think it doesn’t matter when RLHF breaks down, or (2) you already knew in advance exactly when it would break down. I think position 1 is just straight-up unreasonable, and I’m highly skeptical of most people who claim position 2. This basic perspective is a large part of why I’m making such a fuss about how people should update on current observations.
What did you think would happen, exactly? I’m curious to learn what your 2019-self was thinking would happen, that didn’t happen.
On the other hand, it could be considered bad news that IDA/Debate/etc. haven’t been deployed yet, or even that RLHF is (at least apparently) working as well as it is. To quote a 2017 post by Paul Christiano (later reposted in 2018 and 2019):
It seems that AI labs are not yet actually holding themselves to producing scalable systems, and it may well be better if RLHF broke down in some obvious way before we reach potentially dangerous capabilities, to force them to do that.
(I’ve pointed Paul to this thread to get his own take, but haven’t gotten a response yet.)
ETA: I should also note that there is a lot of debate about whether IDA and Debate are actually scalable or not, so some could consider even deployment of IDA or Debate (or these techniques appearing to work well) to be bad news. I’ve tended to argue on the “they are too risky” side in the past, but am conflicted because maybe they are just the best that we can realistically hope for and at least an improvement over RLHF?
I think these methods are pretty clearly not indefinitely scalable, but they might be pretty scalable. E.g., perhaps scalable to somewhat smarter than human level AI. See the ELK report for more discussion on why these methods aren’t indefinitely scalable.
A while ago, I think Paul had maybe 50% that with simple-ish tweaks IDA could be literally indefinitely scalable. (I’m not aware of an online source for this, but I’m pretty confident this or something similar is true.) IMO, this seems very predictably wrong.
TBC, I don’t think we should necessarily care very much about whether a method is indefinitely scalable.
Sometimes people do seem to think that debate or IDA could be indefinitely scalable, but this just seems pretty wrong to me (what is your debate about alphafold going to look like...).
I think the first presentation of the argument that IDA/Debate aren’t indefinitely scalable was in Inaccessible Information, fwiw.
I’ve been struggling with whether to upvote or downvote this comment btw. I think the point about how it’s really important when RLHF breaks down and more attention needs to be paid to this is great. But the other point about how RLHF hasn’t broken yet and this is evidence against the standard misalignment stories is very wrong IMO. For now I’ll neither upvote nor downvote.
I agree that if RLHF scaled all the way to von neumann then we’d probably be fine. I agree that the point at which RLHF breaks down is enormously important to overall alignment difficulty.
I think if you had described to me in 2019 how GPT4 was trained, I would have correctly predicted its current qualitative behavior. I would not have said that it would do 1, 2, or 3 to a greater extent than it currently does.
I’m in neither category (1) nor (2); it’s a false dichotomy.
The categories were conditioned on whether you’re “not updating at all on observations about when RLHF breaks down”. Assuming you are updating, then I think you’re not really the type of person who I’m responding to in my original comment.
But if you’re not updating, or aren’t updating significantly, then perhaps you can predict now when you expect RLHF to “break down”? Is there some specific prediction that you would feel comfortable making at this time, such that we could look back on this conversation in 2-10 years and say “huh, he really knew broadly what would happen in the future, specifically re: when alignment would start getting hard”?
(The caveat here is that I’d be kind of disappointed by an answer like “RLHF will break down at superintelligence” since, well, yeah, duh. And that would not be very specific.)
I’m not updating significantly because things have gone basically exactly as I expected.
As for when RLHF will break down, two points:
(1) I’m not sure, but I expect it to happen for highly situationally aware, highly agentic opaque systems. Our current systems like GPT4 are opaque but not very agentic and their level of situational awareness is probably medium. (Also: This is not a special me-take. This is basically the standard take, no? I feel like this is what Risks from Learned Optimization predicts too.)
(2) When it breaks down I do not expect it to look like the failures you described—e.g. it stupidly carries out your requests to the letter and ignores their spirit, and thus makes a fool of itself and is generally thought to be a bad chatbot. Why would it fail in that way? That would be stupid. It’s not stupid.
(Related question: I’m pretty sure on r/chatgpt you can find examples of all three failures. They just don’t happen often enough, and visibly enough, to be a serious problem. Is this also your understanding? When you say these kinds of failures don’t happen, you mean they don’t happen frequently enough to make ChatGPT a bad chatbot?)
Re: Missing the point: How?
Re: Elaborating: Sure, happy to, but not sure where to begin. All of this has been explained before, e.g. in Ajeya’s Training Game report. Also Joe Carlsmith’s thing. Also the original mesaoptimizers paper, though I guess it didn’t talk about situational awareness, idk. Would you like me to say more about what situational awareness is, or what agency is, or why I think both of those together are big risk factors for RLHF breaking down?
From a technical perspective I’m not certain if Direct Preference Optimization is theoretically that much different from RLHF beyond being much quicker and lower friction at what it does, but so far it seems like it has some notable performance gains over RLHF in ways that might indicate a qualitative difference in effectiveness. Running a local model with a bit of light DPO training feels more intent-aligned compared to its non-DPO brethren in a pretty meaningful way. So I’d probably be considering also how DPO scales, at this point. If there is a big theoretical difference, it’s likely in not training a separate model, and removing whatever friction or loss of potential performance that causes.
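For reference, the "not training a separate model" point shows up directly in the DPO objective: the loss is computed from log-probabilities under the policy and a frozen reference model, with no reward model in between. Below is a minimal per-example sketch of the standard DPO loss, -log σ(β·[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]). The function name, argument names, and the β default are illustrative choices, not taken from any particular library.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    logp_*     : log-prob of the response under the policy being trained
    ref_logp_* : log-prob of the same response under the frozen reference
    beta       : strength of the implicit KL-style regularization
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response: loss drops.
improved = dpo_loss(-3.0, -7.0, -5.0, -5.0)
print(baseline, improved)
```

This only illustrates the objective. It says nothing about how DPO scales relative to RLHF, which is the open question raised above.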
What does this mean? I don’t know as much about CNNs as you—are you saying that their architecture allows for the reuse of internal representations, such that redundancy should never arise? Or are you saying that the goal square shouldn’t be representable by this architecture?
There is a reference class judgement in this. If I have a theory of good moves in Go (and absently dabble in chess a little bit), while you have a great theory of chess, looking at some move in chess shouldn’t lead to a Bayes-update against ability of my theory to reason about Go. The scope of classical alignment worries is typically about the post-AGI situation. If it manages to say something uninformed about the pre-AGI situation, that’s something out of its natural scope, and shouldn’t be meaningful evidence either way.
I think the correct way of defeating classical alignment worries (about the post-AGI situation) is on priors, looking at the arguments themselves, not on observations where the theory doesn’t expect to have clear or good predictions (and empirically doesn’t). If the arguments appear weak, there is no recourse without observation of the post-AGI world, it remains weak at least until then. Even if it happened to have made good predictions about the current situation, it shouldn’t count in its favor.
He didn’t say “anything can happen before AI explodes”. He said “I expect AI to look pretty great until it explodes.” And he didn’t say that his model about AGI safety generated that prediction; maybe his model about AGI safety generates some long-run predictions and then he’s using other models to make the “look pretty great” prediction.
Thinking about this:
This is why I hate a lot of the discourse around the mathematical universe hypothesis and the simulation hypothesis: they both predict anything, which is not a strength for these theories. Even though I do think they’re true, they’re just too trivial as theories to work.
Without commenting on how often people do or don’t bet, I think overall betting is great and I’d love to see more of it!
I’m also excited how much of it I’ve seen since Manifold started gaining traction. So I’d like to give a shout out to LessWrong users who are active on Manifold, in particular on AI questions. Some I’ve seen are:
Rob Bensinger
Jonas Vollmer
Arthur Conmy
Jaime Sevilla Molina
Isaac King
Eliezer Yudkowsky
Noa Nabeshima
Mikhail Samin
Daniel Filan
Daniel Kokotajlo
Zvi
Eli Tyre
Ben Pace
Allison Duettmann
Matthew Barnett
Peter Barnett
Joe Brenton
Austin Chen
lc
Good job everyone for betting on your beliefs :)
There are definitely more folks than this: feel free to mention more folks in the comments who you want to give kudos to (though please don’t dox anyone whose name on either platform is pseudonymous and doesn’t match the other).
Others include:
Zack M. Davis
Ben Weinstein-Raun
1a3orn
Tetraspace
Jeremy Gillen
Thomas Kwa
Loppukilpailija
Niplav
Adele Dewey-Lopez
Nate Soares
Aella
Ozzie Gooen
Oliver Habryka
Here’s a couple of mine:
Yeah I mean the answer is, just make prediction markets and bet on them. I think we are getting a lot better at that.
(Also I’m a lesswrong user who makes a lot of prediction markets about AI)
In particular:
A real money version of Yud and Paul’s bet https://polymarket.com/event/will-an-ai-win-the-5-million-ai-math-olympiad-prize-before-august?tid=1702634083181
An attempt at clustering the best AI progress markets into a dashboard https://manifold.markets/dashboard/ai-progress
Yeah, I’m not really happy with the state of discourse on this matter either.
As a proponent of an AI-risk model that does this, I acknowledge that this is an issue, and I indeed feel pretty defensive on this point. Mainly because, as @habryka pointed out and as I’d outlined before, I think there are legitimate reasons to expect no blatant evidence until it’s too late, and indeed, that’s the whole reason AI risk is such a problem. As was repeatedly stated.
So all these moves to demand immediate well-operationalized bets read a bit like tactical social attacks that are being unintentionally launched by people who ought to know better, which are effectively exploiting the territory-level insidious nature of the problem to undermine attempts to combat it, by painting the people pointing out the problem as blind believers. Like challenges that you’re set up to lose if you take them on, but which make you look bad if you turn them down.
And the above, of course, may read exactly like a defense attempt a particularly self-aware blind believer might construct. Which doesn’t inspire much self-doubt in me[1], but it does make me feel like I’m– no, not like I’m sailing against the winds of counterevidence – like I’m playing the social game on the side that’s poised to lose it in the long run, so I should switch up to the winning side to maximize my status, even if its position is wrong.
I’m somewhat hopeful about navigating to some concrete empirical or mathematical evidence within the next couple years. But in the meanwhile, yeah, discussing the matter just makes me feel weary and tired.
(Edit, because I’m concerned I’d been too subtle there: I am not accusing anyone, and especially not @TurnTrout, of deliberately employing social tactics to undermine their opponents rather than cooperatively seeking the truth. I’m only saying that the (usually extremely reasonable) requests for well-operationalized bets effectively have this result in this particular case.
Neither am I suggesting that the position I’m defending should be immune to criticism. Empirical evidence easily tied to well-operationalized bets is usually an excellent way to resolve disagreements and establish truth. But it’s not the only one, and it just so happens that this specific position can’t field many good predictions in this field.)
“But of course it won’t,” you might think – which, fair enough. But what’s your policy for handling problems that really are this insidious?
Your post defending the least forgiving take on alignment basically relies on a sharp/binary property of AGI. IMO a pretty large crux is that either this property probably doesn’t exist or, if it does exist, it is not universal, and I think it tends to be overused.
To be clear, I’m increasingly agreeing with a weak version of the hypothesis, and I also think you are somewhat correct, but I don’t think your stronger hypothesis is correct. I think the lesson of AI progress is that it’s less sharp the more tasks you want and the more general intelligence you want, which is in opposition to your hypothesis that AI progress is sharp.
I actually kinda agree with you here, but unfortunately, this is very, very important, since your allies are trying to gain real-life political power over AI, and given this is extremely impactful, it is basically required for us to discuss it.
There’s a bit of “one man’s modus ponens is another’s modus tollens” going on. I assume that when you look at a new AI model, and see how it’s not doing instrumental convergence/value reflection/whatever, you interpret it as evidence against “canonical” alignment views. I interpret it as evidence that it’s not AGI yet; or sometimes, even evidence that this whole line of research isn’t AGI-complete.
E.g., I’ve updated all the way on this in the case of LLMs. I think you can scale them a thousandfold, and it won’t give you AGI. I’m mostly in favour of doing that, too, or at least fully realizing the potential of the products already developed. Probably same for Gemini and Q*. Cool tech. (Well, there are totalitarianism concerns, I suppose.)
I also basically agree with all the takes in the recent “AI is easy to control” post. But what I take from it isn’t “AI is safe”, it’s “the current training methods aren’t gonna give you AGI”. Because if you put a human – the only known type of entity with the kinds of cognitive capabilities we’re worrying about – into a situation isomorphic to a DL AI’s, the human would exhibit all the issues we’re worrying about.
Like, just because something has a label of “AI” and is technically an AI doesn’t mean studying it can give you lessons about “AGI”, the scary lightcone-eating thing all the fuss is about, yeah? Any more than studying GOFAI FPS bots is going to teach you lessons about how LLMs work?
And that the Deep Learning paradigm can probably scale to AGI doesn’t mean that studying the intermediary artefacts it’s currently producing can teach us much about the AGI it’ll eventually spit out. Any more than studying a MNIST-classifier CNN can teach you much about LLMs; any more than studying squirrel neurology can teach you much about winning moral-philosophy debates.
That’s basically where I’m at. LLMs and similar systems are just in the entirely wrong reference class for studying “generally intelligent”/scary systems.
No, but my point here is that once we increase the complexity of the domain, and require more tasks to be done, things start to smooth over, and progress isn’t nearly as sharp.
I suspect a big part of that is the effects of Amdahl’s law kicking in combined with Baumol’s cost disease and power law scaling, which means you are always bottlenecked on the least automatable and doable tasks, so improvements in one area like Go don’t exactly matter as much as you think.
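The Amdahl’s law part of this can be made concrete with the standard formula: overall speedup = 1 / ((1 − p) + p / s), where p is the fraction of the work that gets sped up and s is the speedup on that fraction. A minimal sketch with made-up fractions (the function name and the numbers are illustrative only):

```python
def amdahl_speedup(automatable_fraction, factor):
    """Overall speedup when only part of the work gets faster.

    automatable_fraction: share of total work that is sped up (0 to 1)
    factor: how many times faster that share becomes
    """
    remaining = 1.0 - automatable_fraction
    return 1.0 / (remaining + automatable_fraction / factor)

# Hypothetical: 90% of tasks become a billion times faster,
# yet the overall speedup stays below 10x because of the other 10%.
capped = amdahl_speedup(0.9, 1e9)

# Hypothetical: half the tasks become twice as fast.
modest = amdahl_speedup(0.5, 2.0)
print(capped, modest)
```

However large the factor gets, the result is bounded by 1 / (1 − p), which is the sense in which the least automatable tasks set the pace.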
I’d say the main lesson of AI progress, one that might even have been formulatable in the 1970s-1980s days, is that compute and data were the biggest factors, by a wide margin, and these grow smoothly. Only now are algorithms starting to play a role, and even then, it’s only because of the fact that transformers turn out to be fairly terrible at generalizing or doing stuff, which is related to your claim about LLMs being not real AGI, but I think this effect is weaker than you think, and I’m sympathetic to the continuous view as well. There probably will be some discontinuities, but IMO LWers have fairly drastically overstated how discontinuous progress was, especially if we realize that a lot of the outliers were likely simpler than the real world. (Though Go comes close to it, at least for its domain, the problem is that the domain is far too small to matter.)
I think this roughly tracks how we updated, though there was a brief phase where I became more pessimistic as I learned that LLMs probably weren’t going to scale to AGI, which broke a few of my alignment plans, but I found other reasons to be more optimistic that didn’t depend on LLMs nearly as much.
While I think it’s fine enough to update towards “it’s not going to have any impact on anything, and that’s the reason it’s safe,” I worry that this is basically defining away the possibility of safety, and thus making the model useless:
I think a potential crux here is whether to expect some continuity at all, or whether there is reason to expect a discontinuous step change for AI, which is captured in this post: https://www.lesswrong.com/posts/cHJxSJ4jBmBRGtbaE/continuity-assumptions
I basically disagree entirely with that, and I’m extremely surprised you claimed that. If we grant that we get the same circumstances to control humans as we can do for DL AIs, then alignment becomes basically trivial in my view, since human control research would have way better ability to study humans, and in particular there is no IRB/FDA or regulation to control you, which would be huge changes to how science basically works today. It may take a lot of brute force work, but I think it basically becomes trivial to align human beings if humans could be put into a situation isomorphic to DL AIs.
As far as producing algorithms that are able to, once trained on a vast dataset of [A, B] samples, interpolate a valid completion B for an arbitrary prompt sampled from the distribution of A? Yes, for sure.
As far as producing something that can genuinely generalize off-distribution, strike way outside the boundaries of interpolation? Jury’s still out.
Like, I think my update on all the LLM stuff is “boy, who knew interpolation can get you this far?”. The concept-space sure turned out to have a lot of intricate structure that could be exploited via pure brute force.
Oh, I didn’t mean “if we could hook up a flesh-and-blood human (or a human upload) to the same sort of cognition-shaping setup as we subject our AIs to”. I meant “if the forward-pass of an LLM secretly simulated a human tasked with figuring out what token to output next”, but without the ML researchers being aware that it’s what’s going on, and with them still interacting with the thing as with a token-predictor. It’s a more literal interpretation of the thing sometimes called an “inner homunculus”.
I’m well aware that the LLM training procedure is never going to result in that. I’m just saying that if it did, and if the inner homunculus became smart enough, that’d cause all the deceptive-alignment/inner-misalignment/wrapper-mind issues. And that if you’re not modeling the AI as being/having a homunculus, you’re not thinking about an AGI, so it’s no wonder the canonical AI-risk arguments fail for that system and it’s no wonder it’s basically safe.
I’d say this still applies even to non-LLM architectures like RL, which is the important part, but Jacob Cannell and 1a3orn will have to clarify.
I agree, but with a caveat: I think we do have enough evidence to rule out algorithms being of extreme importance, à la Eliezer, and compute is not negligible. Epoch estimates a 50/50 split between compute and algorithmic progress being important. Algorithmic progress will likely matter IMO, just not nearly as much as some LWers think it will.
I definitely updated something in this direction, which is important, but I now think the AI optimist arguments are general enough to not rely on LLMs, and sometimes not even relying on a model of what future AI will look like beyond the fact that capabilities will grow, and people expect to profit from it.
Not automatically, and there are potential paths to AGI like Steven Byrnes’s path to Brain-like AGI that either outright avoid deceptive alignment altogether or make it far easier to solve (the short answer is that Steven Byrnes suspects there’s a simple generator of value, so simple that it’s dozens of lines long and if that’s the case, then the corrigible alignment/value learning agent’s simplicity gap is either 0, negative, or a very small positive gap, so small that very little data is required to pick out the honest value learning agent over the deceptive aligned agent, and we have a lot of data on human values, so this is likely to be pretty easy.)
I think a crux is that I think that AIs will basically always have much more white-boxness to them than any human mind, and I think that for a lot of future paradigms of AI, including the ones that scale to superintelligence, the point that AI control research is easier will still mostly be true, especially since I think AI control is fundamentally very profitable and AIs have no legal rights/IRBs to slow down control research.
Mm, I think the “algorithms vs. compute” distinction here doesn’t quite cleave reality at its joints. Much as I talked about interpolation before, it’s a pretty abstract kind of interpolation: LLMs don’t literally memorize the data points, their interpolation relies on compact generative algorithms they learn (but which, I argue, are basically still bounded by the variance in the data points they’ve been shown). The problem of machine learning, then, is in finding some architecture + training-loop setup that would, over the course of training, move the ML model towards implementing some high-performance cognitive algorithms.
It’s dramatically easier than hard-coding the algorithms by hand, yes, and the learning algorithms we do code are very simple. But you still need to figure out in which direction to “push” your model first. (Pretty sure if you threw 2023 levels of compute at a Very Deep fully-connected NN, it won’t match a modern LLM’s performance, won’t even come close.)
So algorithms do matter. It’s just our way of picking the right algorithms consists of figuring out the right search procedure for these algorithms, then throwing as much compute as we can at it.
So that’s where, I would argue, the sharp left turn would lie. Not in-training, when a model’s loss suddenly drops as it “groks” general intelligence. (Although that too might happen.) It would happen when the distributed optimization process of ML researchers tinkering with training loops stumbles upon a training setup that actually pushes the ML model in the direction of the basin of general intelligence. And then that model, once scaled up enough, would suddenly generalize far off-distribution. (Indeed, that’s basically what happened in the human case: the distributed optimization process of evolution searched over training architectures, and eventually stumbled upon one that was able to bootstrap itself into taking off. The “main” sharp left turn happens during the architecture search, not during the training.)
And I’m reasonably sure we’re in an agency overhang, meaning that the newborn GI would pass human intelligence in an eye-blink. (And if it won’t, it’ll likely stall at incredibly unimpressive sub-human levels, so the ML researchers will keep tinkering with the training setups until finding one that does send it over the edge. And there’s no reason whatsoever to expect it to stall again at the human level, instead of way overshooting it.)
Which human’s values? IMO, “the AI will fall into the basin of human values” is kind of a weird reassurance, given the sheer diversity of human values – diversity that very much includes xenophobia, genocide, and petty vengeance scaled up to geopolitical scales. And stuff like RLHF designed to fit the aesthetics of modern corporations doesn’t result in deeply thoughtful cosmopolitan philosophers – it results in sycophants concerned with PR as much as with human lives, and sometimes (presumably when not properly adapted to a new model’s scale) in high-strung yanderes.
Let’s grant the premise that the AGI’s values will be restricted to the human range (which I don’t really buy). If the quality of the sample within the human range that we pick will be as good as what GPT-4/Sydney’s masks appeared to be? Yeah, I don’t expect humans to stick around for a while after.
Actually I think the evidence is fairly conclusive that the human brain is a standard primate brain with the only change being merely a few compute scale dials increased (the number of distinct gene changes is tiny: something like 12 from what I recall). There is really nothing special about the human brain other than 1.) 3x larger than expected size, and 2.) extended neoteny (longer training cycle). Neuroscientists have looked extensively for other ‘secret sauce’ and we now have some confidence in a null result: no secret sauce, just much more training compute.
Yes, but: whales and elephants have brains several times the size of humans, and they’re yet to build an industrial civilization. I agree that hitting upon the right architecture isn’t sufficient, you also need to scale it up – but scale alone doesn’t suffice either. You need a combination of scale, and an architecture + training process that would actually transmute the greater scale into more powerful cognitive algorithms.
Evolution stumbled upon the human/primate template brain. One of the forks of that template somehow “took off” in the sense of starting to furiously select for larger brain size. Then, once a certain compute threshold was reached, it took a sharp left turn and started a civilization.
The ML-paradigm analogue would, likewise, involve researchers stumbling upon an architecture that works well at some small scales and has good returns on compute. They’ll then scale it up as far as it’d go, as they’re wont to. The result of that training run would spit out an AGI, not a mere bundle of sophisticated heuristics.
And we have no guarantees that the practical capabilities of that AGI would be human-level, as opposed to vastly superhuman.
(Or vastly subhuman. But if the maximum-scale training run produces a vastly subhuman AGI, the researchers would presumably go back to the drawing board, and tinker with the architectures until they selected for algorithms with better returns on intelligence per FLOPS. There are likewise no guarantees that this higher-level selection process would somehow result in an AGI of around human level, rather than vastly overshooting it the first time they properly scale it up.)
Size/capacity isn’t all, but in terms of the capacity which actually matters (synaptic count, and upper cortical neuron count), from what I recall elephants are at great ape cortical capacity, not human capacity. A few specific species of whales may be at or above human cortical neuron capacity but synaptic density was still somewhat unresolved last I looked.
Human language/culture is more the cause of our brain expansion, not just the consequence. The human brain is impressive because of its relative size and oversized cost to the human body. Elephants/whales are huge and their brains are much smaller and cheaper comparatively. Our brains grew 3x too large/expensive because it was valuable to do so. Evolution didn’t suddenly discover some new brain architecture or trick (it already had that long ago). Instead there were a number of simultaneous whole body coadaptations required for larger brains and linguistic technoculture to take off: opposable thumbs, expressive vocal cords, externalized fermentation (gut is as energetically expensive as brain tissue, so something had to go), and yes larger brains, etc.
Language enabled a metasystems transition similar to the origin of multicellular life. Tribes formed as new organisms by linking brains through language/culture. This is not entirely unprecedented: insects are also social organisms of course, but their tiny brains aren’t large enough for interesting world models. The resulting new human social organisms had intergenerational memory that grew nearly unbounded with time and creative search capacity that scaled with tribe size.
You can separate intelligence into world model knowledge (crystal intelligence) and search/planning/creativity (fluid intelligence). Humans are absolutely not special in our fluid intelligence—it is just what you’d expect for a large primate brain. Humans raised completely without language are not especially more intelligent than animals. All of our intellectual super powers are cultural. Just as each cell can store the DNA knowledge of the entire organism, each human mind ‘cell’ can store a compressed version of much of human knowledge and gains the benefits thereof.
The cultural metasystems transition which is solely responsible for our intellectual capability is a one-time qualitative shift that will never reoccur. AI will not undergo the same transition; that isn’t how these work. The main advantage of digital minds is just speed, and to a lesser extent, copying.
We’ve basically known how to create AGI for at least a decade. AIXI outlines the 3 main components: a predictive world model, a planning engine, and a critic. The brain also clearly has these 3 main components, and even somewhat cleanly separated into modules—that’s been clear for a while.
Transformer LLMs are pretty much exactly the type of generic minimal ULM arch I was pointing at in that post (though I obviously couldn’t predict the name). On a compute scaling basis GPT4 training at 1e25 flops uses perhaps a bit more than human brain training, and it’s clearly not quite AGI, but mainly because it’s mostly just a world model with a bit of critic: planning is still missing. But its capabilities are reasonably impressive given that the architecture is more constrained than a hypothetical more directly brain equivalent fast-weight RNN of similar size.
Anyway I don’t quite agree with the characterization that these models are just “interpolating valid completions of any arbitrary prompt sampled from the distribution”. Human intelligence also varies widely on a spectrum with tradeoffs between memorization and creativity. Current LLMs mostly aren’t as creative as the more creative humans and are more impressive in breadth of knowledge, but eh part of that could be simply that they currently completely lack the component essential for creativity? That they accomplish so much without planning/search is impressive.
Interestingly that is closer to my position, and I thought that Byrnes thought the generator of value was somewhat more complex, although our views are admittedly fairly similar in general.
This paragraph doesn’t seem like an honest summary to me. Eliezer’s position in the dialogue, as I understood it, was:
The journey is a lot harder to predict than the destination. Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”. Eliezer isn’t claiming to have secret insights about the detailed year-to-year or month-to-month changes in the field; if he thought that, he’d have been making those near-term tech predictions already back in 2010, 2015, or 2020 to show that he has this skill.
From Eliezer’s perspective, Paul is claiming to know a lot about the future trajectory of AI, and not just about the endpoints: Paul thinks progress will be relatively smooth and continuous, and thinks it will get increasingly smooth and continuous as time passes and more resources flow into the field. Eliezer, by contrast, expects the field to get choppier as time passes and we get closer to ASI.
A way to bet on this, which Eliezer repeatedly proposed but wasn’t able to get Paul to do very much, would be for Paul to list out a bunch of concrete predictions that Paul sees as “yep, this is what smooth and continuous progress looks like”. Then, even though Eliezer doesn’t necessarily have a concrete “nope, the future will go like X instead of Y” prediction, he’d be willing to bet against a portfolio of Paul-predictions: when you expect the future to be more unpredictable, you’re willing to at least weakly bet against any sufficiently ambitious pool of concrete predictions.
(Also, if Paul generated a ton of predictions like that, an occasional prediction might indeed make Eliezer go “oh wait, I do have a strong prediction on that question in particular; I didn’t realize this was one of our points of disagreement”. I don’t think this is where most of the action is, but it’s at least a nice side-effect of the person-who-thinks-this-tech-is-way-more-predictable spelling out predictions.)
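A rough way to see why someone who expects a choppier future would bet against a whole portfolio, even without a specific counter-prediction: the sketch below computes the chance that at least k of n predictions come true, treating them as independent (an assumption made here for simplicity, not something either party claimed). The 0.8 and 0.6 per-prediction probabilities and the function name are hypothetical, chosen only for illustration.

```python
def prob_at_least_k_hits(probs, k):
    """Chance that at least k predictions come true, assuming independence."""
    # dist[j] = probability of exactly j hits among the predictions seen so far
    dist = [1.0]
    for p in probs:
        new_dist = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new_dist[j] += mass * (1.0 - p)   # this prediction misses
            new_dist[j + 1] += mass * p       # this prediction hits
        dist = new_dist
    return sum(dist[k:])


# Hypothetical numbers: ten concrete predictions about "smooth progress".
confident_view = [0.8] * 10   # forecaster who finds the path predictable
skeptical_view = [0.6] * 10   # forecaster who expects a choppier path

for label, view in [("confident", confident_view), ("skeptical", skeptical_view)]:
    print(label, round(prob_at_least_k_hits(view, 8), 3))
```

Under these made-up numbers the two views assign quite different odds to "at least 8 of 10 come true", so there is a price at which both sides would be happy to take opposite ends of the bet.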
Eliezer was also more interested in trying to reach mutual understanding of the views on offer, as opposed to “let’s bet on things immediately, never mind the world-views”. But insofar as Paul really wanted to have the bets conversation instead, Eliezer sunk an awful lot of time into trying to find operationalizations Paul and he could bet on, over many hours of conversation.
If your end-point take-away from that (even after actual bets were in fact made, and tons of different high-level predictions were sketched out) is “wow how dare Eliezer be so unwilling to make bets on anything”, then I feel a lot less hope that world-models like Eliezer’s (“long-term outcome is more predictable than the detailed year-by-year tech pathway”) are going to be given a remotely fair hearing.
(Also, in fairness to Paul, I’d say that he spent a bunch of time working with Eliezer to try to understand the basic methodologies and foundations for their perspectives on the world. I think both Eliezer and Paul did an admirable job going back and forth between the thing Paul wanted to focus on and the thing Eliezer wanted to focus on, letting us look at a bunch of different parts of the elephant. And I don’t think it was unhelpful for Paul to try to identify operationalizations and bets, as part of the larger discussion; I just disagree with TurnTrout’s summary of what happened.)
Thanks for your feedback. I certainly appreciate your articles and I share many of your views. Reading what you had to say, along with Quentin, Jacob Cannell, and Nora, was a very welcome alternative take that expanded my thinking and changed my mind. I have changed my mind a lot over the last year, from thinking AI was a long way off and Yud/Bostrom were basically right to seeing that it’s a lot closer and theories without data are almost always wrong in many ways, e.g. SUSY was expected to be true for decades by most of the world’s smartest physicists. Many alignment ideas before GPT3.5 are either sufficiently wrong or irrelevant to do more harm than good.
Especially, I think, the over-dependence on analogies such as evolution. Sure, when we had nothing to go on it was a start, but when data comes in, ideas based on analogies should be gone pretty fast if they disagree with hard data.
(Some background: I have read the site for over 10 years, have followed AI for my entire career, have an understanding of Maths and Psychology, and have built and deployed a very small NN model commercially. Also, as an aside, I distinctly remember being surprised that Yud was skeptical of NN/DL in the earlier days when I considered it obviously where AI progress would come from; I don’t have references because I didn’t think that would be disputed afterwards.)
I am not sure what the silent majority belief on this site is (counted by people, not karma). Is Yud’s worldview basically right or wrong?
analogies based on evolution should be applied at the evolutionary scale: between competing organizations.
Well, they definitely can be applied there, though perhaps it’s a stage further than analogy and a direct application of theory? Then of course data can agree/disagree.
gradient descent is not evolution and does not behave like evolution. it may still have problems one can imagine evolution having, but you can’t assume facts about evolution generalize—it’s in fact quite different.
I really don’t want to go down a rabbit hole here, so probably won’t engage in further discussion, but I just want to chime in here and say that I’m pretty sure lots of the world’s smartest physicists (not sure what fraction) still expect the fundamental laws of physics in our universe to have (broken) supersymmetry, and I would go further and say that they have numerous very good reasons to expect that, like gauge coupling unification etc. Same as ever. The fact that supersymmetric partners were not found at LHC is nonzero evidence against supersymmetric partners existing, but it’s not strong evidence against them existing, because LHC was very very far from searching the whole space of possibilities. Also, we pretty much know for a fact that the universe contains at least one other yet-to-be-discovered elementary particle beyond the 17 (or whatever, depends on how you count) particles in the Standard Model. So I think it’s extremely premature to imply that the prediction of yet-to-be-discovered supersymmetric partner particles has been ruled out in our universe and haha look at those overconfident theoretical physicists. (A number of specific SUSY-involving theories have been ruled out, but I think the smart physicists knew all along that those were just plausible hypotheses worth checking, not confident theoretical predictions.)
OK, you are answering at a level more detailed than I raised and seem to assume I didn’t consider such things. My meaning, and IMO the expected reading, of “SUSY has failed” is not that such particles have been ruled out, as I know they haven’t, but that its theoretical benefits are severely weakened or entirely ruled out according to recent data. My reference to SUSY was specifically regarding its opportunity to solve the Hierarchy Problem. This is the common understanding of one of the reasons it was proposed.
I stand by my claim that many/most of the top physicists expected for >1 decade that it would help solve such a problem. I disagree with the claim:
“but I think the smart physicists knew all along that those were just plausible hypotheses worth checking, ” Smart physicists thought SUSY would solve the hierarchy problem.
----
Common knowledge, from GPT4:
“can SUSY still solve the Hierarchy problem with respect to recent results”
Hierarchy Problem: SUSY has been considered a leading solution to the hierarchy problem because it naturally cancels out the large quantum corrections that would drive the Higgs boson mass to a very high value. However, the non-observation of supersymmetric particles at expected energy levels has led some physicists to question whether SUSY can solve the hierarchy problem in its simplest forms.
Fine-Tuning: The absence of low-energy supersymmetry implies a need for fine-tuning in the theory, which contradicts one of the primary motivations for SUSY as a solution to the hierarchy problem. This has led to exploration of more complex SUSY models, such as those with split or high-scale supersymmetry, where SUSY particles exist at much higher energy scales.
----
IMO ever more complex models rapidly become like epicycles.
I think this will depend strongly on where you draw the line on “basically”. I think the majority probably thinks:
AI is likely to be a really big deal
Existential risk from AI is at least substantial (e.g. >5%)
AI takeoff is reasonably likely to happen quite quickly in wall clock time if this isn’t actively prevented (e.g. AI will cause there to be <10 years from a 20% annualized GDP growth rate to a 100x annualized growth rate)
The power of full technological maturity is extremely high (e.g. nanotech, highly efficient computing, etc.)
But, I expect that the majority of people don’t think:
Inside view, existential risk is >95%
A century of dedicated research on alignment (targeted as well as society would realistically do) is insufficient to get risk <15%.
Which I think are both beliefs Yudkowsky has.
For me -
Yes to AI being a big deal and extremely powerful (yes, I doubt anyone would be here otherwise).
Yes. I don’t think anyone can reasonably claim it’s <5%, but then neither is the risk from not having AI, if x-risk is defined to be humanity missing practically all of its Cosmic endowment.
Maybe—Even with slow takeoff, and hardware constrained you get much greater GDP, though I don’t agree with 100x (for the critical period that is, 100x could happen later). E.g. car factories are made to produce robots, we get 1-10 billion more minds and bodies per year, but not quite 100X. ~10x per year is enough to be extremely disruptive and x-risk anyway.
---
(1)
Yes I don’t think x-risk is >95% - say 20% as a very rough guess that humanity misses all its Cosmic endowment. I think AI x-risk needs to be put in this context—say you ask someone
“What’s the chance that humanity becomes successfully interstellar?”
If they say 50/50 then being OK with any AI x-risk less than 50% is quite defensible if getting AI right means that it’s practically certain you get your cosmic endowment etc.
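The comparison here is simple arithmetic, sketched below under strong simplifying assumptions that are added for illustration rather than taken from the comment: the only two scenarios are "build AI" and "don't", and the no-AI baseline is a single number. Function and parameter names are hypothetical.

```python
def p_endowment_with_ai(ai_xrisk, p_success_if_ai_goes_well=1.0):
    """Chance of reaching the cosmic endowment if AI is built."""
    return (1.0 - ai_xrisk) * p_success_if_ai_goes_well


def breakeven_ai_xrisk(p_endowment_without_ai, p_success_if_ai_goes_well=1.0):
    """AI x-risk at which building AI is exactly as good as the baseline."""
    return 1.0 - p_endowment_without_ai / p_success_if_ai_goes_well


baseline = 0.5  # the "50/50" answer to the interstellar question
print(breakeven_ai_xrisk(baseline))   # risk level where the two options tie
print(p_endowment_with_ai(0.2))       # using the 20% rough guess from above
```

If success after aligned AI is less than "practically certain", the break-even risk drops accordingly, which is why the conclusion leans so heavily on that premise.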
---
(2)
I do think it’s defensible that a century of dedicated research on alignment doesn’t get risk <15%, but because alignment research is only useful a little bit in advance of capabilities: say we had a 100 year pause, then I wouldn’t have confidence in our alignment plan at the end of it.
Anyway, regarding x-risk, I don’t think there is a completely safe path. Go too fast with AI and there is obvious risk; go too slow and there are other obvious risks. Our current situation is likely unstable. For example, the famous quote
“If you want a picture of the future, imagine a boot stamping on a human face— forever.”
I believe that is now possible with current tech, where it was not for, say, Soviet Russia. So we may be in the situation where societies can go 1984 totalitarian bad, but not come back because our tech coordination skills are sufficient to stop centralized empires from collapsing. LLMs of course make censorship even easier. (I am sure there are other ways our current tech could destroy most societies also.)
If that’s the case, a long pause could result in all power being in such societies which when the pause ended would be very likely to screw up alignment.
That makes me unsure what regulation to advocate for, though I am in favor of slowing down hardware AI progress but fully exploring the capabilities of our current HW.
Most importantly I think we should hugely speed up Neuralink type devices and brain uploading. I would identify much more with an uploaded human that was then carefully, appropriately upgraded to superintelligence than an alternative path where a pure AI superintelligence was made.
We have to accept that we live in critical times and just slowing things down is not necessarily the safest option.
Yup, and this is why I’m more excited to supervise MATS mentees who haven’t read The Sequences.
Hi there.
> (High confidence) I feel like the project of thinking more clearly has largely fallen by the wayside, and that we never did that great of a job at it anyways.
I’m new to this community. I’ve skimmed quite a few articles, and this sentence resonates with me for several reasons.
1) It’s very difficult in general to find websites like LessWrong these days. And among the few that exist, I’ve found that the intellectuals on them are so incredibly doubtful of their own intellect. This creates a sort of Ouroboros phenomenon where intellects just eat themselves into oblivion. Like, maybe I’m wrong but this site’s popularity seems to be going down?
2) At least from what I’ve noticed, when I compare articles from the last 2 months to ones from about a decade ago, there is an alarming truth in your sentence. A decade ago, there were questions left in the articles for commenters to answer, and there was a willingness to change one’s mind and to add/enhance ideas in a good faith manner. Now, it seems that many have confused this website for LinkedIn, posting their own personal paper trails (which are largely in a tone that isn’t unique anyways).
It’s really unfortunate, since I was excited upon being greeted with much older articles. And then realising “Oh… that was from… holy! 10 years ago!?” To then be disappointed by our articles from today.
I think that might be a result of how the topic is, well, just really fucking grim. I think part of what allows discussion of it and thought about it for a lot of people (including myself) is a certain amount of detachment. “AI doomers” often get accused of being LARPers or not taking their own ideas seriously because they don’t act like people who believe the world is ending in 10 years, but I’d flip that around: a person who believes the world is ending in 10 years probably acts absolutely insane, and so people, to keep as much sanity as possible, establish a sort of barrier and discuss these things as they would a game or a really interesting scientific question. But actually placing a bet on it? Shorting your own future on the premise that you won’t have a future? That breaks the barrier, and it becomes just really uncomfortable. I know I’d still rather live as if I was dead wrong no matter how confident I am in being theoretically right. I wonder in fact whether this feeling was shared by e.g. game theorists working on nuclear strategy.
I think there are some great points in this comment but I think it’s overly negative about the LessWrong community. Sure, maybe there is a vocal and influential minority of individuals who are not receptive to or appreciative of your work and related work. But I think a better measure of the overall community’s culture than opinions or personal interactions is upvotes and downvotes which are much more frequent and cheap actions and therefore more representative. For example, your posts such as Reward is not the optimization target have received hundreds of upvotes, so apparently they are positively received.
LessWrong these days is huge, with probably over 100,000 monthly readers, so I think it’s challenging to summarize its culture in any particular way (e.g. probably most users on LessWrong live outside the bay area and maybe even outside the US). I personally find that LessWrong as a whole is fairly meritocratic and not that dogmatic, and that a wide variety of views are supported provided that they are sufficiently well-argued.
In addition to LessWrong, I use some other related sites such as Twitter, Reddit, and Hacker News and although there may be problems with the discourse on LessWrong, I think it’s generally significantly worse on these other sites. Even today, I’m sure you can find people saying things on Twitter about how AIs can’t have goals or that wanting paperclips is stupid. These kinds of comments wouldn’t be tolerated on LessWrong because they’re ignorant and a waste of time. Human nature can be prone to ignorance, rigidness of opinions and so on but I think the LessWrong walled garden has been able to counteract these negative tendencies better than most other sites.
No disagreement here that this place does this. I also think we should attempt to change many of these things. However, I don’t expect the lesswrong team to do anything sufficiently drastic to counter the hero-worship. Perhaps they could consider hiding usernames by default, hiding vote counts until things have been around for some period of time, etc.
Hmm, my sense is Eliezer very rarely comments, and the people who do comment a lot don’t have a ton of hero worship going on (like maybe Wentworth?). So I don’t super believe that hiding usernames would do much about this.
Agree, and my guess is that the hero worship, to the extent it happens, is caused by something like
for Eliezer: people finding the rationality community and observing that they were less crazy than most other communities about various things, and Eliezer was a very prolific and persuasive writer
for Paul: Paucity of empirical alignment work before 2021 meant that Paul was one of the few people with formal CS experience and good alignment ideas, and had good name recognition due to posting on LW
Both of these seem to be solving themselves.
I think one of the issues with Eliezer is that he sees himself as a hero, and it comes through both explicitly and in vibes in the writing, and Eliezer is also a persuasive writer.
What is wrong with seeing oneself as a hero?
Nothing wrong with it, in fact I recommend it. But seeing oneself as a hero and persuading others of it will indeed be one of the main issues leading to hero worship.
how would you operationalize a bet on this? I’d take “yes” on “will hiding usernames by default decrease hero worship on lesswrong” on manifold, if you want to do an AB test or something.
Hacker News shows you the vote counts on your comments privately. I think that’s a significant improvement. It nudges people towards thinking for themselves rather than trying to figure out where the herd is going. At least, I think it does, because HN seems to have remarkable viewpoint diversity compared with other forums.
I think it’s fine for there to be a status hierarchy surrounding “good alignment research”. It’s obviously bad if that becomes mismatched with reality, as it almost certainly is to some degree, but I think people getting prestige for making useful progress is essentially what has to happen for it to be done at all.
If we aren’t good at assessing alignment research, there’s the risk that people substitute the goal of “doing good alignment research” with “doing research that’s recognized as good alignment research”. This could lead to a feedback loop where a particular notion of “good research” gets entrenched: Research is considered good if high status researchers think it’s good; the way to become a high status researcher is to do research which is considered good by the current definition, and have beliefs that conform with those of high status researchers.
A number of TurnTrout’s points were related to this (emphasis mine):
I’d like to see more competitions related to alignment research. I think it would help keep assessors honest if they were e.g. looking at 2 anonymized alignment proposals, trying to compare them on a point-by-point basis, figuring out which proposal has a better story for each possible safety problem. If competition winners subsequently become high status, that could bring more honesty to the entire ecosystem. Teach people to focus on merit rather than politics.
Would you mind sharing some specific examples? (Not of people, but of beliefs.)