On Twitter, Eric Rogstad wrote:

“the thing where it keeps being literally him doing this stuff is quite a bad sign”
I’m a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...
1. I parse the original as, “a collection of EY’s thoughts on why safe AI is hard”. It’s EY’s thoughts, why would someone else (other than @robbensinger) write a collection of EY’s thoughts?
(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or …?)
2. Was there anything new in this doc? It’s prob useful to collect all in one place, but we don’t ask, “why did no one else write this” for every bit of useful writing out there, right?
Why was it so overwhelmingly important that someone write this summary at this time, that we’re at all scratching our heads about why no one else did it?
My shoulder Eliezer (who I agree with on alignment, and who speaks more bluntly and with less hedging than I normally would) says:
The list is true, to the best of my knowledge, and the details actually matter.
Many civilizations try to make a canonical list like this in 1980 and end up dying where they would have lived just because they left off one item, or under-weighted the importance of the last three sentences of another item, or included ten distracting less-important items.
There are probably not many civilizations that wait until 2022 to make this list, and yet survive.
It’s true that many of the points in the list have been made before. But it’s very doomy that they were made by me.
Nearly all of the field’s active alignment research is predicated on a false assumption that’s contradicted by one of the items in sections A or B. If the field had recognized everything in A and B sooner, we could have put our recent years of effort into work that might actually help on the mainline, as opposed to work that just hopes a core difficulty won’t manifest and has no Plan B for what to do when reality says “no, we’re on the mainline”.
So the answer to ‘Why would someone else write EY’s thoughts?’ is ‘It has nothing to do with an individual’s thoughts; it’s about civilizations needing a very solid and detailed understanding of what’s true on these fronts, or they die’.
Re “(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or …?)”:
The point is not ‘humanity needs to write a convincing-sounding essay for the thesis Safe AI Is Hard, so we can convince people’. The point is ‘humanity needs to actually have a full and detailed understanding of the problem so we can do the engineering work of solving it’.
If it helps, imagine that humanity invents AGI tomorrow and has to actually go align it now. In that situation, you need to actually be able to do all the requisite work, not just be able to write essays that would make a debate judge go ‘ah yes, well argued.’
When you imagine having water cooler arguments about the importance of AI alignment work, then sure, it’s no big deal if you got a few of the details wrong.
When you imagine actually trying to build aligned AGI the day after tomorrow, I think it comes much more into relief why it matters to get those details right, when the “details” are as core and general as this.
I think that this is a really good exercise that more people should try. Imagine that you’re running a project yourself that’s developing AGI first, in real life. Imagine that you are personally responsible for figuring out how to make the thing go well. Yes, maybe you’re not the perfect person for the job; that’s a sunk cost. Just think about what specific things you would actually do to make things go well, what things you’d want to do to prepare 2 years or 6 years in advance, etc.
Try to think your way into near-mode with regard to AGI development, without thereby assuming (without justification) that it must all be very normal just because it’s near. Be able to visualize it near-mode and weird/novel. If it helps, start by trying to adopt a near-mode, pragmatic, gearsy mindset toward the weirdest realistic/plausible hypothesis first, then progress to the less-weird possibilities.
I think there’s a tendency for EAs and rationalists to instead fall into one of these two mindsets with regard to AGI development, pivotal acts, etc.:
Fun Thought Experiment Mindset. On this mindset, pivotal acts, alignment, etc. are mentally categorized as a sort of game, a cute intellectual puzzle or a neat thing to chat about.
This is mostly a good mindset, IMO, because it makes it easy to freely explore ideas, attend to the logical structure of arguments, brainstorm, focus on gears, etc.
Its main defect is a lack of rigor and a more general lack of drive: because on some level you’re not taking the question seriously, you’re easily distracted by fun, cute, or elegant lines of thought, and you won’t necessarily push yourself to red-team proposals, spontaneously take into account other pragmatic facts/constraints you’re aware of from outside the current conversational locus, etc. The whole exercise sort of floats in a fantasy bubble, rather than being a thing people bring their full knowledge, mental firepower, and lucidity/rationality to bear on.
Serious Respectable Person Mindset. Alternatively, when EAs and rationalists do start taking this stuff seriously, I think they tend to sort of turn off the natural flexibility, freeness, and object-levelness of their thinking, and let their mind go to a very fearful or far-mode place. The world’s gears become a lot less salient, and “Is it OK to say/think that?” becomes a more dominant driver of thought.
Example: In Fun Thought Experiment Mindset, IME, it’s easier to think about governments in a reductionist and unsentimental way, as specific messy groups of people with specific institutional dysfunctions, psychological hang-ups, etc. In Serious Respectable Person Mindset, there’s more of a temptation to go far-mode, glom on to happy-sounding narratives and scenarios, or even just resist the push to concretely visualize the future at all—thinking instead in terms of abstract labels and normal-sounding platitudes.
The very fact that EA and rationalism mostly managed to avert their gaze from the concept of “pivotal acts” for years is, in my opinion, an example of how these two mindsets often fail.
“In the endgame, AGI will probably be pretty competitive, and if a bunch of people deploy AGI then at least one will destroy the world” is a thing I think most LWers and many longtermist EAs would have considered obvious. As a community, however, we mostly managed to just-not-think the obvious next thought, “In order to prevent the world’s destruction in this scenario, one of the first AGI groups needs to find some fast way to prevent the proliferation of AGI.”
Fun Thought Experiment Mindset, I think, encouraged this mental avoidance because it thought of AGI alignment (to some extent) as a fun game in the genre of “math puzzle” or “science fiction scenario”, not as a pragmatic, real-world dilemma we actually have to solve, taking into account all of our real-world knowledge and specific facts on the ground. The ‘rules of the game’, many people apparently felt, were to think about certain specific parts of the action chain leading up to an awesome future lightcone, rather than taking ownership of the entire problem and trying to figure out what humanity should in-real-life do, start to finish.
(What primarily makes this weird is that many alignment questions crucially hinge on ‘what task are we aligning the AGI on?’. These are not remotely orthogonal topics.)
Serious Respectable Person Mindset, I think, encouraged this mental avoidance more actively, because pivotal acts are a weird and scary-sounding idea once you leave ‘these are just fun thought experiments’ land.
What I’d like to see instead is something like Weirdness-Tolerant Project Leader Mindset, or Thought Experiments Plus Actual Rigor And Pragmatism And Drive Mindset, or something.
I think a lot of the confusion around EY’s post comes from thinking of these posts (on some level) as fun debate fodder or persuasion/outreach tools, rather than attending to the fact that humanity has to actually align AGI systems if we’re going to make it through this, and that this post is an attempt by humanity to distill where we’re currently at, so we can actually proceed to go solve alignment right now and save the world.
Imagine that this is v0 of a series of documents that need to evolve into humanity’s (/ some specific group’s) actual business plan for saving the world. The details really, really matter. Understanding the shape of the problem really matters, because we need to engineer a solution, not just ‘persuade people to care about AI risk’.
If you disagree with the OP… that’s pretty important! Share your thoughts. If you agree, that’s important to know too, so we can prioritize some disagreements over others and zero in on critical next actions. There’s a mindset here that I think is important, one that isn’t about “agree with Eliezer on arbitrary topics” or “stop thinking laterally”; it’s about approaching the problem seriously: falling into neither despair nor wishful thinking, neither far-mode nor forced normality, neither impracticality nor propriety.
There are probably not many civilizations that wait until 2022 to make this list, and yet survive.
I don’t think making this list in 1980 would have been meaningful. How do you offer any sort of coherent, detailed plan for dealing with something when all you have is toy examples like ELIZA?
We didn’t even have the concept of machine learning back then—everything computers did in 1980 was relatively easily understood by humans, in a very basic step-by-step way. Making a 1980s computer “safe” is a trivial task, because we hadn’t yet developed any technology that could do something “unsafe” (i.e. beyond our understanding). A computer in the 1980s couldn’t lie to you, because you could just inspect the code and memory and find out the actual reality.
What makes you think this would have been useful?
Do we have any historical examples to guide us in what this might look like?
I think most worlds that successfully navigate AGI risk have properties like:
AI results aren’t published publicly, going back to more or less the field’s origin.
The research community deliberately steers toward relatively alignable approaches to AI, which includes steering away from approaches that look like ‘giant opaque deep nets’.
This means that you need to figure out what makes an approach ‘alignable’ earlier, which suggests much more research on getting deconfused regarding alignable cognition.
Many such deconfusions will require a lot of software experimentation, but the kind of software/ML that helps you learn a lot about alignment as you work with it is itself a relatively narrow target that you likely need to steer towards deliberately, based on earlier, weaker deconfusion progress. I don’t think having DL systems on hand to play with has helped humanity learn much about alignment thus far, and by default, I don’t expect humanity to get much more clarity on this before AGI kills us.
Researchers focus on trying to predict features of future systems, and trying to get mental clarity about how to align such systems, rather than focusing on ‘align ELIZA’ just because ELIZA is the latest hot new thing. Make and test predictions, back-chain from predictions to ‘things that are useful today’, and pick actions that are aimed at steering — rather than just wandering idly from capabilities fad to capabilities fad.
(Steering will often fail. But you’ll definitely fail if you don’t even try. None of this is easy, but to date humanity hasn’t even made an attempt.)
In this counterfactual world, deductive reasoners and expert systems were only ever considered a set of toy settings for improving our intuitions, never a direct path to AGI.
(I.e., the civilization was probably never that level of confused about core questions like ‘how much of cognition looks like logical deduction?’; their version of Aristotle or Plato, or at least Descartes, focused on quantitative probabilistic reasoning. It’s an adequacy red flag that our civilization was so confused about so many things going into the 20th century.)
To me, all of this suggests a world where you talk about alignment before you start seeing crazy explosions in capabilities. I don’t know what you mean by “we didn’t even have the concept of machine learning back then”, but I flatly don’t buy that the species that landed on the Moon isn’t capable of generating (a more disjunctive version of) the OP’s semitechnical concerns pre-AlexNet.
You need the norm of ‘be able to discuss things before you have overwhelming empirical evidence’, and you need the skill of ‘be good at reasoning about such things’, in order to solve alignment at all; so it’s a no-brainer that not-wildly-incompetent civilizations at least attempt literally any of this.
“most worlds that successfully navigate AGI risk” is kind of a strange framing to me.
For one thing, it represents p(our world | success) and we care about p(success | our world). To convert between the two you of course need to multiply by p(success) / p(our world). What’s the prior distribution of worlds? This seems underspecified.
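To spell out the conversion being gestured at here (just Bayes’ rule, treating “success” and “our world” as the two events; the notation is mine, matching the parent paragraph):

$$P(\text{success} \mid \text{our world}) \;=\; P(\text{our world} \mid \text{success}) \cdot \frac{P(\text{success})}{P(\text{our world})}$$

The “most surviving worlds look like X” observation only constrains the first factor on the right; without some prior over worlds, the quantity on the left, which is the one we actually care about, stays underspecified.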
For another, using the methodology “think about whether our civilization seems more competent than the problem is hard” or “whether our civilization seems on track to solve the problem” I might have forecast nuclear annihilation (not sure about this).
The methodology seems to work when we’re relatively certain about the level of difficulty on the mainline, so if I were more sold on that I would believe this more. It would still feel kind of weird though.
I don’t think making this list in 1980 would have been meaningful. How do you offer any sort of coherent, detailed plan for dealing with something when all you have is toy examples like ELIZA?
I mean, I think many of the computing pioneers ‘basically saw’ AI risk. I noted some surprise that I. J. Good didn’t write the precursor to this list in 1980; apparently Wikipedia claims there’s an unpublished statement of his from 1998 about AI x-risk, and it’d be interesting to see what it contains and how much it does or doesn’t line up with our modern conception of why the problem is hard.
The historical figures who basically saw it (George Eliot 1879: “will the creatures who are to transcend and finally supersede us be steely organisms [...] performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy?”; Turing 1951: “At some stage therefore we should have to expect the machines to take control”) seem to have done so in the spirit of speculating about the cosmic process. The idea of coming up with a plan to solve the problem is an additional act of audacity; that’s not really how things have ever worked so far. (People make plans about their own lives, or their own businesses; at most, a single country; no one plans world-scale evolutionary transitions.)
I’m tempted to call this a meta-ethical failure. Fatalism, universal moral realism, and just-world intuitions seem to be the underlying implicit heuristics or principles that would cause this “cosmic process” thought-blocker.
Imagine that this is v0 of a series of documents that need to evolve into humanity’s (/ some specific group’s) actual business plan for saving the world.
Why is this v0 and not https://arbital.com/explore/ai_alignment/, or the Sequences, or any of the documents that Evan links to here?

That’s part of what I meant to be responding to — not that this post is not useful, but that I don’t see what makes it so special compared to all the other stuff that Eliezer and others have already written.
To put it another way, I would agree that Eliezer has made (what seem to me like) world-historically significant contributions to understanding AI risk and advocating for taking it seriously.
So, if 2007 Eliezer was asking himself, “Why am I the only one really looking into this?”, I think that’s a very reasonable question.
But here in 2022, I just don’t see this particular post as that significant of a contribution compared to what’s already out there.
Wrote a long comment here. (Which you’ve seen, but linking since your comment started as a response to me.)