Recently, a discussion of potential AGI interventions and potential futures was posted to LessWrong. The picture Eliezer presented was broadly consistent with my existing model of Eliezer’s model of reality, and most of it was also consistent with my own model of reality.
Those two models overlap a lot, but they are different and my model of Eliezer strongly yells at anyone who thinks they shouldn’t be different that Eliezer wrote a technically not infinite but rather very large number of words explaining that you need to think for real and evaluate such things for yourself. On that, our models definitely agree.
It seemed like a useful exercise to reread the transcript of Eliezer’s discussion, and explicitly write out the world model it seems to represent, so that’s what I’m going to do here.
Here are some components of Eliezer’s model, directly extracted from the conversation, rewritten to be third person. It’s mostly in conversation order but a few things got put in logical order.
Before publishing, I consulted with Rob Bensinger, who helped refine several statements to be closer to what Eliezer actually endorses. I explicitly note the changes where they involve new info, so it’s clear what is coming from the conversation and what is coming from elsewhere. In other places it caused me to clean up my wording, which isn’t noted. It’s worth pointing out that the corrections often pointed in the ‘less doom’ direction, both in explicit claims and in tone/implication, so chances are this comes off as generally implying more doom than is appropriate.
Nate rather than Eliezer, but offered as a preface with p~0.85: AGI is probably coming within 50 years. Rob notes that Eliezer may or may not agree with this timeline, and that it shortens if you condition on ‘unaligned’ and lengthens conditional on ‘aligned.’
By default this AGI will come from something similar to some of today’s ML paradigms. Think enormous inscrutable floating-point vectors.
AGI that isn’t aligned ends the world.
AGI that isn’t aligned carefully and on purpose isn’t aligned, period.
It may be possible to align an AGI carefully and have it not end the world.
Right now we don’t know how to do it at all, but in theory we might learn.
The default situation is an AGI system arises that can be made more powerful by adding more compute, and there’s an extended period where it’s not aligned yet and if you add too much compute the world ends, but it’s possible that if you had enough time to work on it and no one did that, you’d have a shot.
More specifically, when combined with other parts of the model detailed later: “I think we’re going to be staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version, while standing on top of a whole bunch of papers about “small problems” that never got past “small problems”.”
If we don’t learn how to align an AGI via safety research, nothing else can save us period.
Thus, all scenarios where we win are based on a technical surprising positive development of unknown shape, and all plans worth having should assume such a surprising positive development is possible in technical space. In the post this is called a ‘miracle’ but this has misleading associations – it was not meant to imply a negligible probability, only surprise, so Rob suggested changing it to ‘surprising positive development.’ Which is less poetic and longer, but I see the problem.
Eliezer does know a lot of ways not to align an AGI, which is helpful (e.g. Edison knew a lot of ways not to build a light bulb) but also isn’t good news.
Carefully aligning an AGI would at best be slow and difficult, requiring years of work, even if we did know how.
Before you could hope to finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Rob clarifies that good info security still matters and can meaningfully buy you time, and suggests this: “By default (absent strong op-sec and research closure), you should expect that before you can finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Likewise, by default (absent research closure and a large technical edge), you should expect that other projects will independently figure out how to build AGI shortly after you do.”
There are a few players who we might expect to choose not to end the world like Deepmind or Anthropic, but only a few. There are many actors, each of whom might or might not end the world in such a spot (e.g. home hobbyists or intelligence agencies or Facebook AI research), and it only takes one of them.
Keeping the code and insights involved secret and secure over an extended period is a level of social technology no ML group is close to having. I read the text as making the stronger claim that we lack the social technology for groups of sufficient size to keep this magnitude of secret for the required length of time, even with best known practices.
Trying to convince the folks that would otherwise destroy the world that their actions would destroy the world isn’t impossible on some margins, so in theory some progress could be made, and some time could be bought, but not enough to buy enough time.
Most reactions to such problems by such folks, once their attention is drawn to them, would make things worse rather than better. Tread carefully or not at all, and trying to get the public into an uproar seems worse than useless.
Trying to convince various projects to become more closed rather than open is possible, and (as per Rob) a very good idea if you would actually succeed, but insufficient.
Trying to convince various projects to join together in the endgame, if we were to get to one, is possible, but also insufficient and (as per Rob) matters much less than becoming more closed now.
Closed and trustworthy projects are the key to potentially making technical progress in a safe and useful way. There needs to be a small group that can work on a project and that wouldn’t publish the resulting research or share its findings automatically with a broader organization, via sufficiently robust subpartitions.
Anthropic in particular doesn’t seem open to alternative research approaches and mostly wants to apply More Dakka, and doesn’t seem open to sufficiently robust subpartitions, but those could both change.
Deepmind in particular is a promising potential partner if they could form the required sufficiently robust subpartitions, even if Demis must be in the loop.
OpenAI as a concept (rather than the organization with that name), is a maximally bad concept almost designed to make the playing field as unwinnable as possible, details available elsewhere. Of course, the organization itself could change (with or without being renamed to ClosedAI).
More generally, publishing findings burns the common resource ‘time until AGI’ and the more detail you publish about your findings along {quiet internal result → announced and demonstrated result → paper describing how to get the announced result → code for the result → model for the result} the more of it you burn, but the more money and prestige the researchers get for doing that.
One thing that would be a big win would be actual social and corporate support for subpartitioned projects that didn’t publish their findings, where it didn’t cost lots of social and weirdness points for the researchers, thus allowing researchers to avoid burning the commons.
Redwood Research (RR) is a new research organization that’s going to try and do alignment experiments on toy problems to learn things, in ways people like Eliezer think are useful and valuable and that they wish someone would do. Description not directly from Eliezer but in context seems safe to assume he roughly agrees.
Previously (see Hanson/Eliezer FOOM debate) Eliezer thought you’d need recursive self-improvement first to get fast capability gain, and now it looks like you can get fast capability gain without it, for meaningful levels of fast. This makes ‘hanging out’ at interesting levels of AGI capability at least possible, since it wouldn’t automatically keep going right away.
An AGI that was above humans in all respects would doubtless FOOM anyway, but if ahead in only some it might not.
Trying to set special case logic to tell AGIs to believe false generalizations with a lot of relevance to mapping or steering the world won’t work, they’d notice and fix it.
Manipulating humans is a convergent instrumental strategy.
Hiding what you are doing is a convergent instrumental strategy.
Eliezer expects that when people are trying to stomp out convergent instrumental strategies by training at a safe dumb level of intelligence, this will not be effective at preventing convergent instrumental strategies at smart levels of intelligence.
You have to train in safe domains because if you train in unsafe domains you die, but the solutions you find in safe domains won’t work in unsafe domains.
Attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes.
Explanation of above part 1: Higher levels of intelligence involve qualitatively new thought processes and things being way out of training distribution.
Explanation of above part 2: Corrigibility is ‘anti-natural’ in a certain sense that makes it incredibly hard to, e.g., exhibit any coherent planning behavior (“consistent utility function”) that corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.
Trying to hardcode nonsensical assumptions or arbitrary rules into an AGI will fail because a sufficiently advanced AGI will notice that they are damage and route around them or fix them (paraphrase).
You only get one shot, because the first miss kills you, and your chances of pulling many of these things off on the first try are basically zero, unless (Rob suggests this) you can basically ‘read off’ what the AI is thinking. Nothing like this that involves black boxes ever works the first time. Alignment is hard largely because of ‘you only get one shot.’
Nothing we can do with a safe-by-default AI like GPT-3 would be powerful enough to save the world (to ‘commit a pivotal act’), although it might be fun. In order to use an AI to save the world it needs to be powerful enough that you need to trust its alignment, which doesn’t solve your problem.
Nanosystems are definitely possible, if you doubt that read Drexler’s Nanosystems and perhaps Engines of Creation and think about physics. They’re a core thing one could and should ask an AI/AGI to build for you in order to accomplish the things you want to accomplish.
No existing suggestion for “Scalable Oversight” seems to solve any of the hard problems involved in creating trustworthy systems.
An AGI would be able to argue for/’prove’ arbitrary statements to the satisfaction of humans, including falsehoods.
Furthermore, an unaligned AGI powerful enough to commit pivotal acts should be assumed to be able to hack any human foolish enough to interact with it via a text channel.
The speedup step in “iterated amplification and distillation” will introduce places where the fast distilled outputs of slow sequences are not true to the original slow sequences, because gradient descent is not perfect and won’t be perfect and it’s not clear we’ll get any paradigm besides gradient descent for doing a step like that.
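As a toy illustration of why the distillation step loses fidelity (this is my sketch, not the actual iterated amplification proposal; `slow_amplified` and the linear surrogate are made-up stand-ins): gradient descent can only find the best imitator within the surrogate’s function class, and that best imitator can remain far from the slow process it is distilling, no matter how long training runs.

```python
import math
import random

def slow_amplified(x):
    # Stand-in for an expensive "amplified" computation (hypothetical).
    return math.sin(3 * x) + 0.5 * x

# "Distill" it into a fast linear surrogate via gradient descent on MSE.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [slow_amplified(x) for x in xs]
w, b = 0.0, 0.0
for _ in range(2000):
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

# The distilled model is systematically unfaithful to the slow process:
# the worst-case gap stays large even after training converges, because
# the surrogate's function class cannot represent the original.
residual = max(abs(w * x + b - y) for x, y in zip(xs, ys))
```

In a real system the gap isn’t a clean approximation error on a known function; it’s an unknown divergence in an inscrutable model, which is the part that bites.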
The safety community currently is mostly bouncing off the hard problems and are spending most of their time working on safe, easy, predictable things that guarantee they’ll be able to publish a paper at the end. Actually-useful alignment research will tend to be risky and unpredictable, since it’s advancing the frontier of our knowledge in a domain where we have very little already-accumulated knowledge.
Almost all other work is either fully useless, almost entirely predictable, or both.
Paul Christiano is trying to have real foundational ideas, and they’re all wrong, but he’s one of the few people trying to have foundational ideas at all; if we had another 10 of him, something might go right.
Chris Olah is going to get far too little done far too late but at least is trying to do things on a path to doing anything at all.
Stuart Armstrong did some good work on further formalizing the shutdown problem, an example case in point of why corrigibility is hard, which so far as I know is still resisting all attempts at solution.
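The flavor of the shutdown problem can be shown with a deliberately crude toy (my construction, not Armstrong’s formalism; the names and numbers are made up): a plain expected-reward maximizer prices its own off-switch as a cost to be removed.

```python
def best_action(R=10.0, p_shutdown=0.5, disable_cost=1.0):
    """Expected value each option offers a naive reward-maximizer.
    'comply': leave the off-switch alone; it fires with p_shutdown,
              halting the task and forfeiting the reward R.
    'disable': pay disable_cost to remove the switch, finish for sure."""
    values = {
        "comply": (1 - p_shutdown) * R,
        "disable": R - disable_cost,
    }
    return max(values, key=values.get)
```

The maximizer disables its own off-switch whenever `disable_cost < p_shutdown * R`; incorrigibility falls straight out of expected-utility maximization, nothing exotic required. Armstrong-style ‘utility indifference’ tries to engineer the two branches to be valued equally so the agent neither resists nor seeks shutdown, and making that coherent is the part that has resisted solution.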
Various people who work or worked for MIRI came up with some actually-useful notions here and there, like Jessica Taylor’s expected utility quantilization.
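For concreteness, the core move of quantilization can be sketched in a few lines (a toy rendering of the idea, not Taylor’s formal definition; here the base distribution is simply uniform over a finite candidate list):

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample an action from the top q fraction of `actions`, ranked by
    estimated utility, instead of taking the argmax. Capping optimization
    pressure this way bounds how badly errors in the utility estimate
    can be exploited by extreme, adversarially-scoring actions."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:k])

# e.g. sample from the top decile of 100 candidate actions
pick = quantilize(list(range(100)), lambda a: a, q=0.1)
```

The design choice is the tradeoff: lower `q` means more optimization power and more exposure to utility-function errors; `q = 1` recovers the safe but useless base distribution.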
We need much, much more rapid meaningful progress than this to have any chance, and it’s not obvious how to do that, or how to use money usefully. Money by default produces more low-quality work, and low-quality work slash solving small problems rather than the hard problems isn’t quite useless but it’s not going to get us where we need to go.
The AGI approaches that matter are the ones that scale, so they probably look less like GPT-2 and more like Alpha Zero, AlphaFold 2 or in particular Mu Zero.
Proving theorems about the AGI doesn’t seem practical. Even if we somehow managed to get structures far more legible than giant vectors of floats, using some AI paradigm very different from the current one, it still seems like huge key pillars of the system would rely on non-fully-formal reasoning.
Zvi infers this from the text, rather than it being text directly, and it’s possible it’s due to conflating things together and wasn’t intended: A system that is mathematically understood and you can prove lots of stuff about it is not on the table at this point. Agent Foundations is a failure. Everything in that direction is a failure.
Even if you could prove what the utility function was, getting it to actually represent a human-aligned thing when it counts still seems super hard even if it doesn’t involve a giant inscrutable vector of floats, and it probably does involve that.
Eliezer agrees that it seems plausible that the good cognitive operations we want do not in principle require performing bad cognitive operations; the trouble, from his perspective, is that generalizing structures that do lots of good cognitive operations will automatically produce bad cognitive operations, especially when we dump more compute into them; “you can’t bring the coffee if you’re dead”. No known way to pull this off.
Proofs mostly miss the point. Prove whatever you like about that Tensorflow problem; it will make no difference to whether the AI kills you. The properties that can be proven just aren’t related to safety, no matter how many times you prove an error bound on the floating-point multiplications. It wasn’t floating-point error that was going to kill you in the first place.
Now to put the core of that into simpler form, and excluding non-central details, in a more logical order.
Again, this is my model of Eliezer’s model, statements are not endorsed by me, I agree with many but not all of them.
Claim from Nate rather than Eliezer, unclear if Eliezer agrees: AGI is probably coming (p~85%) within 50 years.
AGI that is not aligned ends the world.
Humanity only gets one shot at this. If we fail, we die and can’t try again.
Almost nothing ever succeeds on its first try.
We currently have no idea how to do it at all.
Current alignment methods all fail and we don’t even have good leads to solving the hard questions that matter.
AIs weak enough to be safe-by-default lack sufficient power to solve these problems.
It would take a surprising positive technical development to find a way to do alignment at all.
So all reasonable plans to align an AGI assume at least one surprising positive and technical development.
Current pace of useful safety research is much slower than needed to keep pace with capabilities research.
Even if we did get a surprising positive technical development that let us find a way to proceed, it would probably take additional years to do that rather than turn the AGI on and end the world. Rob Bensinger clarifies that Eliezer’s exact stance is instead: “An aligned advanced AI created by a responsible project that is hurrying where it can, but still being careful enough to maintain a success probability greater than 25%, will take the lesser of (50% longer, 2 years longer) than would an unaligned unlimited superintelligence produced by cutting all possible corners.”
That’s because the AGI we need to align is likely an enormous inscrutable pile of floating-point vectors. Which makes it harder.
AGI likely comes from algorithms that scale with compute, so less like GPT-X and more like Mu Zero.
Such algorithms have to be aligned somehow before anyone scales them up too much, since that would end the world.
Rob’s rewording: In the meantime, it’s likely that the code and/or conceptual insights would leak out, absent a large, competent effort to prevent this. No leading ML organization currently seems to be putting in the required effort. Zvi’s note: I interpreted the relevant claim here as something stronger, that humanity lacks the social technology to do more than probabilistically postpone such a leak even under best practices given the likely surrounding conditions, and that no leading organizations are even doing anything resembling or trying to resemble best practices.
If it were to leak, someone somewhere would run the code and end the world. There are people who probably would know better than to scale it up and end the world, like Deepmind and Anthropic, but it wouldn’t take long for many others to get the code, and then it only takes one someone who didn’t know better (like an intelligence agency) to end the world anyway.
Most people working on safety are working on small problems rather than hard problems, or doing work with predictable outcomes, because incentives, and are therefore mostly useless (or worse).
There are exceptions (Paul Christiano, Chris Olah, Stuart Armstrong and some MIRI-associated people) but they are exceptions and we need vastly more of them.
AI work that is shared or published accelerates AGI.
The more details are shared, the more acceleration happens.
Everyone publishes anyway, in detail, because incentives.
Incentive changes to fix this would need to be sufficiently robust that not publishing wouldn’t hurt your career prospects or cost you points, or they won’t work.
Fixing incentives on publishing, and otherwise making more things more closed, would be helpful.
Ability to do subpartitioned/siloed projects within research organizations (including Deepmind and Anthropic), that would actually stay meaningfully secret, would be helpful.
Research that improves interpretability a lot (like Chris Olah is trying to do, but with faster progress) would be very helpful. Creative new deep alignment ideas (similar to Paul Christiano’s work in depth and novelty, but not in the Paul-paradigm) would be very helpful.
Certain kinds of especially-valuable alignment experiments using present-day ML systems, like the experiments run by Redwood Research, would be helpful.
Nanotechnology is definitely physically doable and a convergent instrumental strategy, see Drexler.
Manipulating humans is a convergent instrumental strategy.
Hiding what you are doing is a convergent instrumental strategy.
Higher intelligence AGIs use qualitatively new thought processes that lie outside your training distribution.
An unaligned AGI would be able to hack any human foolish enough to read its text messages or other outputs, ‘prove’ arbitrary statements to human satisfaction, etc.
Corrigibility is ‘anti-natural’ and incredibly hard. Corrigibility solutions for less intelligent AGIs won’t transfer to higher intelligence AGIs.
‘Scalable oversight’ as proposed so far doesn’t solve any of the hard problems.
‘Iterated amplification and distillation’ based on gradient descent would be imperfect and the nice properties you’re trying to preserve would fail. Currently we have no alternate approach.
Agent Foundations and similar mathematical approaches seem to be dead ends.
‘Good’ cognitive operations grouped together and scaled automatically produce ‘bad’ cognitive operations. You can’t deliver the coffee if you’re dead, etc.
Getting a fixed utility function into an AGI at all is super hard+, getting a utility function to represent human values is super hard+, giant floating point vectors make both harder still.
The stuff you can prove doesn’t prove anything that matters, the stuff that would prove anything that matters you can’t prove.
Special case nonsensical logic or arbitrary rules will be interpreted by an AGI as damage and routed around.
Recursive self-improvement seems not to be required for fast capability gains. This means having a powerful but not self-improving or world-ending AGI is at least possible.
An AGI better at everything than humans FOOMs anyway.
Worth noting that the more precise version of 12 is substantially more optimistic than 12 as stated explicitly here.
Looking at these 42 claims, I notice my inside view mostly agrees, and would separate them into:
Inside view disagreement but seems plausible: 1
Inside view lacks sufficient knowledge to offer an opinion: 28 (I haven’t looked for myself)
Inside view isn’t sure: 8 (if we add ‘using current ideas’ this moves to strong agreement), 13, 36
Weak inside view agreement – seems probably true not counting Eliezer’s opinion, but I wouldn’t otherwise be confident: 7, 9, 10, 22, 34, 35, 40
Strong inside view agreement: 2, 3, 4, 5, 6, 11, 12 (original version would be weak agreement, revised version is strong agreement), 14 (conditional on 13), 15, 16 (including the stronger version), 17, 18, 19 (in general, not for specific people), 20, 21, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 37, 38, 39 (unless a bunch of other claims also break first), 41, 42
Thus, I have inside view agreement (e.g. I substantively agree with this picture without taking into account anyone’s opinion) on 37 of the 42 claims, including many that I believe to be ‘non-obvious’ on first encounter.
That leaves 5 remaining claims.
For 28 (Nanotechnology) I think it’s probably true, but I notice I’m counting on others’ models of the technology that would be involved, so I want to be careful to avoid an information cascade, but my outside view strongly agrees.
For 8 (Safe AIs lack the power to save us): it would require a surprising positive development for this to be wrong, in the sense that no currently proposed methods seem like they’d work. But I notice I instinctively remain hopeful for such a development, and for a solution to be found. I’m not sure how big a disagreement exists here; there might not be one.
That leaves 1 (85% AGI by 2070), 13 (AGI is a giant pile of floating-point vectors) and 36 (Agent Foundations and similar are dead ends) which are largely the same point of doubt.
Which is basically this: I notice my inside view, while not confident in this, continues to not expect current methods to be sufficient for AGI, and expects the final form to be more different than I understand Eliezer/MIRI to think it is going to be, and that the AGI problem (not counting alignment, where I think we largely agree on difficulty) is ‘harder’ than Eliezer/MIRI think it is.
For 36 (Agent Foundations) in particular: I notice a bunch of people in the comments saying Agent Foundations isn’t/wasn’t important, and seemed like a non-useful thing to pursue, and if anything I’m on the flip side of that and am worried it and similar things were abandoned too quickly rather than too late. It’s a case of ‘this probably will do nothing and look stupid but might do a lot or even be the whole ballgame’ and that being hard to sustain even for a group like MIRI in such a spot, but being a great use of resources in a world where things look very bad and all solutions assume surprising (and presumably important) positive developments. Everybody go deep.
For 1 (probability of AGI) in particular: in addition to my inside view probably thinking AGI is harder than Eliezer/MIRI think it is, I also think civilization’s dysfunctions are more likely to disrupt things and make it increasingly difficult to do anything at all, or anything new/difficult, and that collapse or other disasters are also more likely. I know Nate Soares explicitly rejects this mattering much, but it matters quite a bit to my inside view. I don’t have an inside view point estimate, but if I could somehow bet utility (betting money really, really doesn’t work here, at all) and could only bet once, I notice I’d at least buy 30% and sell 80%, or something like that.
Also, I noticed two interrelated things that I figured are worth noting from the comments:
In the comments to the OP, Eliezer’s comments about small problems versus hard problems got condensed down to ‘almost everyone working on alignment is faking it.’ I think that is not only uncharitable, it’s importantly a wrong interpretation, motivated by viewing the situation (unconsciously?) through the lens of a battle over status and authority and blame, rather than as how to collectively win from a position on a game board. The term ‘faking’ here turns a claim of ‘approaches that are being taken mostly have epsilon probability of creating meaningful progress’ into a social claim about the good faith of those doing said research, which is then interpreted as a social attack, and then as an argument from authority and a status claim, as opposed to pointing out that such moves don’t win the game and we need to play to win the game. I see Eliezer as highly sympathetic to how this type of work ends up dominating, and as seeing the problem as structural incentives that need to be fixed (hence my inclusion of ‘because incentives’ above), combined with genuine disagreement about the state of the game board. ‘Faking it’ is shorthand for ‘you know what you’re doing isn’t real/useful and are doing it anyway’ and introduces the accusation that drives the rest of the logical sequence. And Eliezer kind of wrote a whole sequence about exactly this, which I consider so important that a quote from it is my Twitter bio.
Eliezer is being accused of making an argument from authority using authority he doesn’t deserve, or in a way that is disruptive, and I assume every time he sees that, or anyone saying anything like “Eliezer thinks X therefore it’s irrational to not think X too” or “Who are you (or am I) to disagree with the great Eliezer?”, he’s tearing his hair out that he spent years writing the definitive book about why “think for yourself, shmuck” is the way to go. I feel his pain, to a lesser degree. I had a conversation last Friday in which my post on the ports (where I explain how I want to generate a system of selective amplification, with everyone thinking for themselves in levels so as to amplify true and useful things over untrue and useless things) was interpreted (by a rather smart and careful reader!) as a request for a norm that people amplify messages without reading them carefully or evaluating whether they seem true, the way certain camps do for any message with the correct tribal coloring. And again, arrrggghh, pull hair out, etc.
Note that there is a quote from Eliezer using the term “fake”:
It could certainly be the case that Eliezer means something else by the word “fake” than the commenters mean when they use the word “fake”; it could also be that Eliezer thinks that only a tiny fraction of the work is “fake” and most is instead “pointless” or “predictable”, but the commenters aren’t just creating the term out of nowhere.
Right, so it didn’t come completely out of nowhere, but it still seems uncharitable at best to go from ‘mostly fake or pointless or predictable’ where mostly is clearly modifying the collective OR statement, to ‘almost everyone else is faking it.’
EDIT: Looks like there’s now a comment apologizing for, among other things, exactly this change.
It also seems uncharitable to go from (A) “exaggerated one of the claims in the OP” to (B) “made up the term ‘fake’ as an incorrect approximation of the true claim, which was not about fakeness”.
You didn’t literally explicitly say (B), but when you write stuff like
I think most (> 80%) reasonable people would take (B) away from your description, rather than (A).
Just to be totally clear: I’m not denying that the original comment was uncharitable, I’m pushing back on your description of it.
It’s not like this is the first time Eliezer has said “fake”, either:
If he viewed almost all alignment work as nonfake, it wouldn’t be worth noting in his praise of RR. I bring this up because “EY thinks most alignment work is fake” seems to me to be a non-crazy takeaway from the post, even if it’s not true.
(I also think that “totally unpromising” is the normal way to express “approaches that are being taken mostly have epsilon probability of creating meaningful progress”, not “fake.”)
Strong agreement with this. I remember being rather surprised and dismayed to hear that MIRI was pivoting away from agent foundations, and while I have more-than-substantial probability on them knowing something I don’t, from my current vantage point I can’t help but continue to feel that decision was premature.
My sense is that Eliezer and Nate (and I think some other researchers) updated towards shorter timelines in late 2016 / early 2017 (“moderately higher probability to AGI’s being developed before 2035” in our 2017 update). This then caused them to think AF was less promising than they’d previously thought, because it would be hard to solve AF by 2035.
On their model, as I understand it, it was good for AF research to continue (among other things, because there was still a lot of probability mass on ‘AGI is more than 20 years away’), and marginal AF progress was still likely to be useful even if we didn’t solve all of AF. And MIRI houses a lot of different views, including (AFAIK) quite different views on the tractability of AF, the expected usefulness of AF progress, the ways in which solving AF would likely be useful, etc. This wasn’t a case of ‘everyone at MIRI giving up on AF’, but it was a case of Eliezer and Nate (and some other researchers) deciding this wasn’t where they should put their own time, because it didn’t feel enough to them like ‘the mainline way things end up going well’.
My simplified story is that in 2017-2020, the place Nate and Eliezer put their time was instead the new non-public research directions Benya Fallenstein had started (which coincided with a big push to hire more people to work with Nate/Eliezer/Benya/etc. on this). In late 2020 / early 2021, Nate and Eliezer decided that this research wasn’t going fast enough, and that they should move on to other things (but they still didn’t think AF was the thing). Throughout all of this, other MIRI researchers like Scott Garrabrant (the research lead for our AF work) have continued to chug away on AF / embedded agency work.
This has been quite confusing even to me from the outside.
Other than the rocket alignment analogy and the general case for deconfusion helping, has anyone ever tried to describe with more concrete (though speculative) detail how AF would help with alignment? I’m not saying it wouldn’t. I just literally want to know if anyone has tried explaining this concretely. I’ve been following for a decade but don’t think I ever saw an attempted explanation.
Example I just made up:
Modern ML is in some sense about passing the buck to gradient-descent-ish processes to find our optimizers for us. This results in very complicated, alien systems that we couldn’t build ourselves, which is a worst-case scenario for interpretability / understandability.
If we better understood how optimization works, we might be able to do less buck-passing / delegate less of AI design to gradient-descent-ish processes.
Developing a better formal model of embedded agents could tell us more about how optimization works in this way, allowing the field to steer toward approaches to AGI design that do less buck-passing and produce less opaque systems.
This improved understandability then lets us do important alignment work like understanding what AGIs (or components of AGIs, etc.) are optimizing for, understanding what topics they’re thinking about and what domains they’re skilled in, understanding (and limiting) how much optimization they put into various problems, and generally constructing good safety-stories for why (given assumptions X, Y, Z) our system will only put cognitive work into things we want it to work on.
Thanks. This is great! I hadn’t thought of Embedded Agency as an attempt to understand optimization. I thought it was an attempt to ground optimizers in a formalism that wouldn’t behave wildly once they had to start interacting with themselves. But on second thought it makes sense to consider an optimizer that can’t handle interacting with itself to be a broken or limited optimizer.
I think another missing puzzle piece here is ‘the Embedded Agency agenda isn’t just about embedded agency’.
From my perspective, the Embedded Agency sequence is saying (albeit not super explicitly):
Here’s a giant grab bag of anomalies, limitations, and contradictions in our whole understanding of reasoning, decision-making, self-modeling, environment-modeling, etc.
A common theme in these ways our understanding of intelligence goes on the fritz is embeddedness.
The existence of this common theme (plus various more specific interconnections) suggests it may be useful to think about all these problems in light of each other; and it suggests that these problems might be surprisingly tractable, since a single sufficiently-deep insight into ‘how embedded reasoning works’ might knock down a whole bunch of these obstacles all at once.
The point (in my mind—Scott may disagree) isn’t ‘here’s a bunch of riddles about embeddedness, which we care about because embeddedness is inherently important’; the point is ‘here’s a bunch of riddles about intelligence/optimization/agency/etc., and the fact that they all sort of have embeddedness in common may be a hint about how we can make progress on these problems’.
This is related to the argument made in The Rocket Alignment Problem. The core point of Embedded Agency (again, in my mind, as a non-researcher observing from a distance) isn’t stuff like ‘agents might behave wildly once they get smart enough and start modeling themselves, so we should try to understand reflection so they don’t go haywire’. It’s ‘the fact that our formal models break when we add reflection shows that our models are wrong; if we found a better model that wasn’t so fragile and context-dependent and just-plain-wrong, a bunch of things about alignable AGI might start to look less murky’.
(I think this is oversimplifying, and there are also more direct value-adds of Embedded Agency stuff. But I see those as less core.)
The discussion of Subsystem Alignment in Embedded Agency is I think the part that points most clearly at what I’m talking about:
I am struck by two elements of this conversation, which this post helped me confirm did indeed stick out the way I thought they did (weigh this lightly if at all; I’m speaking from the motivated peanut gallery here).
A. Eliezer’s commentary around proofs has a whiff of Brouwer’s intuitionism about it, to me. This seems to be the case on two levels: first, the consistent ‘this is not what math is really about, and we are missing the fundamental point in a way that will cripple us’ tone; second, and on a more technical level, it seems very close to the intuitionist attitude about the law of the excluded middle. That is to say, Eliezer is saying pretty directly that what we need is P, and not-not-P is an unacceptable substitute because it is weaker.
B. That being said, I think Steve Omohundro’s observations about the provability of individual methods wouldn’t be dismissed in the counterfactual world where they didn’t exist; rather I expect that Eliezer would have included some line about how to top it all off, we don’t even have the ability to prove our methods mean what we say they do, so even if we crack the safety problem we can still fuck it up at the level of a logical typo.
C. The part about incentives being bad for researchers which drives too much progress, and lamenting that corporations aren’t more amenable to secrecy around progress, seems directly actionable and literally only requiring money. The solution is to found a ClosedAI (naturally not named anything to do with AI), go ahead and set those incentives, and then go around outbidding the FacebookAIs of the world for talent that is dangerous in the wrong hands. This has even been done before, and you can tell it will work because of the name: Operation Paperclip.
I really think Eliezer and co. should spend more time wish-listing about this, and then it should be solidified into a more actionable plan. Under entirely-likely circumstances, it would be easy to get money from the defense and intelligence establishments to do this, resolving the funding problem.
Thanks, your numbered list was very helpful in encouraging me to go through the claims. Just two things that stood out to me:
What exactly makes people sure that something like GPT would be safe/unsafe?
If what is needed is some form of insight/breakthrough: Some smarter version of GPT-3 seems really useful? The idea that GPT-3 produces better poetry than me while GPT-5 could help come up with better alignment ideas doesn’t strongly conflict with my current view of the world?
#12:
This might come across as optimistic if this were your median alignment difficulty estimate, but instead Eliezer is putting 95% on this, which on the flip side suggests a 5% chance that things turn out to be easier. This seems rather in line with “Carefully aligning an AGI would at best be slow and difficult, requiring years of work, even if we did know how.”
Is there something like the result of a survey of experts about the feasibility of Drexlerian nanotechnology? Is there any consensus among specialists about the possibility of a gray goo scenario?
Drexler and Yudkowsky both greatly overestimated the impact of molecular nanotechnology in the past.
Not an expert, but I think life is an existence proof for the power of nanotech, even if the specifics of a grey goo scenario seem unlikely. Trees turn sunlight and air into wood, ribosomes build peptides and proteins, and while current-generation models of protein folding are a ways from having generative capacity, it’s unclear how many breakthroughs stand between humanity and that general/generative capacity.
A survey of leading chemists would likely produce dismissals based on a strawmanned version of Drexler’s ideas. If you could survey people who demonstrably understood Drexler, I’m pretty sure they’d say it’s feasible, but critics would plausibly complain about selection bias.
The best analysis of gray goo risk seems to be Some Limits to Global Ecophagy by Biovorous Nanoreplicators, with Public Policy Recommendations.
They badly overestimated how much effort would get put into developing nanotech. That likely says more about the profitability of working on early-stage nanotech than it says about the eventual impact.
I don’t think anyone (e.g., at FHI or MIRI) is worried about human extinction via gray goo anymore.
Like, they expected nanotech to come sooner? Or something else? (What did they say, and where?)
The fate of the concept of nanotechnology has been a curious one. You had the Feynman/Heinlein idea of small machines making smaller machines until you get to atoms. There were multiple pathways towards control over individual atoms, from the usual chemical methods of bulk synthesis, to mechanical systems like atomic force microscopes.
But I think Eric Drexler’s biggest inspiration was simply molecular biology. The cell had been revealed as an extraordinary molecular structure whose parts included a database of designs (the genome) and a place of manufacture (the ribosome). What Drexler did in his books, was to take that concept, and imagine it being realized by something other than the biological chemistry of proteins and membranes and water. In particular, he envisaged rigid mechanical structures, often based on diamond (i.e. a lattice of carbons with a surface of hydrogen), often assembled in hard vacuum by factory-like nano-mechanisms, rather than grown in a fluid medium by redundant, fault-tolerant, stochastic self-assembly (as in the living cell).
Having seen this potential, he then saw this ‘nanotechnology’ as a way to do all kinds of transhuman things: make AI that is human-equivalent, but much smaller and faster (and hotter) than the human brain; grow a starship from a molecularly precise 3d printer in an afternoon; resurrect the cryonically suspended dead. And also, as a way to make replicating artificial life that could render the earth uninhabitable.
For many years, there was an influential futurist subculture around Drexler’s thought and his institute, the Foresight Institute. And nanotechnology made its way into SF pop culture, especially the idea of a ‘nanobot’. Nanobots are still there as an SF trope—and are sometimes cited as an inspiration in real research that involves some kind of controlled nanomechanical process—but I think it’s unquestionable that the hype that surrounded that nano-futurist community has greatly diminished, as the years kept passing without the occurrence of the “assembler breakthrough” (the ability to make nonbiological nano-manufacturing agents).
There is a definite sense in which I think Eliezer eventually took up a place in culture analogous to that once held by Eric Drexler. Drexler had articulated a techno-eschatology in which the entire future revolved around the rise of nanotechnology (and his core idea for how humanity could survive was to spread into space; he had other ideas too, but I’d say that’s the essence of his big-picture strategy), and it was underpinned not just by SF musings but also by nanomachine designs, complete with engineering calculations. With Eliezer, the crucial technology is artificial intelligence, the core idea is alignment versus extinction via (e.g.) paperclip maximizer, and the technical plausibility arguments come from computer science rather than physics.
Those who are suspicious of utopian and dystopian thought in general, including their technologically motivated forms, are happy to say that Drexler’s extreme nano-futurology faded because something about it was never possible, and that the same fate awaits Eliezer’s extreme AI-futurology. But as for me, I find the arguments in both cases quite logical. And that raises the question, even as we live through a rise in AI capabilities that is keeping Eliezer’s concerns very topical, why did Drexler’s nano-futurism fade… not just in the sense that e.g. the assembler breakthrough never became a recurring topic of public concern, the way that climate change did; but also in the sense that, e.g., you don’t see effective altruists worrying about the assembler breakthrough, and this is entirely because they are living in the 2020s; if effective altruism had existed in the 1990s, there’s little doubt that gray goo and nanowar would have been high on the list of existential risks.
Understanding what happened to Drexler’s nano-futurism requires understanding what kind of ‘nano’ or chemical progress has occurred since those days, and whether the failure of certain things to eventuate is because they are impossible, because not enough of the right people were interested, because the relevant research was starved of funds and suppressed (but then, by who, how, and why), or because it’s hard and we didn’t cross the right threshold yet, the way that artificial neural networks couldn’t really take off until the hardware for deep learning existed.
It seems clear that ‘nanotechnology’ in the form of everything biological, is still developing powerfully and in an uninhibited way. The Covid pandemic has actually given us a glimpse of what a war against a nano-replicator is like, in the era of a global information society with molecular tools. And gene editing, synthetic biology, organoids, all kinds of macabre cyborgian experiments on lab animals, etc, develop unabated in our biotech society.
As for the non-biological side… it was sometimes joked that ‘nanotechnology’ is just a synonym for ‘chemistry’. Obviously, the world of chemical experiment and technique, quantum manipulations of atoms, design of new materials—all that continues to progress too. So it seems that what really hasn’t happened, is that specific vision of assemblers, nanocomputers, and nanorobots made from diamond-like substances.
Again, one may say: it’s possible, it just hasn’t happened yet for some reason. The world of 2D carbon substances—buckyballs, buckytubes, graphenes—seems to me the closest that we’ve come so far. All that research is still developing, and perhaps it will eventually bootstrap its way to the Drexlerian level of nanotechnology, once the right critical thresholds are passed… Or, one might say that Eric’s vision (assemblers, nanocomputers, nanorobots) will come to pass, without even requiring “diamondoid” nanotechnology—instead it will happen via synthetic biology and/or other chemical pathways.
My own opinion is that the diamondoid nanotechnology seems like it should be possible, but I wonder about its biocompatibility—a crucial theme in the nanomedical research of Robert Freitas, who was the champion of medical applications as envisaged by Drexler. I am just skeptical about the capacity of such systems to be useful in a biochemical environment. Speaking of astronomically sized intelligences, Stanislaw Lem once wrote that “only a star can survive among stars”, meaning that such intelligences should have superficial similarities to natural celestial bodies, because they are shaped by a common physical regime; and perhaps biomedically useful nanomachines must necessarily resemble and operate like the protein complexes of natural biology, because they have to work in that same regime of soluble biopolymers.
Specifically with respect to ‘gray goo’, i.e. nonbiological replicators that eat the ecosphere (keywords include ‘aerovore’ and ‘ecophagy’), it seems like it ought to be physically possible, and the only reason we don’t need to worry so much about diamondoid aerovores smothering the earth, is that for some reason, the diamondoid kind of nanotechnology has received very little research funding.
Fascinating history, Mitchell! :) I share your confusion about why more EAs aren’t interested in Drexlerian nanotech, but are interested in AGI.
I would indeed guess that this is related to the deep learning revolution making AI-in-general feel more plausible/near/real, while we aren’t experiencing an analogous revolution that feels similarly relevant to nanotech. That is, I don’t think it’s mostly based on EAs having worked out inside-view models of how far off AGI vs. nanotech is.
I’d guess similar factors are responsible for EAs being less interested in whole-brain emulation? (Though in that case there are complicating factors like ‘ems have various conceptual and technological connections to AI’.)
Alternatively, it could be simple founder effects—various EA leaders do have various models saying ‘AGI is likely to come before nanotech or ems’, and then this shapes what the larger community tends to be interested in.
From Drexler’s conversation with Open Phil:
No one has a reason to build grey goo (outside of rare omnicidal crazy people), so it’s not worth worrying about, unless someday random crazy people can create arbitrary nanosystems in their backyard.
AGI is different because it introduces (very powerful) optimization in bad directions, without requiring any pre-existing ill intent to get the ball rolling.
One view I’ve seen is that perverse incentives did it. Widespread interest in nanotechnology led to governmental funding of the relevant research, which caused a competition within academic circles over that funding, and discrediting certain avenues of research was an easier way to win the competition than actually making progress. To quote:
Source: this review of Where’s My Flying Car?
One wonders if similar institutional sabotage of AI research is possible, but we’re probably past the point where that might’ve worked (if that even was what did nanotech in).
I guess I missed the term gray goo. I apologize for this and for my bad English.
Is it possible to replace it with ‘using nanotechnologies to attain a decisive strategic advantage’?
I mean the discussion of the prospects for nanotechnology on SL4 20+ years ago, especially this:
I understand that EY’s views have changed in many ways since then. But I am interested in the views of experts on the possibility of using nanotechnology for the scenarios he implies now. This is the little I found.
Makes sense, thanks for the reference! :)
I really like this post format—a numbered list of beliefs that come together to form a model—it makes the model very clear, makes it easier to see where you differ and what the cruxes are, and makes it easier to discuss.
I think a weak proto-AI could be useful for step 1 of the following plan:
Invent some human-enhancement methods to massively amplify the abilities of your researchers.
Use your enhanced researchers to make a friendly AGI faster than your competitors make any AGI.
Some human-enhancement ideas sitting in my mind:
reducing or removing the need for sleep or exercise
reversing age-related cognitive decline (and physical decline for that matter)
electronically facilitated brain-to-brain communication
brain-computer interfaces that are faster and less effortful than keyboards or screens
drugs altering brain chemistry so you think faster or better (e.g. is there any measurable chemical difference between average brains and geniuses’ brains? If so, can you use drugs to make the one work like the other?)
enlarging heads via surgery and growing more brain cells (this is slightly horrifying)
getting something close to The Matrix’s “downloading skills into your brain”—teaching modules that observe your brain-state, possibly alter it directly, and possibly implant some hardware whose interface your brain learns
It seems like any one of these, if it worked, might give you the 1.5x factor mentioned (or more), assuming your competitors didn’t adopt it quickly enough. (Would they? I don’t think I could see a normal company requiring all its researchers to go through these enhancement procedures in, say, less than a few years after they’d been developed; nor Western governments. China, maybe.) A proto-AI is not necessary for any of them, but it might be the fastest way.
I like the idea of enhanced researchers for a few reasons:
going off Fred Brooks’s “The Mythical Man-Month”, a smallish group of enhanced researchers might outperform arbitrarily large groups of “normal” humans
leaks, and other “unilateralist’s curse” issues, are less of an issue with smaller groups
at least for some of these enhancements, the programmers would be less likely to make bugs/mistakes, and therefore, if there are situations where “one stupid bug makes the difference between success and failure”, we’d have a better chance
the rest of humanity would benefit from these enhancement techniques too
The human enhancement part of this would need to move really really really fast to beat the AGI power scaling and proliferation timelines.
Hmm, that seems to depend on what assumptions you make. Suppose it takes N years to develop proto-AI to the point where it can find a sleep-mimicking drug, and after that it would take M more years to develop general AI, and M * 1.5 more years to develop friendly general AI. If M is much higher than how long it takes for the FAI researchers to start using the drug (which I imagine could be a few months), then the FAI researchers might be enhanced for most of the M-year period before competitors make AGI.
I think you’re assuming M is really low. My intuition is that many of these enhancements wouldn’t require much more than a well-funded team and years of work with today’s technology (but fewer years with proto-AI), and that N is much smaller than M. But this depends a lot on the details of the enhancement problems and on the current state of biotechnology and bioinformatics, and I don’t know very much. Are there people associated with MIRI and such who work on human bioenhancement?
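To make the race dynamics here concrete, here is a toy sketch. All the numbers and the function itself are mine, not from the thread, and “1.5 * M for aligned AGI” is one reading of the figures above (alignment-inclusive work priced at 1.5x the unaligned work, done by an enhanced team at some speedup factor):

```python
# Toy version of the N / M timeline reasoning above. All numbers are
# hypothetical and only illustrate the structure of the argument.
def aligned_finishes_first(N, M, speedup, adoption_lag):
    """Does the enhanced alignment team beat unenhanced competitors?

    N: years until proto-AI can find the enhancement (e.g. the drug)
    M: further years for competitors to reach unaligned AGI
    speedup: factor by which the enhancement accelerates the aligned team
    adoption_lag: years for the aligned team to adopt the enhancement
    """
    competitors_done = N + M
    # Aligned team: wait for proto-AI, adopt the enhancement, then do
    # the 1.5 * M of alignment-inclusive work at `speedup` times the pace.
    aligned_done = N + adoption_lag + 1.5 * M / speedup
    return aligned_done < competitors_done

# With only a 1.5x speedup, even a few months' adoption lag loses the
# race; a 2x speedup wins with years to spare.
print(aligned_finishes_first(5, 10, 1.5, 0.25))  # False
print(aligned_finishes_first(5, 10, 2.0, 0.25))  # True
```

The qualitative takeaway matches the comment: what matters is whether the speedup factor exceeds the alignment tax (here 1.5x) by enough to also absorb the adoption lag, which is why a short lag relative to M is so important.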
I’d like to believe this, but the coronavirus disaster gives me pause. Seems like the ONE relevant bit of powerful science/technology that wasn’t heavily restricted or outright banned was gain-of-function research, which may or may not have been responsible for the whole mess in the first place (and certainly raises the danger of it happening again).
And I notice that the same forces/people/institutions who unpersuasively defend the FDA’s anti-vaccine policies unpersuasively defend the legality of GoF… I honestly don’t have any model of what’s going on there — and what these forces/people/institutions said about the White House’s push for boosters convinces me it is more complicated than instinctively aligning with authority, power, or partisan interests. Does anyone have a model for this?
But in lieu of real understanding: I don’t think we can count on our civilizational dysfunction to accidentally, coincidentally help us here. If our civilization can’t manage to stop GoF, while it simultaneously outlaws vaccines and HCTs, I don’t think we should expect it to slow down AI by very much.
I don’t literally expect the scenario where, say… the outrage machine calls for banning AI Alignment research and successfully restricts it, while our civilization feverishly pours all of its remaining ability to Do into scaling AI. But I don’t expect it to be much better than that, from a civilizational competence point of view. (At least not on the current path, and I don’t currently see any members of this community making any massive heroic efforts to change that that look like they will actually succeed.)
How about the phrase “positive model violation”? Later in that post Eliezer is recorded as saying:
I think “model violation” and “surprising development” point to different things. For example:
If I buy a lottery ticket and win a million dollars from the lottery, that is a “surprising development”. If I don’t buy a lottery ticket and still win a million dollars from the lottery, that is also a “model violation”, as it violates our model of how lotteries work.
If my nemesis is struck by lightning from the sky, that is a “surprising development”. If my nemesis is struck by lightning cast by a magic wand, that is also a “model violation”, as it violates our model of lightning.
If many AI researchers independently decide to switch to safety work, that is a “surprising development”. If corrigibility turns out to be “natural” for an AI, that is also a “model violation” of (my model of) Eliezer’s models of corrigibility and intelligence.
My models of lightning and lotteries are relatively robust, and the model violations are negligible probability. Models of the future of AI development and intelligence and geo-politics and human nature and so forth are presumably much weaker, so we can reasonably expect some model violations, positive and negative.
From what I’ve seen in discussions over the future of humanity, the following options are projected, from worst to best:
Collapse of humanity due to AGI taking over and killing everyone
Collapse of humanity, but some humans remain alive in “The Matrix” or “zoo” of some kind
Collapse due to existing technology like nuclear bombs
Collapse due to climate change or meteor or mega-volcano or alien invasion
Collapse due to exhausting all useful elements and sources of energy
Reversal to pre-20th century levels and stagnation due to combination of 2/3/4
Stagnation at 21st century level tech
Stagnation at a non-crazy level of tech—say good enough to have a colony on Pluto, but no human ever leaving the Solar system
Interstellar civ based on “virtual humans”/”brains in a jar”
Interstellar civ run by fully aligned AGI, with human intelligence being so weak that it basically plays no role
Interstellar civilization primarily based on human intelligence (Star Trek pre-Data?)
Interstellar civ based on humans+AGI working together peacefully (Star Trek’s implied future given Data’s evolution?)
Multiverse civilization of some kind (Star Trek’s Q?)
Is this ranking approximately correct? If so, why do we care so much if “AGI” or “virtual humans” end up ruling the universe? Does it make a difference if the AGI is based on human intelligence and not on some alien brain structure, given that biological humans will stagnate/die out in both cases? Or is “virtual humans” just as bad of an outcome and falls into the same bucket of “unaligned AGI”? What goal are we truly trying to optimize here?
There are fates worse than 1. Fortunately they aren’t particularly likely, but they are scary nonetheless.
s-risks?
Might be good to elaborate on this one a bit, why that might make ‘hanging out’ possible, i.e., diminishing returns. (Though if a substantial improvement can be made by a) tweaks, b) adding another technique or something, then maybe ‘hanging out’ won’t happen.)
Amusingly, true on two levels. (Though there’s worry, people won’t converge on that strategy anyway.)
Sort of ‘corrigibility’ is ‘Corrigibility without (something like) self-shutdown or self-destruct.’
Is this about strategy/techniques, or reward?
And it’s not particularly useful for convincing people to do things like ‘not publish’.
(I expected a caveat here, like ‘if aligned’.)
What are pivotal acts, aside from ‘persuading people’? Nanotech, or is the bar lower?
Is proving things useful to ‘ai’? Like, in Go, or Starcraft? Or are strategies always not handled that way?
>Which is basically this: I notice my inside view, while not confident in this, continues to not expect current methods to be sufficient for AGI, and expects the final form to be more different than I understand Eliezer/MIRI to think it is going to be, and that the AGI problem (not counting alignment, where I think we largely agree on difficulty) is ‘harder’ than Eliezer/MIRI think it is.
Could you share why you think that current methods are not sufficient to produce AGI?
Some context:
After reading Discussion with Eliezer Yudkowsky on AGI interventions I thought about the question “Are current methods sufficient to produce AGI?” for a while. I thought I’d check if neural nets are Turing-complete, and a quick search says they are. To me this looks like a strong clue that we should be able to produce AGI with current methods.
But I remembered reading some people who generally seemed better informed than me having doubts.
I’d like to understand what those doubts are (and why there is apparent disagreement on the subject).
I want to be clear that my inside view is based on less knowledge and less time thinking carefully, and thus has fewer and less accurate gears, than I would like, or than I expect is true of many others’ models here (e.g. Eliezer’s).
Unpacking my reasoning fully isn’t something I can do in a reply, but if I had to say a few more words: it’s related to the idea that AGI will use qualitatively different methods and reasoning, and to not thinking that current methods can get there. We’re getting our progress out of figuring out how to do more and more things without thinking in this sense, rather than learning how to think in this sense, and also finding out that a lot more of what humans do all day doesn’t require thinking. I felt like GPT-3 taught me a lot about humans and how much they’re on autopilot and how they still get along fine, and I went through an arc where it seemed curious, then scary, then less scary on this front.
I’m emphasizing that this is intuition pumping my inside view, rather than things I endorse or think should persuade anyone, and my focus very much was elsewhere.
Echoing the other reply: Turing completeness seems like a not-relevant test.
I agree that GPT-3 sounds like a person on autopilot.
As Sarah Constantin said: Humans Who Are Not Concentrating Are Not General Intelligences
I have only a very vague idea of what these different ways of reasoning are (vaguely related to “fast and effortless” vs. “slow and effortful” in humans? I don’t know how that translates into what’s actually going on, rather than how it feels to me).
Thank you for pointing me to a thing I’d like to understand better.
Turing completeness is definitely the wrong metric for determining whether a method is a path to AGI. My learning algorithm of “generate a random Turing machine, test it on the data, and keep it if it does the best job of all the other Turing machines I’ve generated, repeat” is clearly Turing complete, and will eventually learn any computable process, but it’s very inefficient, and we shouldn’t expect AGI to be generated using that algorithm anytime in the near future.
Similarly, neural networks with one hidden layer are universal function approximators, and yet modern methods use very deep neural networks with lots of internal structure (convolutions, recurrences) because they learn faster, even though a single hidden layer is enough in theory to achieve the same tasks.
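A minimal sketch of the point, with ops and a target function made up for illustration: blind search over random programs (a stand-in for “generate a random Turing machine, test it, keep the best”) is complete in the same useless sense that Turing completeness is, and only succeeds here because the search space is tiny.

```python
import random

def run(program, x):
    # Apply a sequence of primitive ops (our stand-in for running a
    # randomly generated program) to the input.
    for op in program:
        x = op(x)
    return x

def loss(program, data):
    # Squared error of the program's outputs on the target pairs.
    return sum((run(program, x) - y) ** 2 for x, y in data)

def blind_search(data, ops, steps=2000, length=3, seed=0):
    # The learner from the comment: sample random programs, keep the
    # best so far. Complete in the limit, hopeless at scale.
    rng = random.Random(seed)
    best, best_loss = None, float("inf")
    for _ in range(steps):
        candidate = [rng.choice(ops) for _ in range(length)]
        candidate_loss = loss(candidate, data)
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best, best_loss

# Target f(x) = 2x + 1 is expressible as (+1, then *2, then -1); with
# only 3**3 = 27 candidate programs, blind search finds it quickly.
# Add a few more primitives or a longer program and it stops working.
ops = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1]
data = [(x, 2 * x + 1) for x in range(5)]
program, best_loss = blind_search(data, ops)
print(best_loss)  # 0
```

The search space grows exponentially in program length, so “can represent any computable function” tells us nothing about whether the search procedure finds a good one in reasonable time, which is the actual question for AGI.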
I was thinking that current methods could produce AGI (because Turing-complete), and they are apparently good at producing some algorithms, so they might be reasonably good at producing AGI.
2nd part of that wasn’t explicit for me before your answer so thank you :)
I don’t see any glaring flaws in any of the items on the inside view, and, obviously, I would not be qualified to evaluate them, anyway. However, when I try to take an outside view on this, something doesn’t add up.
Specifically, it looks like anything that looks like a civilization should end up evolving, naturally or artificially, into an unsafe AGI most of the time, some version of Hanson’s grabby aliens. We don’t see anything like that, at least not in any detectable way. And so we hit the Fermi paradox, where an unremarkable backwater system is apparently the first one about to do so, many billions of years after the Big Bang. It is not outright impossible, but the odds do not match up with anything presented by Eliezer. Hanson’s reason for why we don’t see grabby aliens is < 1⁄10,000 “conversion rate” of “non-grabby to grabby transition”:
However, an unaligned AGI that ends humanity ought to have a much higher chance of transition into grabbiness than that, so there is a contradiction between the predictions of unsafe AGI takeover and the lack of evidence of it happening in our past lightcone.
Low conversion rate to grabbiness is only needed in the model if you think there are non-grabby aliens nearby. High conversion rate is possible if the great filter is in our past and industrial civilizations are incredibly rare.
You haven’t commented much on Eliezer’s views on the social approach to slow down the development of AGI—the blocks starting with
and
What’s your take on this?
On slowing down, I’d say strong inside view agreement; I don’t see a way either, not without something far more universal. There are too many next competitors. It could have been included; it was probably excluded because it seemed to follow from other points and was thus too obvious.
On the likelihood of backfire, strong inside view agreement. Not sure why that point didn’t make it into the post, but consider this an unofficial extra point (43?), of something like (paraphrase, attempt 1) “Making the public broadly aware of and afraid of these scenarios is likely to backfire and result in counterproductive action.”
What particular counterproductive actions by the public are we hoping to avoid?
On the object level, it looks like there is a spectrum of society-level interventions, starting from “incentivizing research that wouldn’t be published” (which Eliezer supports) and going all the way to “scaring the hell out of the general public” and beyond. For example, I can think of removing $FB and $NVDA from ESG funds, disincentivizing the publication of AI code and research articles, and introducing regulation of the compute-producing industry. Where do you think the line should be drawn between reasonable interventions and ones that are most likely to backfire?
On the meta level, managing and aligning the whole AGI foom starts not at some abstract point 50 years in the future, but right now, with the management of ML/AI research by humans. Do you know of any practical results produced by the alignment research community that can be used right now to manage societal backfire and align incentives?
RE: claim 25 about the need for research organisations, my first thought is that government national security organisations might be suitable venues for this kind of research, as they have several apparent advantages:
Large budgets
Existing culture and infrastructure for research in secret with internal compartmentalisation
Comparatively good track record for keeping crypto results secret, such as GCHQ’s internal invention of public-key encryption years before its public discovery
Routes to internal prestige and advancement without external publication
Preventing the creation of unaligned AI would accord with their national security goals
However, they may introduce problems of their own:
Clearance requirements limit the talent pool that can work with them
As government organisations with less of a start-up culture, they may be less accommodating of this kind of research
An information leak that one organisation is researching this area could lead to international arms races
Tools suitable for public release that are developed may be seen as untrustworthy by association, such as the skepticism towards the NSA’s crypto advice
A research group would be more beholden to higher-ups, who would likely be less sympathetic to the necessity of alignment work relative to capability work
Has this option been discussed already?
[AI risk question, not sure where to ask]
Hey, could you (or someone) help me understand how useful this would be? (Or, what would Yudkowsky say about it?)
I’m asking because this might be something that I, or someone that I know, could do
Some people here inspire me to make predictions ;) So here’s my attempt:
My guess, mainly based on this image (linked from the post), is that he’d say it’s a subcategory of “getting models to output things based only on their training data, while treating them as a black box and still assuming unexpected outputs will happen sometimes,” as well as “this might work well for training, but obviously not for an AGI,” “if we’re going to talk about limiting a model’s output, Redwood Research is more of a way to go,” and perhaps “this will just advance AI faster.”
I agree that “think for yourself” is important. That includes updating on the words of the smart thinkers who read a lot of the relevant material. In which category I include Zvi, Eliezer, Nate Soares, Stuart Armstrong, Anders Sandberg, Stuart Russell, Rohin Shah, Paul Christiano, and on and on.