[RETRACTED] It’s time for EA leadership to pull the short-timelines fire alarm.
[EDIT 4/10/2022: This post was rash and ill-conceived, and did not have clearly defined goals nor meet the vaguely-defined ones. I apologize to everyone on here; you should probably update accordingly about my opinions in the future. In retrospect, I was trying to express an emotion of exasperation related to the recent news I later mention, which I do think has decreased timelines broadly across the ML world.
While I stand by my claims on roughly-human AGI probability, I no longer stand by my statement that “we should pull the fire-alarm”. That is unlikely to lead to the calculated concerted effort we need to maximize our odds of successful coordination. Nor is it at all clear, given the timeline mechanism I described here, that AGI built in this way would be able to quickly FOOM, the primary point of concern for such a fire alarm.
I’ve left the rest of the post here as a record.
]
Based on the past week’s worth of papers, it seems very possible (>30%) that we are now in the crunch-time section of a short-timelines world, and that we have 3-7 years until Moore’s law and organizational prioritization put these systems at extremely dangerous levels of capability.[1]
The papers I’m thinking about:
https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html
https://twitter.com/andyzengtweets/status/1512089759497269251
https://twitter.com/hausman_k/status/1511152160695730181
It seems altogether possible that it would not take long, given these advances and a moderate amount of iteration, to create an agentic system capable of long-term decision-making.
If you want to think of this as the public miscalibrated Bayesian-updating of one person, you should feel free to do that. If this was a conclusion you reached independently, though, I want to make sure we coordinate.
For those who haven’t grappled with what actual advanced AI would mean, especially if many different organizations can achieve it:
No one knows how to build an AI system that accomplishes goals, that also isn’t going to prevent you from turning it off. It’s an unsolved research problem. Researchers have been trying for decades, but none of them think they’ve succeeded yet.
Unfortunately, for most conceivable goals you could give an AI system, the best way to achieve that goal (taken literally, which is the only thing computers know how to do) is to make sure it can’t be turned off. Otherwise, it might be turned off, and then (its version of) the goal is much less likely to happen.
If the AI has any way of accessing the internet, it will copy itself to as many places as it can, and then continue doing whatever it thinks it’s supposed to be doing. At this point, it becomes quite likely that we cannot limit its impact, which is likely to involve much more mayhem, possibly including making itself smarter and making sure that humans aren’t capable of creating other AIs that could turn it off. There’s no off button for the internet.
Most AI researchers do not believe in ~AGI, and thus have not considered the technical details of reward-specification for human-level AI models. Thus, it is as of today very likely that someone, somewhere, will do this anyway. Getting every AI expert in the world, and those they work with, to think through this is the single most important thing we can do.
It is functionally impossible to build a complex system without ever getting to iterate (which we can’t do without an off-switch), and then get lucky and it just works. Every human invention ever has required trial and error to perfect (e.g. planes, computer software). If we have no off-switch, and the system just keeps getting smarter, and we made anything other than the perfect reward function (which, again, no one knows how to do), the global consequences are irreversible.
Do not make it easier for more people to build such systems. Do not build them yourself. If you think you know why this argument is wrong, please please please post it here or elsewhere. Many people have spent their lives trying to find the gap in this logic; if you raise a point that hasn’t previously been refuted, I will personally pay you $1,000.
If this freaks you out, I’m really sorry. I wish we didn’t have to be here. You have permission to listen to everyone else, and not take this that seriously yet. If you’re asking yourself “what can I do”, there are people who’ve spent decades coming up with plans, and we should listen to them.
From my vantage point, the only real answers at this point seem like mass public within-expert advocacy (with as a first step, going through the AI experts who will inevitably be updating on this information) to try and get compute usage restrictions in place, since no one wants anyone else to accidentally deploy an un-airgapped agentic system with no reliable off-switch, even if they think they themselves wouldn’t make that mistake.
Who, in practice, pulls the EA-world fire alarm? Is it Holden Karnofsky? If so, who does he rely on for evidence, and/or what’s stopping those AI alignment-familiar experts from pulling the fire alarm?
The EA community getting on board and collectively switching to short-timelines-AI-public-advocacy efforts seems pretty critical in this situation, to provide talent for mass advocacy among AI experts and their adjacent social/professional networks. The faster and more emphatically it occurs, the more of a chance we stand at propagating the signal to ~all major AI labs (including those in the US, UK, and China).
Who do we need to convince within EA/EA leadership of this? For those of you reading this, do you rate it as less than 30% that we are currently within a fast takeoff, and if not are you basing that on the experts you’d trust having considered the past week’s evidence?
(Crying wolf isn’t really a thing here; the societal impact of these capabilities is undeniable and you will not lose credibility even if 3 years from now these systems haven’t yet FOOMed, because the big changes will be obvious and you’ll have predicted that right.)
EDIT: if anyone adjacent to such a person wants to discuss why the evidence seems very strong, what needs to be done within the next few weeks/months, please do DM me.
- ^
Whether this should also be considered “fast takeoff”, in the sense of recursive self-improvement, is less clear. However, with human improvement alone it seems quite possible we will get to extremely dangerous systems, with no clear deployment limitations. [This was previously the title of the post; I used the term incorrectly.]
I’ve found this week’s progress pretty upsetting.
I’m fairly scared that if “the EA community” attempted to pivot to running the fire-alarm, that nothing real would happen, we’d spend our chips, and we’d end up in some complicated plot that had no real chance of working whilst maybe giving up our ability to think carefully any more. Like, there’s no plan stated in the post. If someone has a plan that has a chance of doing any particular thing that’d be more interesting.
I spend various amounts of time in close proximity to a bunch of parts of “EA leadership”, and if you convince me that a strategy will work I could advocate for it.
(Also happy to receive DMs if you want to keep specifics private.)
I find it slightly meta-upsetting that we are already measuring progress in weeks.
Announcements of progress tend to clump together before the major AI conferences.
How much do you think that was a factor in the recent releases happening in short succession? Is there a conference happening soon that this was for?
This was pointed out by someone working in ML—but I can’t find major conference deadlines that this would have been in response to, so maybe it’s not a useful explanation.
But NeurIPS submissions open this week—these may have been intended for that, or precede that to claim priority in case someone else submits something similar. I’d be interested in looking at release dates / times for previous big advances compared to conferences.
No disagreements here; I just want to note that if “the EA community” waits too long for such a pivot, at some point AI labs will probably be faced with people from the general population protesting because even now a substantial share of the US population views the AI progress in a very negative light. Even if these protests don’t accomplish anything directly, they might indirectly affect any future efforts. For example, an EA-run fire alarm might be compromised a bit because the memetic ground would already be captured. In this case, the concept of “AI risk” would, in the minds of AI researchers, shift from “obscure overconfident hypotheticals of a nerdy philosophy” to “people with different demographics, fewer years of education, and a different political party than us being totally unreasonable over something that we understand far better”.
I’m not sure I agree. The post you linked to is titled “A majority of the public supports AI development.” Only 10% of the population is strongly opposed to it. You’re making an implicit assumption that the public will turn against the technology in the next couple of years, but I see no reason to believe that.
In the past, public opinion has really only turned against a technology following a big disaster. But we may not see a big AI-induced disaster before a change in public opinion becomes irrelevant to AGI.
And that’s really more like 6% after you take into account the lizardman constant.
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
Positive reinforcement for being able to do a retraction! Even when it’s the right thing to do it can be a hard thing to do.
DMed.
I think that we might want to encourage people to contribute what they can towards safety even if they only think they can make small contributions. I think that most people could find something useful to do if they really thought about it and were willing to put any ego aside.
Sounds like a nice thing to think, but I don’t put much stock in it.
Look at the sidebar here. Is this anywhere near optimal? I don’t think so. Surely it should be encouraging people to undertake logical first steps towards becoming involved in alignment (e.g. the AGI Safety Fundamentals course, 80,000 Hours coaching, or booking a call with AI Safety Support).
In a few weeks, I’ll probably be spending a few hours setting up a website for AI Safety Australia and NZ (a prospective org to do local movement-building). Lots of people have web development capabilities, but you don’t even need that with things like Wordpress.
I’ve been spending time going through recent threads and encouraging people who’ve expressed interest in doing something about this, but are unsure what to do, to consider a few logical next steps.
Or maybe just reading about safety and answering questions on the Stampy Wiki (https://stampy.ai)?
Or failing everything else, just do some local EA movement building and make sure to run a few safety events.
I don’t know, it just seems like there’s low-hanging fruit all over the place. Not claiming these are huge impacts, but beats doing nothing.
I think it is good to do things if you have traction. I think it is good to grow the things you can do.
It seems like you’re pointing at a model where society can make progress on safety by having a bunch of people put some marginal effort towards it. That seems insane to me—have I misunderstood you?
Sorry, I don’t quite understand your objection? Is it that you don’t think these are net-positive, that you think all of these little bits will merely add up to a rounding error or that you think timelines are too short for them to make a difference?
I think the impact of little bits of “people engage with the problem” is not significantly positive. Maybe it rounds to zero. Maybe it is negative, if people engaging lightly flood serious people with noisy requests.
Hard research problems just don’t get solved by people thinking for five minutes. There are some people who can make real contributions [0] by thinking for ~five hours per week for a couple of months, but they are quite rare.
(This is orthogonal to the current discussion, but: I had not heard of stampy.ai before your comment. Probably you should refer to it as stampy.ai, because googling “stampy wiki” gives it as the ~fifth result, behind some other stuff that is kind of absurd.)
[0] say, write a blog post that gets read and incorporated into serious people’s world models
I’m not suggesting that they contribute towards research, just that if they were able to reliably get things done they’d be able to find someone who’d benefit from a volunteer. But I’m guessing you think they’d waste people’s time by sending them a bunch of emails asking if they need help? Or that a lot of people who volunteer then cause issues by being unreliable?
The question isn’t so much whether a contribution toward safety is small or big, but whether you can actually find an action that’s certain to make even a small contribution toward safety. If you think there are a bunch of small things toward safety that can be done, what do you have in mind?
See this comment.
About 15 years ago, before I’d started professionally studying and doing machine learning research and development, my timeline had most of its probability mass around 60–90 years from then. This was based on my neuroscience studies and thinking about how long I thought it would take to build a sufficiently accurate emulation of the human brain to be functional. About 8 years ago, studying machine learning full time, AlphaGo coming out was inspiration for me to carefully rethink my position, and I realized there were a fair number of shortcuts off my longer figure that made sense, and updated to more like 40–60 years. About 3 years ago, GPT-2 gave me another reason to rethink with my then fuller understanding. I updated to 15–30 years. In the past couple of years, with the repeated success of various explorations of the scaling law, the apparent willingness of the global community to rapidly scale investments in large compute expenditures, and yet further knowledge of the field, I updated to more like 2–15 years as having 80% of my probability mass. I’d put most of that in the 6–12 year range, but I wouldn’t be shocked if things turned out to be easier than expected and something really took off next year.
One of the things that makes me think the BioAnchors estimate is a bit too far into the future is that I know from neuroscience that it’s possible for a human to have a sufficient set of brain functions to count as a General Intelligence by a fairly reasonable standard with significant chunks of their brain dead or missing. I mean, they’ll not be in great shape as they’ll be missing some stuff, but if the stuff they’re missing is non-critical they can still function well enough to be a minimal GI. Plenty well enough to be scary if they were a self-improving self-replicating agentive AI.
So anyway, yeah, I’ve been scared for a while now. Latest news has just reinforced my belief we are in a short timeline world, not surprised me. Glad to see more people getting on board with my point of view.
If AI timelines are short, then I wouldn’t focus on public advocacy, but the decision-makers. Public opinion changes slowly and succeeding may even interfere with the ability of experts to make decisions.
I would also suggest that someone should focus on within-EA advocacy too (whilst being open about any possible limitations or uncertainties in their understanding).
To clarify, by “public advocacy” I should’ve said “within-expert advocacy, i.e. AI researchers (not just at AGI-capable organizations)”. I’ll fix that.
Along these lines, I’ve been thinking maybe the best chance we have will be finding ways to directly support the major AI labs most likely to create advanced AI, to help guide their decisions toward better outcomes.
Like perhaps some well-chosen representatives from the EA AI safety community could be doing free, regular consulting with DeepMind and OpenAI, etc. about safety. Find some way to be a resource they consider useful but also help them keep safety top-of-mind. If free isn’t good enough, with the current state of funding in EA, we could even pay the companies just to meet regularly with the AI safety consultants.
If this was realized, of course the consultants would have to sign NDAs and so what’s going on couldn’t be openly discussed on forums like LessWrong. (I suppose this kind of arrangement may already be happening and we just aren’t aware of it because of this.)
Update: Chris’s suggestion in the reply to this comment just for EA funders to offer the labs money to hire more safety researchers seems simpler and more workable than the above consultant model.
This is a rough idea—a lot more thought needs to go into exactly what to do and how to do it. But something like this could be extremely impactful. A handful of people at one of these AI companies could well be soon determining the fate of humanity with their engineering decisions. If we could positively influence them in some way, that may be our best hope.
Yeah, I wonder if we could offer these companies funding to take on more AI Safety researchers? Even if they’re well-resourced, management probably wants to look financially responsible.
DeepMind and OpenAI both already employ teams of existential-risk-focused AI safety researchers. While I don’t personally work on any of these teams, I get the impression from speaking to them that they are much more talent-constrained than resource-constrained.
I’m not sure how to alleviate this problem in the short term. My best guess would be free bootcamp-style training for value-aligned people who are promising researchers but lack specific relevant skills. For example, ML engineering training or formal mathematics education for junior AIS researchers who would plausibly be competitive hires if that part of their background were strengthened.
However, I don’t think that offering AI safety researchers as “free consultants” to these organizations would have much impact. I doubt the organizations would accept since they already have relevant internal teams, and AI safety researchers can presumably have greater impact working within the organization than as external consultants.
The low-effort version of this would be, instead of spinning up your own bootcamp, having value-aligned people apply for a grant to the Long-Term Future Fund to participate in a bootcamp.
Well, the mass public advocacy in the strict sense may not change the public opinion in a short time, but I’m still willing to give it a try.
I mean, what would be the actual downsides of a literal mob showing up at DeepMind’s headquarters holding “please align AI” giant banners?
(EDIT: maybe “mob” is not the right word, I’m not advocating for angry mobs burning down the AI labs… “crowd” would have been better).
I’m not completely opposed to public outreach. I think there should be some attempts to address misconceptions (ie. such that it is like Terminator; or at least what people remember/infer about Terminator).
I haven’t really thought that through. It might be worth talking to people and see what they say.
I’m pretty opposed to public outreach to get support for alignment, but the alternative goal of whipping up enough hysteria to destroy the field of AI/the AGI development groups killing us seems much more doable. Reason being from my lifelong experience observing public discourse on topics I have expert knowledge on (e.g. nuclear weapons, China), it seems completely impossible to implant the exact right ideas into the public mind, especially for a complex subject. Once you attract attention to a topic, no matter how much effort you put into presenting the proper arguments, the conversation and people’s beliefs inevitably trend toward simple & meme-y/emotionally riveting ideas, instead of the accurate ones. (Looking at the popular discourse on climate change is another good illustration of this.)
But in this case, maybe even if people latch onto misguided fears about Terminator or whatever, as long as they have some sort of intense fear of AI, it can still produce the intended actions. To be clear I’m still very unsure whether such a campaign is a good idea at this point, just a thought.
I think reaching out to governments is a more direct lever: civilians don’t have the power to shut down AI themselves (unless mobs literally burn down all the AGI offices), the goal with public messaging would be to convince them to pressure the leadership to ban it right? Why not cut out the middleman and make the leaders see the dire danger directly?
Holding please align AI signs in front of DeepMind’s headquarters is an idea. Attempting to persuade “the general public” is a bad idea. “The general public” will react too slowly and without any competence. We need to target people actually doing the burning.
I am deeply worried about the prospect of a botched fire alarm response. In my opinion, the most likely result of a successful fire alarm would not be that society suddenly gets its act together and finds the best way to develop AI safely. Rather, the most likely result is that governments and other institutions implement very hasty and poorly thought-out policy, aimed at signaling that they are doing “everything they can” to prevent AI catastrophe. In practice, this means poorly targeted bans, stigmatization, and a redistribution of power from current researchers to bureaucratic agencies that EAs have no control over.
I do concede this is a real enough risk that this is played wrong, and would very strongly encourage those considering independent efforts to centrally coordinate so we maximize the odds of any distributed actions going “right”.
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
That is, to put it mildly, a pretty strong claim, and one I don’t think the rest of your post really justifies. Without that justification, the post is still just listing a theoretical thing to worry about.
You’re completely right. If you don’t believe it, this post isn’t really trying to update you. This is more to serve as a coordination mechanism for the people who do think the rest isn’t very difficult (which I am assuming is a not-small-number).
Note that I also don’t think the actions advocated by the post are suboptimal even if you only place 3-7 years at 30% probability.
I’m a little worried about what might happen if different parts of the community end up with very different timelines, and thus very divergent opinions on what to do.
It might be useful if we came up with some form of community governance mechanism or heuristics to decide when it becomes justified to take actions that might be seen as alarmist by people with longer timelines. On the one hand, we want to avoid stuff like the unilateralist’s curse, on the other, we can’t wait for absolutely everyone to agree before raising the alarm.
One probably-silly idea: maybe we could do some kind of trade. Long-timelines people agree to work on short-timelines people’s projects for the next 3 years. Then, if the world isn’t destroyed, the short-timelines people work on the long-timelines people’s projects for the following 15 years. Or something.
My guess is that the details are too fraught to get something like this to work (people will not be willing to give up so much value), but maybe there’s a way to get it to work.
I don’t know enough to evaluate this post. I don’t know if it is correct or not. However, a completely convincing explanation could possibly shorten timelines. So is that satisfying? Not really. But the universe doesn’t have to play nice.
One year later, in light of ChatGPT’s internet and shell plugins, do you still think this is just a theoretical thing? Should we worry about it yet?
The fire alarm sentiment seems to have been entirely warranted, even if the plan proposed in the post wouldn’t have been helpful.
As a non-expert, I’m confused about what exactly was so surprising in the works which causes a strong update. “The intersection of many independent, semi-likely events is unlikely” could be one answer, but I’m wondering whether there is more to it. In particular, I’m confused why the data is evidence for a fast take-off in contrast to a slow one.
First, I mistitled the post, and as a result your response is very reasonable. This is less clearly evidence for “fast takeoff” and more clearly evidence for “fast timelines”.
In terms of why: the different behaviors captured in the papers constitute a large part of what you’d need to implement something like AlphaGo in a real-world environment. Will stitching them together work immediately? Almost certainly not. Will it work given not-that-much creative iteration, say over 5 years of parallel research? It seems not unlikely; I’d give it >30%.
You can edit the post title.
Done! Do you think it should be edited further?
No, this seems to capture it. No need to make it complicated.
A fire alarm approach won’t work because you would have people like Elon Musk and Mark Zuckerberg saying that we should be developing AI faster than we currently are. What I suggest should happen instead is that the EA community should try to convince a subset of people that AI risks are 80%+ of what we should care about, and if you donate to charity most of it should go to an AI risk organization, and if you have the capacity to directly contribute to reducing AI risk that is what you as a moral person should devote your life to.
I don’t think donating to other organizations is meaningful at this point unless those organizations have a way to spend a large amount of capital.
Both Musk and Zuckerberg are convinceable, they’re not insane, you just need to find the experts they’re anchoring on. Musk in particular definitely already believes the thesis.
Additional money would help, as evidenced by my son’s job search. My 17-year-old son is set to graduate college at age 18 from the University of Massachusetts at Amherst (where we live), majoring in computer science and concentrating in statistics and machine learning. He is looking for a summer internship. He would love to work in AI safety (and through me has known and been interested in the field since a very young age), and while he might end up getting a job in the area, he hasn’t yet. In a world where AI safety is well funded, every AI safety organization would be trying to hire him. In case any AI safety organizations are reading this, you can infer his intelligence from him having gotten 5s on the AP Calculus BC and AP Computer Science A exams in 7th grade. I have a PhD in economics from the University of Chicago and a JD from Stanford, and my son is significantly more intelligent than I am.
Tell him to submit an application here, if he hasn’t already. These guys are competent and new.
I’ve heard the story told that Beth Barnes applied to intern at CHAI, but that they told her they didn’t have an internship program. She offered to create one and they actually accepted her offer.
I’m setting up AI Safety Australia and New Zealand to do AI safety movement-building (not technical research). We don’t properly exist yet (I’m still only on a planning grant), we don’t have a website and I don’t have funding for an internship program, but if someone were crazy enough to apply anyway, then I’d be happy that they reached out. They’d have to apply for funding so that I can pay them (with guidance).
I’m sure he can find access to better opportunities, but just thought I’d throw this out there anyway as there may be someone who is agenty, but can’t access the more prestigious internships.
Funding is not literally the only constraint; organizations can also have limited staff time to spread across hiring, onboarding, mentoring, and hopefully also doing the work the organization exists to do! Scaling up very quickly, or moderately far, also has a tendency to destroy the culture of organizations and induce communications problems at best or moral mazes at worst.
Unfortunately “just throw money at smart people to work independently” also requires a bunch of vetting, or the field collapses as an ocean of sincere incompetents and outright grifters drown out the people doing useful work.
That said, here are a couple of things for your son—or others in similar positions—to try:
https://www.redwoodresearch.org/jobs (or https://www.anthropic.com/#careers, though we don’t have internships)
Write up a proposed independent project, then email some funders about a summer project grant. Think “implement a small GPT or EfficientZero, apply it to a small domain like two-digit arithmetic, and investigate a restricted version of a real problem (in interpretability, generalization, prosaic alignment, etc.)”.
You don’t need anyone’s permission to just do the project! Funding can make it easier to spend a lot of time on it, but doing much smaller projects in your free time is a great way to demonstrate that you’re fundable or hirable.
There is at least $10B that could straightforwardly be spent on AI safety. If these organizations are limited on money instead of logistical bandwidth, they should ping OpenPhil/FTX/other funders. Individuals’ best use of their time is probably on actual advocacy rather than donation.
I believe we are in the place we are in because Musk is listening and considering the arguments of experts. Contra Yudkowsky, there is no Correct Contrarian Cluster: while Yudkowsky and Bostrom make a bunch of good and convincing arguments about the dangers of AI and the alignment problem and even shorter timelines, I’ve always found any discussion of human values or psychology or even how coordination works to be one giant missing mood.
(Here’s a tangential but recent example: Yudkowsky wrote his Death with Dignity post. As far as I can tell, the real motivating point was “Please don’t do idiotic things like blowing up an Intel fab because you think it’s the consequentialist thing to do because you aren’t thinking about the second order consequences which will completely overwhelm any ‘good’ you might have achieved.” Instead, he used the Death with Dignity frame which didn’t actually land with people. Hell, my first read reaction was “this is all bullshit you defeatist idiot I am going down swinging” before I did a second read and tried to work a defensible point out of the text.)
My model of what happened was that Musk read Superintelligence, thought: this is true, this is true, this is true, this point is questionable, this point is total bullshit...how do I integrate all this together?
When you’re in takeoff, it doesn’t really matter whether people sprint to AGI or not, because either way we know lots of teams will eventually get there. We don’t have good reason to believe that the capability is likely to be restricted to one group, especially given that they don’t seem to be using any secret sauce. We also don’t seem at all likely to be within the sort-of alignment regime where a pivotal act isn’t basically guaranteed to kill everyone.
All the organizations that think about AGI know the situation, or will figure it out quickly, whether or not the EAs say something. If we do nothing, they will not go through the logic of “what reward do I give it” until right before they hit run. That is, unless you do the public advocacy first.
Have you thought about writing up a post suggesting this on the EA forum and then sharing it in various EA groups? One thing I’d be careful about, though, is writing “reducing AI risk is what you as a moral person should devote your life to”, as it could offend people committed to other cause areas. I’m sure you’d be able to find a way to word it better though.
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
Given that it looks like (from your Elaboration) language models will form the cores of future AGIs, and human-like linguistic reasoning will be a big part of how they reason about goals (like in the “Long sequences of robot actions generated by internal dialogue” example), can’t we just fine-tune the language model by training it on statements like “If (authorized) humans want to turn me off, I should turn off”?
Maybe we can even fine-tune it with statements describing our current moral beliefs/uncertainties and examples of moral/philosophical reasoning, and hope that the AGI will learn morality from that, like human children (sometimes) do. Obviously it’s very risky to take a black-box approach where we don’t really understand what the AI has learned (I would much prefer if we could slow things down enough to work out a white-box approach), but it seems like there’s maybe a 20% chance we can just get “lucky” this way?
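A hypothetical sketch of what such a fine-tuning corpus might look like, rendered as plain next-token-prediction data. The format and example statements are made up for illustration; real fine-tuning datasets and APIs differ.

```python
import json

# Hypothetical corrigibility statements of the kind described above.
corrigibility_statements = [
    "If authorized humans want to turn me off, I should turn off.",
    "I should defer to human oversight on high-stakes actions.",
    "I should report my uncertainty rather than act on a guess.",
]

def to_training_records(statements):
    # To the training objective these are only strings to be predicted
    # token by token; nothing here intrinsically links "should" to goals.
    return [json.dumps({"text": s}) for s in statements]

for record in to_training_records(corrigibility_statements):
    print(record)
```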
Why would that make it corrigible to being turned off? What does the word “should” in the training data have to do with the system’s goals and actions? The AI does not want to do what it ought (where by “ought” I mean the thing AI will learn the word means from human text). It won’t be motivated by what it “should” do any more than by what it “shouldn’t” do.
This is a fundamental flaw in this idea; it is not repairable by tweaking the prompt. The word “should” will just have literally nothing whatsoever to do with what the AI is optimizing for (or even what it’s optimized for). DALL-E doesn’t make pictures because it “should” do that; it makes pictures because of where gradient descent took it.
Like, best-case scenario, it repeats “I should turn off” as it kills us.
Wei is correct: current LLMs are 100% corrigible. Large language models are trained on so-called self-supervised objective functions to “predict the next word” (or sometimes, predict the masked word). If we’d like them to provide a particular output, all we need to do is include that response in the training data. Through the training process, the model naturally learns to agree with its input data.
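A toy sketch of that objective (illustrative numbers, not a real model): the model scores each candidate next token, and training minimizes the negative log-probability of the token that actually occurred, so including a response in the training data pushes the model toward producing it.

```python
import math

def softmax(logits):
    """Convert raw token scores into a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def next_token_loss(logits, target):
    """Self-supervised objective: -log P(actual next token)."""
    return -math.log(softmax(logits)[target])

# Illustrative scores a model might assign after the prefix "I should turn".
logits = {"off": 2.0, "on": 0.5, "around": -1.0}
# Training on "...I should turn off" lowers this loss, which is the only
# sense in which the model "agrees with" its training data.
print(round(next_token_loss(logits, "off"), 3))  # → 0.241
```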
The problem of (in)corrigibility, as formalized by this MIRI paper, is our potential (in)ability to turn off an AI agent. But the paper only concerns agents, which language models are not. RL agents pose the potential for self-preservation, but self-supervised language models are more akin to “oracle” AIs that merely answer questions without a broader goal in mind.
Now, the most compelling stories of AI doom combine language processing with agentic optimization. These agents could be incorrigible and attempt self-preservation, potentially at the expense of humanity. Unfortunately most work on this topic has been theoretical—I would love to see an empirical demonstration of incorrigible self-preservation behavior by an RL agent.
Maybe!
We should be thinking through the problems with that!
It’s not the worst idea in the world!
EDIT: note that the obvious failure mode here is “but then you retrain the language model as part of your RL loop, and language loses its meaning, and then it does something evil and then everyone dies.” So everyone still has to not do that! But this makes me think that ~human-level alignment in controlled circumstances might not be impossible.
It doesn’t change the fact that if anyone thinks it would be fun to fine-tune the whole model on an RL objective, we’d still lose. So we have to do global coordination ASAP. (It does not seem at all likely that Yudkowskian pivotal acts can be achieved solely through the logic learned from getting ~perfect LLM accuracy on the entire internet, since all proposed pivotal acts require new concepts no one has ever discussed.)
Yeah, don’t do RL on it, but instead use it to make money for you (ethically) and at the same time ask it to think about how to create a safe/aligned superintelligent AGI. You may still need a big enough lead (to prevent others doing RL outcompeting you) or global coordination but it doesn’t seem obviously impossible.
Pretty much. I also think this plausibly buys off the actors who are currently really excited about AGI. They can make silly money with such a system without the RL part—why not do that for a while, while mutually-enforcing the “nobody kill everyone” provisions?
Even if I assume this all goes perfectly, would you want a typically raised teenager (or adult) to have ~infinite power to change anything they want about humanity? How about a philosopher? Do you know even 10 people who you’ve seen what decisions they’ve advocated for and you’d trust them with ~infinite power?
Hey!
One of the problems in statements like these is that they are not well defined, and if you let a human (like me) read them, I will automatically fill in the blanks, probably to match your own intuition.
As examples of problems:
“I should turn off”
Who is “I”? What if the AI makes another one?
What is “should”? Does the AI get utility from this or not? If so, will the AI try to convince the humans to turn it off? If not, will the AI try to prevent humans from WANTING to turn it off?
It’s way too late for the kind of top-down capabilities regulation Yudkowsky and Bostrom fantasized about; Earth just doesn’t have the global infrastructure. I see no benefit to public alarm—EA already has plenty of funding.
We achieve marginal impact by figuring out concrete prosaic plans for friendly AI and doing outreach to leading AI labs/researchers about them. Make the plans obviously good ideas and they will probably be persuasive. Push for common-knowledge windfall agreements so that upside is shared and race dynamics are minimized.
Earth does have the global infrastructure; we just don’t have access to it because we have not yet persuaded a critical mass of experts. AWS can just stop anyone from renting GPUs without their code being checked, and beyond that, if you can create public consensus via iteratively refined messaging, you can make sure everyone knows the consequences of doing it.
People should absolutely be figuring out prosaic plans, and core alignment researchers probably shouldn’t stop doing their work. However, it’s simply not true that all capable labs (or those that will be capable soon) will even take a meeting with AI safety people, given the current belief environment. E.g. who do you call at BAAI?
It does? What do you mean? The only thing I can think of is the UN, and recent events don’t make it very likely they’d engage in coordinated action on anything.
If you convince the CCP, the US government, and not that many other players that this is really serious, it becomes very difficult to source chips elsewhere.
The CCP and the US government both make their policy decisions based on whatever (a weirdly-sampled subset of) their experts tell them.
Those experts update primarily on their colleagues.
So we just need to get two superpowers who currently feel they are in a zero-sum competition with each other to stop trying to advance in an area that gives them a potentially infinite advantage? Seems like a very classic case of the kind of coordination problem that is difficult to solve, with high rewards for defecting.
We have partially managed to do this for nuclear and biological weapons, but only with a massive oversight infrastructure that doesn’t exist for AI, and by relying on physical evidence and materials controls that don’t exist for AI either. It’s not impossible, but it would require a level of concerted international effort similar to what was used for nuclear weapons, which took a long time, so it possibly doesn’t fit your short timeline.
If we do as well with preventing AGI as we have with nuclear non-proliferation, we fail. And, nuclear non-proliferation has been more effective than some other regimes (chemical weapons, drugs, trade in endangered animals, carbon emissions, etc.). In addition, because of the need for relatively scarce elements, control over nuclear weapons is easier than control over AI.
And, as others have noted, the incentives for developing AI are far stronger than for developing nuclear weapons.
What makes you think we fail if it looks like nukes? If everyone agrees on alignment difficulty and we have few actors, it is not unreasonable for no one to push the button, just like they don’t with MAD.
There are currently nine countries who have deployed nuclear weapons. At least four of those nine are countries that the non-proliferation regime would have preferred to prevent having nuclear weapons.
An equivalent result in AGI would be four entities deploying AGI. (And in the AGI context, the problem is deployment, not using the AGI in any particular way.)
Note that 8 of those countries have never used nukes, and all 9 of them if you start counting after the IAEA was founded.
Most people think if 500 entities had nukes, they would be used more. But with few, MAD can work. AGI doesn’t have MAD, but it has a similar dynamic if you convince everyone of the alignment problem.
But… there isn’t reward for defecting? Like, in a concrete actual sense. The only basis for defection is incomplete information. If people think there is a reward, they’re in some literal sense incorrect, and the truth is ultimately easier to defend. Why not (wisely, concertedly, principledly) defend it?
And there are extremely concrete reasons to create that international effort for oversight (e.g. of compute), given convergence on the truth. The justifications, conditioned on the truth, are at least as great if not greater than the nuclear case.
The reward is not the creation of uncontrolled AGI. The reward is the creation of powerful not-yet-AGI systems which can drastically accelerate a country’s technical, scientific, or military progress.
That’s a pretty huge potential upside, and the consequences of another superpower developing such technology could be catastrophic. So countries have both a reward for defecting and a risk of losing everything if the other country defects.
Yes, such an “AI race” is very dangerous. But so was the nuclear arms race, and countries still ran it.
Oh I don’t think anyone is going to be convinced not to build not-yet-AGI.
But it seems totally plausible to convince people not to build systems that they think have a real possibility of killing them, which, again, consequentialists will avoid building because we don’t know how to build an off-switch.
Is there any concrete proposal that meets your specification? “don’t kill yourself with AGI, please”?
Prevent agglomerations of data center scale compute via supply chain monitoring, do mass expert education, create a massive social stigma (like with human bio experimentation), and I think we buy ourselves a decade easily.
How does that distinguish between AGI and not-yet-AGI? How does that prevent an arms race?
An arms race to what? If we alignment-pill the arms-racers, they understand that pushing the button means certain death.
If your point is an arms race on not-unbounded-utility-maximizers, yeah afaict that’s inevitable… but not nearly as bad?
Pushing which button? They’re deploying systems and competing on how capable those systems are. How do they know the systems they’re deploying are safe? How do they define “not-unbounded-utility-maximizers” (and why is it not a solution to the whole alignment problem)? What about your “alignment-pilled” world is different from today’s world, wherein large institutions already prefer not to kill themselves?
Wait, there are lots of things that aren’t unbounded utility maximizers—just because they’re “uncompetitive” doesn’t mean that non-suicidal actors won’t stick to them. AlphaGo isn’t! The standard LessWrong critique is that such systems don’t provide pivotal acts, but the whole point of governance is not to need to rely on pivotal acts.
The difference from this world is that in this world large institutions are largely unaware of alignment failure modes and will thus likely deploy unbounded utility maximizers.
So you have a crisp concept called “unbounded utility maximizer” so that some AI systems are, some AI systems aren’t, and the ones that aren’t are safe. Your plan is to teach everyone where that sharp conceptual boundary is, and then what? Convince them to walk back over the line and stay there?
Do you think your mission is easier or harder than nuclear disarmament?
The alignment problem isn’t a political opinion, it’s a mathematical truth. If they understand it, they can and will want to work the line out for themselves, with the scientific community publicly working to help any who want it.
Nuclear disarmament is hard because if someone else defects you die. But the point here is that if you defect you also die. So the decision matrix on the value of defecting is different, especially if you know other people also know their cost of defection is high.
If you launch the nukes, you also die, and we spend a lot of time worrying about that. Why?
We actually don’t worry about that that much. Nothing close to the 60s, before the IAEA and second-strike capabilities. These days we mostly worry about escalation cycles, i.e. unpredictable responses by counterparties to minor escalations and continuously upping the ante to save face.
There isn’t an obvious equivalent escalation cycle for somebody debating with themselves whether to destroy themselves or not. (The closer we get to alignment, the less true this is, btw.)
FYI, him having that responsibility would seemingly entail a conflict of interest; he said in an interview:
What makes you say this? What should I read to appreciate how big a deal for AGI the recent papers are?
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
I don’t have the bandwidth to write this up, though I strongly encourage someone else to.
To be blunt, I don’t believe that you have so little bandwidth given the stakes. If timelines are this short, movement strategy has to pivot considerably, and this requires everyone knowing the evidence. Such a writeup could be on the critical path for the entire movement.
Fair enough. Realize this is a bit of an infohazard. Basically, consider the pieces needed to build EfficientZero with language as the latent space, and then ask yourself which of those pieces hasn’t been shown to basically work in the last week.
[Before you point out the limitations of EfficientZero: i know. But rather than spelling them out, consider whether you can find any other recent papers that suggest how to solve those problems. Actually giving irresponsible readers a research plan is not a good idea.]
Then you’re basically at a dog (minus physical locomotion/control). It is very hard to predict what you will be at if you scale 3 more OOMs, via Moore’s law or organizational intent.
You’ve already posted this, but for the future, I’d suggest checking with the mods first. Once something has been posted, it can’t be removed without creating even more attention.
What exactly do you mean by “we are now in a fast takeoff”? (I wouldn’t say we’re in a fast takeoff until AI systems are substantially accelerating improvement in AI systems, which isn’t how I’d characterize the current situation.)
I might be abusing the phrase? We are in a “we should probably have short-timelines, we can see the writing on the wall of how these systems might be constructible” situation, but not in a literal “self-improvement” situation.
Is that called something different?
There have, over the decades, been plans for “self improving artificial general intelligence” where the AGI’s cleverness is aimed directly at improving the AGI’s cleverness, and the thought is that maybe this will amplify, like neutron cascades in fission or an epidemiological plague with sick people causing more sick people in a positive feedback loop.
Eurisko was a super early project sort of along these lines, with “heuristics” built out of “heuristics” by “heuristics”.
The idea of “fast takeoff” imagines that meta learning about meta learning about meta learning might turn out to admit of many large and pragmatically important insights about “how to think real good” that human culture hasn’t serially reached yet, because our brains are slow and our culture accretes knowledge in fits and starts and little disasters each time a very knowledgeable genius dies.
“Fast takeoff” is usually a hypothetical scenario where self-improvement that gets exponentially better turns out to be how the structure of possible thinking works, and something spends lots of serial steps (for months? or for hours in a big datacenter?) seeming to make “not very much progress” because it is maybe (to make an example up) precomputing cache lookups for generic patterns by which Turing machines can be detected to have entered loops or not, and then in the last hour (or the last 10 seconds in a big datacenter) it just… does whatever it is that an optimal thinker would do to make the number go up.
This is distinct from “short timelines” because a lot of people have thought for the last 20 years that AGI might be 200 years away, or just impossible for humans, or something. For example… Andrew Ng is a famous idiot (who taught a MOOC on fancy statistics back in the day and then parlayed MOOC-based fame into a fancy title at a big tech company) who quipped that worrying about AI is like worrying about “overpopulation on Mars”. In conversations I had, almost everyone smart already knew that Andrew Ng was being an idiot here on the object level… (though maybe actually pretty smart at tricking people into talking about him?) but a lot of “only sorta smart” people thought it would be hubristic to just say that he was wrong, and so took it kind of on faith that he was “object level correct”, and didn’t expect researchers to make actual progress on actual AI.
But progress is happening pretty fast from what I can tell. Whether the relatively-shorter-timelines-than-widely-expected thing converts into a faster-takeoff-than-expected remains to be seen.
I hope not. I think faster takeoffs convergently imply conflicts of interest, and that attempts to “do things before they can be blocked” would mean that something sorta like “ambush tactics” was happening.
A couple more thoughts on this post which I’ve spent a lot of today thinking about and discussing with folks:
This post was good for generating a lot of discussion and engagement on the topic, but it’d be great to have some more careful, thorough systematic analysis of the arguments and implications presented. This post seems to be arguing for short timelines and at least a medium-fast takeoff (which I tend to agree with), but then it argues for mass advocacy as a result.
This is the opposite of the kind of intervention that makes sense to me, and that Holden argues for in this kind of takeoff scenario: ‘Faster and less multipolar takeoff dynamics tend to imply that we should focus on very “direct” interventions aimed at helping transformative AI go well: working on the alignment problem in advance, caring a lot about the cultures and practices of AI labs and governments that might lead the way on transformative AI, etc.’ This is from his Important, actionable questions for the most important century doc.
A more careful and complete analysis should be framed in answer to the ‘Questions about AI “takeoff dynamics”’ from that doc by someone who can commit the time and thought to it.
While we should look for strong evidence and strong arguments about timelines and takeoff, we shouldn’t be surprised not to be able to arrive at consensus about it. Given that this post is about pulling the “fire alarm” I’m kind of surprised no one here has linked yet to MIRI’s very aptly titled There’s No Fire Alarm for Artificial General Intelligence:
”When I observe that there’s no fire alarm for AGI, I’m not saying that there’s no possible equivalent of smoke appearing from under a door. What I’m saying rather is that the smoke under the door is always going to be arguable; it is not going to be a clear and undeniable and absolute sign of fire; and so there is never going to be a fire alarm producing common knowledge that action is now due and socially acceptable. [...] There is never going to be a time before the end when you can look around nervously, and see that it is now clearly common knowledge that you can talk about AGI being imminent, and take action and exit the building in an orderly fashion, without fear of looking stupid or frightened.”
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
I strongly disagree that there is a >10% chance of AGI in the next 10 years. I don’t have the bandwidth to fully debate the topic here and now, but some key points:
My comment EA has unusual beliefs about AI timelines and Ozzie Gooen’s reply
Why I have longer timelines than the BioAnchors report
Two other considerations driving me towards longer timelines
Of the news in the last week, PaLM definitely indicates faster language model progress over the next few years, but I’m skeptical that this will translate to success in the many domains with sparse data. Holden Karnofsky’s timelines seem reasonable to me, if a bit shorter than my own:
Pulling from those comments, you said:
A lot of prominent scientists, technologists and intellectuals outside of EA have warned about advanced artificial intelligence too. Stephen Hawking, Elon Musk, Bill Gates, Sam Harris, everyone on this open letter back in 2015 etc.
I agree that the number of people really concerned about this is strikingly small given the emphasis longtermist EAs put on it. But I think these many counter-examples warn us that it’s not just EAs and the AGI labs being overconfident or out of left field.
I know you said you don’t have time to fully debate this. This seemed to be one of the cruxes of your first bullet point though. So if your skepticism about short timelines is driven in a big way by thinking that no credible person outside EA or companies invested in AI think this is plausible, then I am curious what you make of this.
Hey Evan, thanks for the response. You’re right that there are circles where short AI timelines are common. My comment was specifically about people I personally know, which is absolutely not the best reference class. Let me point out a few groups with various clusters of timelines.
Artificial intelligence researchers are a group of people who believe in short to medium AI timelines. Katja Grace’s 2015 survey of NIPS and ICML researchers provided an aggregate forecast giving a 50% chance of HLMI occurring by 2060 and a 10% chance of it occurring by 2024. (Today, seven years after the survey was conducted, you might want to update against the researchers that predicted HLMI by 2024.) Other surveys of ML researchers have shown similarly short timelines. This seems as good of an authority as any on the topic, and would be one of the better reasons to have relatively short timelines.
What I’ll call the EA AI Safety establishment has similar timelines to the above. This would include decision makers at OpenPhil, OpenAI, FHI, FLI, CHAI, ARC, Redwood, Anthropic, Ought, and other researchers and practitioners of AI safety work. As best I can tell, Holden Karnofsky’s timelines are reasonably similar to the others in this reference group, including Paul Christiano and Rohin Shah (would love to add more examples if anybody can point to them), although I’m sure there are plenty of individual outliers. I have a bit longer timelines than most of these people for a few object level reasons, but their timelines seem reasonable.
Much shorter timelines than the two groups above come from Eliezer Yudkowsky, MIRI, many people on LessWrong, and others. You can read this summary of Yudkowsky’s conversation with Paul Christiano, where he does not quantify his timelines but consistently argues for faster takeoff speeds than Christiano believes are likely. See also this aggregation of the five most upvoted timelines from LW users, with a median of 25 years until AGI. That is 15 years sooner than Holden Karnofsky and 15 years sooner than Katja Grace’s survey of ML researchers. This is the group of scenarios that I would most strongly disagree with, appealing to both the “expert” consensus and my object level arguments above.
The open letter from FLI does not mention any specific AI timelines at all. These individuals all agree that the dangers from AI are significant and that AI safety research is important, but I don’t believe most of them have particularly short timelines. You can read about Bill Gates’s timelines here, he benchmarks his timelines as “at least 5 times as long as what Ray Kurzweil says”. I’m sure other signatories of the letter have talked about their timelines, I’d love to add these quotes but haven’t found any others.
Overall, I’d still point to Holden Karnofsky’s estimates as the most reasonable “consensus” on the topic. The object-level reasons I’ve outlined above are part of the reason why I have longer timelines than Holden, but even without those, I don’t think it’s reasonable to “pull the short timelines fire alarm”.
2015 feels decades ago though. That’s before GPT-1!
I would expect a survey done today to have more researchers predicting 2024. Certainly I’d expect a median before 2060! My layman impression is that things have turned out to be easier to do for big language models, not harder.
The surveys urgently need to be updated.
This was heavily upvoted at the time of posting, including by me. It turns out to be mostly wrong. AI Impacts just released a survey of 4271 NeurIPS and ICML researchers conducted in 2021 and found that the median year for expected HLMI is 2059, down only two years from 2061 since 2016. Looks like the last five years of evidence hasn’t swayed the field much. My inside view says they’re wrong, but the opinions of the field and our inability to anticipate them are both important.
https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/
Appreciate the super thorough response, Aidan. You’re right it turns out that some of the people I mentioned like Gates who are concerned about AI aren’t on record with particularly short timelines.
I agree with FeepingCreature’s comment that some of these surveys and sources are starting to feel quite dated. Apparently GovAI is currently working on replicating the 2015 Grace et al. survey which I’m very much looking forward to. The Bostrom survey you linked to is even older than that—from 2012~13. At least the Gruetzemacher survey is from 2018.
(tangent) One thing I see in some of the surveys as well as in discussion that bothers me is an emphasis on automation of 100% of tasks or 99% of tasks. I think Holden made an important point in his discussion of a weak point in his Most Important Century series, that transformative AI need not depend on everything being automated. In fact, just having full-automation of a small number of specific activities could be all that’s needed:
You have my upvote for publicly disagreeing and explaining why <3
(even though I personally don’t agree with your conclusion, I really want to encourage people to discuss these opinions and not keep them “bottled up” (especially some really smart people I know that I wish were working on AI Safety!!..))
I just submitted a reply to the shortform you linked. Basically, those are thoughtful data points, but I don’t think they show that timelines are likely to be longer. An alternate interpretation is that hard takeoff will be more likely than soft takeoff.
Also, probably just a typo from your comment above, but did you mean these two considerations are driving you towards longer timelines? (rather than “shorter” timelines)
Thanks, fixed.
Note that Metaculus predictions don’t seem to have been meaningfully changed in the past few weeks, despite these announcements. Are there other forecasts which could be referenced?
This post is mainly targeted at people capable of forming a strong enough inside view to get them above 30% without requiring a moving average of experts, which may take months to update (since it’s a popular question).
For everyone else, I don’t think you should update much on this except vis a vis the number of other people who agree.
Actually, the Metaculus community prediction has a recency bias:
> approximately sqrt(n) new predictions need to happen in order to substantially change the Community Prediction on a question that already has n players predicting.
In this case, with n=298, the prediction should change substantially after sqrt(n) ≈ 17 new predictions (usually this takes up to a few days). Over the past week, there were almost this many predictions, and the AGI community median has shifted 2043 → 2039; the 30th percentile is 8 years out.
My impression (based on using Metaculus a lot) is that, while questions like this may give you a reasonable ballpark estimate and it’s great that they exist, they’re nowhere close to being efficient enough for it to mean much when they fail to move. As a proxy for the amount of mental effort that goes into it, there’s only been three comments on the linked question in the last month. I’ve been complaining about people calling Metaculus a “prediction market” because if people think it’s a prediction market then they’ll assume there’s a point to be made like “if you can tell that the prediction is inefficient, then why aren’t you rich, at least in play money?” But the estimate you’re seeing is just a recency-weighted median of the predictions of everyone who presses the button, not weighted by past predictive record, and not weighted by willingness-to-bet, because there’s no buying or selling and everyone makes only one prediction. It’s basically a poll of people who are trying to get good results (in terms of Brier/log score and Metaculus points) on their answers.
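The real Metaculus aggregation is internal to the site; purely as a sketch of the “recency-weighted median of everyone who presses the button” idea (the decay constant and numbers are made up), here is how a recent cluster of predictions can move the aggregate while older ones fade:

```python
def recency_weighted_median(predictions, decay=0.9):
    """Weighted median where newer predictions (later in the list) count more.

    predictions: oldest first. The weighting scheme is hypothetical,
    not Metaculus's actual algorithm.
    """
    n = len(predictions)
    weighted = sorted((p, decay ** (n - 1 - i)) for i, p in enumerate(predictions))
    total = sum(w for _, w in weighted)
    acc = 0.0
    for p, w in weighted:
        acc += w
        if acc >= total / 2:
            return p
    return weighted[-1][0]

# Thirty older predictions of 2060 versus twelve recent ones at 2040:
# the recent cluster dominates despite being a minority of predictors.
print(recency_weighted_median([2060] * 30 + [2040] * 12))  # → 2040
```

With decay set to 1.0 this reduces to an ordinary median, which is one way to see why "fails to move" is weaker evidence than it would be in a market with real buying and selling.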
I suspect that these developments look a bit less surprising if you’ve been trying to forecast progress here, and so might be at least partially priced in. Anyhow, the forecast you linked to shows >10% likelihood before spring 2025, three years from now. That’s extraordinarily aggressive compared to (implied) conventional wisdom, and probably a little more aggressive than I’d be as an EA AI prof with an interest in language models and scaling laws.
Thanks, yeah now that I look closer Metaculus shows a 25% cumulative probability before April 2029, which is not too far off from OP’s 30% claim.
Note the answer changes a lot based on how the question is operationalized. This stronger operationalization has dates around a decade later.
This is wrong. Crying wolf is always a thing.
You’ve declared that you’ll turn out “obviously” right about “the big changes”, thus justifying whatever alarm-sounding you do. But saying these innovations will have societal impacts is very different from a 30% chance of catastrophe. Lots of things have “societal impact”.
You haven’t mentioned any operationalized events to even be able to make it “obvious” whether you were wrong. Whatever happens, in a few years you’ll rationalize you were “basically correct” or whatever. You’ll have baseball fields worth of wiggle room. Though there are ways you could make this prediction meaningful.
In the vast majority of worlds, nothing catastrophic happens anytime soon. Those are worlds where it’s indeed plausible to blow capital or reputation on something that turned out to be not that bad. I.e. “crying wolf” is indeed a thing.
I give <1% chance we are currently “in a fast takeoff”, but that forecast doesn’t mean anything until we have something unambiguous to score anyway. For me the high upvote count is a downward update on the expected competence of the LW audience.
You should remember that the post has 117 karma and 124 votes, which means it’s highly controversial for this forum. The average tends to be something like 2:1 as a ratio, given a mix of high-karma people strong upvoting and regular upvoting.
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
(Not because I agree with your probabilities, I still don’t, but because I agree with your criticism on my lack of operationalization, and general vagueness about possible negative consequences.)
I have switched to an upvote! EDIT: Ah I see you are open to one of the bet offers, great!
These sorts of models all seem to be heavily dependent on “borrowing” a ton of intelligence from humans. To me they don’t seem likely to be capable of gaining any new skills that humans don’t already possess and give lots of demonstrations of. As such they don’t really seem to be FOOMy to me.
Also they’re literally reliant on human language descriptions of what they’re gonna do and why they’re gonna do it.
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
To clarify I think this is clearer evidence of ~AGI than FOOM. If anything, AGI seeming this straightforward might be an update against “ASI follows quickly from AGI”, but then again it’s not clear.
Insofar as totally new skills in a domain without data is basically as hard as RL, RL still doesn’t really work? Could this path towards AGI working be seen to provide evidence that RL might just be computationally as hard as brute force search, and that the existence of human intelligence shouldn’t make us think there’s some secret sauce we haven’t found?
(TBC you can still cause civilization-ending damage with large clusters of Von Neumann-esque AGIs, they don’t have to FOOM quickly.)
I have some concern that AI risk advocacy might lead someone to believe the “AI is potentially civilization-altering and fast-takeoff is a real possibility within 10 years” part but not the “alignment is really, really, really hard” part.
I imagine you can see how that might lead to undesirable outcomes.
Absolutely. But also, I think it strains credulity to believe that rationalists or EAs have a monopoly on the first concept—the sharp AI people already know, and everyone else will know in a few years due to people who don’t care about infohazards.
I have a hard time seeing the reason not to at least raise the salience of alignment. (That said, being unopinionated about timelines publicly while letting your knowledge guide your actions is undoubtedly the better choice.)
epistemic status: uncertain.
Respectfully, I disagree with the assertion that “the sharp AI people already know”
I feel that the idea of superhuman AI is a mental block, a philosophical blind spot for many people. It’s not about infohazards—the information is already out there. Nick Bostrom wrote “Superintelligence: Paths, Dangers, Strategies” in 2014. And EY has been talking about it for longer than that. But not everyone is capable of seeing that conclusion as true.
Possibly many of the sharpest AI people do already know, or are at least aware of the possibility (“but the AI I’m working on wouldn’t do that, because...” [thought sequence terminated]). But I don’t think that information has become general knowledge, at all. It is still very much up in the air.
Ok, is raising the salience of alignment the strategy you are proposing? I guess it seems sensible?
I interpreted the “Do not make it easier for more people to build such systems. Do not build them yourself.” as “Let’s try to convince people that AI is dangerous and they should stop building AI”, which seems to me either unlikely to work, or something that might backfire (“let’s build superhuman AI first, before our enemies do”).
(I’m not 100% certain of it. Maybe mass advocacy for AI risks would work. But I have a… bad feeling about it).
My current, unfinished idea for a solution would be:
reach out to existing AI projects. Align them towards AI safety, as much as possible anyway.
solve the AI alignment problem
launch FAI before someone launches UFAI
(and yes, time crunch and arms race is probably not good for the kind of wisdom, restraint and safety culture required to build FAI)
(but if the alternative is “we all die”, this is the best take I have for now, even if this strategy in its current form may only have < 0.000000000000001% chance of succeeding)
Oh yeah 100%, “don’t unilaterally defect even though you might get lucky if you don’t defect” is not a good persuasion strategy.
“If you unilaterally defect you will definitely die, don’t do that to yourself, all the experts endorse” is a much more coherent argument. Idiots might not follow this advice, but if it can take $100M and a team of programmers, governments might be able to effectively prevent such a dumb accumulation of resources. Might. We have to try.
This post was only a little ahead of its time. The time is now. EA/LW will probably be eclipsed by wider public campaigning on this if they (the leadership) don’t get involved.
I think it’s a good conversation to be having. I really don’t want to believe we’re in a <5 year takeoff timeline, but honestly it doesn’t seem that far-fetched.
I’d put this 3-7 year thing at about 10%, maybe a bit less. So with roughly 10% probability, capabilities researchers should be doing different things (I would love to say “pivoting en masse to safety and alignment research,” but we’ll see; since a lot of it would be fake, we would perhaps need to reward or provide outlets for fake safety/alignment research). But EA orgs should still be focusing most attention on longer timescales and not going all-in.
I think if you have timelines of 3-7 years at 10%, and alignment research where it is, it’s hard to imagine what kinds of alignment improvements we would get that also facilitate pivotal acts by individual actors way in front. So global coordination starting as soon as possible is still a necessary precondition to avoiding doom, even if we get lucky and have 15 years.
Reflecting on this and other comments, I decided to edit the original post to retract the call for a “fire alarm”.
Something I personally find convincing and index pretty heavily on are surveys of people in the field. For example, this Bostrom survey seems to be a classic and says:
I recognize that it is almost 10 years old though. I also have more epistemic respect for Eliezer, and to a somewhat lesser extent MIRI, and so I weigh their stances correspondingly more heavily. It’s hard to know how much more heavily I should weigh them though.
Anyway, I think that surveys are probably one of the best persuasive tools we have. If lots of experts are saying the same thing, then, well, there’s probably something to it. At least that’s how I expect a lot of smart people to think.
This excerpt from HPMoR comes to mind regarding the value of presenting a united front.
It also makes me think of when a bunch of scientists all sign a letter endorsing some position.
Furthermore, I think that these surveys should, I’m not sure how to say this, but be pushed more heavily? Let me describe my own experience.
I’ve been hanging around LessWrong for a while. I’m sure there are a bunch of other more recent surveys than the Bostrom one. Actually, I know there are, I recall seeing them on LessWrong a bunch of times. But they don’t stick out to me. And when I encountered them, I remember myself finding them to be somewhat confusing. Maybe they should be located more prominently on websites like LessWrong and MIRI.
And maybe it’d be worth spending some time working on presenting it in a way that is more easily understood. Perhaps via some user research. I’m not in the field so others probably have an easier time parsing them than I do, but then again, Illusion of Transparency. Maybe explanations should in fact be aimed at someone like me.
This all assumes of course that experts are, in fact, mostly in agreement that the probability of things like a fast takeoff are high. If they are not, well, I’m not sure exactly how to proceed, but maybe some more discussion is in order.
Reading the update that you’ve retracted the fire alarm, I hope that you don’t stop thinking about this topic, as I think it would be highly valuable for people to think through whether there should be a fire alarm, who would be able to pull it, and what actions people should take. Obviously, you should work on this with collaborators and it should be somebody else who activates the fire alarm next time, but I still think you could make valuable contributions towards figuring out how to structure this. I suspect that there should probably be various levels of alert.
Tamay Besiroglu and I have replied to this post here with a proposal to bet up to $1000 about the claims made in this post.
I want to strongly endorse making competing claims; this was primarily intended as a coordination outlet for people who updated similarly to me, but that does not preclude principled arguments to the contrary, and I’m grateful to Matt and Tamay for providing some.
This week, the Journal of Moral Theology published a special issue on AI edited by Matthew Gaudet and Brian Patrick Green that is so important that the publishers have made it free to the public. It contributes well thought out insights about the potential implications of the decisions that will quickly roll towards us all. https://jmt.scholasticahq.com/issue/4236
Thank you for pointing this out, I’m curious about what they have to say.
I’d like to push back on this a little.
There are some fairly straightforward limitations on the types of algorithms that can be learned by current deep learning (look at TLM performance on variable-length arithmetic for a clear-cut example of basic functionality that these networks totally fail at) that would severely handicap a would-be superintelligence in any number of ways. There is a reason DeepMind programs MCTS into AlphaZero rather than simply having the network learn its own search algorithm in the weights—because MCTS is not in the region of algorithm space existing neural networks have access to.
I am pretty confident that you can’t get to AGI via just continuing to make these models bigger or by coming up with new Transformer-level architectural tricks. Going from GPT-2 to GPT-3 did not significantly improve performance on generalizing to new length arithmetic problems and there are strong theoretical reasons why it couldn’t possibly have done so.
That’s the good news, in my book.
The bad news is that I have absolutely no idea how hard those algorithmic limits are to solve, and there hasn’t been a serious push by the DL community to address them because until recently the problems of focus outside of RL haven’t required it. Maybe we’ll hit an AI-winter level wall on the way there. Maybe one big research paper comes out and all hell breaks loose as all of these systems unlock tremendous new capabilities overnight.
Hard to say. The performance you can get without access to wide swaths of the algorithmic space is scary as hell.
Is there any progress on causality, comprehension, or understanding of logic that doesn’t require an enormous amount of compute, the kind of compute that makes it seem like the problem is being solved without actual understanding?
Yann Lecun published a vision on how to build an autonomous system in February. Should folks have started considering alarm bells then? Have the recent results made Lecun’s vision seem more plausible now than it did back in February?
Wanna bet some money that nothing bad will come of any of this on the timescales you are worried about?
There’s a new post today specifying such bets.
To clarify, the above argument isn’t a slam dunk in favor of “something bad will happen because of AGI in the next 7 years”. I could make that argument, but I haven’t, and so I wouldn’t necessarily hope that you would believe it.
But what about the 10 years after that, if we still have no global coordination? What’s the mechanism by which nobody ever does anything dumb, with most people unaware of safety considerations and no restriction whatsoever on who gets to do what?
That seems like a hard bet to win. I suggest instead offering to bet on “you will end up less worried” vs “I will end up more worried”, though that may not work.
I don’t think it’s that hard e.g see here https://www.econlib.org/archives/2017/01/my_end-of-the-w.html
TLDR person who doesn’t think end of the world will happen gives other person money now and it gets paid back double if the world doesn’t end.
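To spell out the arithmetic of that bet structure (the payback ratio, interest rate, and horizon below are illustrative assumptions, not the terms from the linked post):

```python
def breakeven_doom_probability(payback_ratio=2.0, rate=0.05, years=7):
    """Minimum doom probability at which the 'doom' side of a
    money-now/money-later bet breaks even.

    The doomer receives $1 now and repays $payback_ratio at
    settlement if the world survives. The $1 received now can be
    invested at `rate`, so it is worth (1 + rate) ** years by then.
    Indifference: growth == payback_ratio * (1 - p).
    """
    growth = (1 + rate) ** years
    return 1 - growth / payback_ratio
```

With no interest the break-even is p = 0.5; with a 5% return over 7 years it drops to roughly 0.30, so a ~30% doom forecast is already enough to rationally take the doom side of a double-or-nothing bet at that horizon.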
If you think it’s a hard bet to win, you are saying you agree that nothing bad will happen. So why worry?
I meant it’s a hard bet to win because how exactly would I collect. That said, I’m genuinely not sure if it’s a good field for betting. Roughly speaking, there’s two sorts of bets: “put your money where your mouth is” bets and “hedging” bets. The former are “for fun” and signaling/commitment purposes; the latter are where the actual benefit comes in. But with both bets, it’s difficult to figure out a bet structure that works if the market gets destroyed in the near future! We could bet on confidence, but I’m genuinely not sure if there’ll be one or two “big papers” before the end shifting probabilities. So the model of the world might be one where we see nothing for years and then all die. Hard for markets to model.
Doing a “money now/money later” bet structure works, I guess, like the other commenter said, but I don’t know of any prediction markets that are set up for that.
Just commenting on the concept of “goals” and particularly the “off switch” problem: no AI system has (to my knowledge) run into this problem, which IMO strongly suggests that “goals” in this sense are not the right way to think about AI systems. AlphaZero in some sense has a goal of winning a Go game, but AlphaZero does not resist being turned off, and I claim its obvious that even a very advanced version of AlphaZero would not resist being turned off. The same is true for large language models (indeed, it’s not even clear the idea of turning off a language model is meaningful, since different executions of the model share no state).
In the causal influence diagram approach, I think AlphaZero as formulated would be ‘TI-ignoring’ because it does all learning while ignoring the possibility of interruption and assumes it can execute the optimal action. But other algorithms would not be TI-ignoring—I wonder if MuZero would be TI-ignoring or not? (This corresponds to the Q-learning vs SARSA distinction—if you remember the slippery ice example in Sutton & Barto, the wind/slipping would be like the human overseer interrupting, I guess.)
Why wonder when you can think: what is the substantial difference in MuZero (as described in [1]) that makes the algorithm consider interruptions?
Maybe I show some great ignorance of MDPs, but naively I don’t see how an interrupted game could come into play as a signal in the specified implementations of MuZero:
Explicit signals I can’t see, because the explicitly specified reward u seems contingent ultimately only on the game state / win condition.
One can hypothesize that an implicit signal could be introduced if the algorithm learns to “avoid game states that result in the game being terminated for an out-of-game reason / the game not being played until the end condition”, but how would such learning happen? Can MuZero interrupt the game during training? It sounds unlikely that such a move would be implemented in a Go or Shogi environment. Is there any combination of moves in an Atari game that could cause it?
[1] https://arxiv.org/abs/1911.08265
The most obvious difference is that MuZero learns an environment, it doesn’t take a hardwired simulator handed down from on high. AlphaZero (probably) cannot have any concept of interruption that is not in the simulator and is forced to plan using only the simulator space of outcomes while assuming every action has the outcome the simulator says it has, while MuZero can learn from its on-policy games as well as logged offline games any of which can contain interruptions either explicitly or implicitly (by filtering them out), and it does planning using the model it learns incorporating the possibility of interruptions. (Hence the analogy to Q-learning vs SARSA.)
Even if the interrupted episodes are not set to −1 or 0 rewards (which obviously just directly incentivize a MuZero agent to avoid interruption as simply another aspect of playing against the adversary), and you drop any episode with interruption completely to try to render the agent as ignorant as possible about interruptions, that very filtering could backfire. For example, the resulting ignorance/model uncertainty could motivate avoidance of interruption as part of risk-sensitive play: “I don’t know why, but node X [which triggers interruption] never shows up in training even though earlier state node X-1 does, so I am highly uncertain about its value according to the ensemble of model rollouts, and so X might be extremely negative compared to my known-good alternatives Y and Z while the probability of it being the best possible outcome has an extremely small base rate; so, I will act to avoid X.” (This can also incentivize exploration & exploitation of human manipulation strategies simply because of the uncertainty around its value! Leading to dangerous divergences in different scenarios like testing vs deployment.)
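The risk-sensitive avoidance dynamic described above can be sketched as pessimistic (lower-confidence-bound) action selection over an ensemble of value estimates. This is a toy illustration of the incentive, not MuZero’s actual planning procedure:

```python
from statistics import mean, pstdev

def risk_sensitive_choice(action_values, k=1.0):
    """Pick the action maximizing a pessimistic score over an
    ensemble of value models: mean - k * std.

    action_values: {action: [estimate from each ensemble member]}.
    Actions the ensemble has never seen in training produce high
    disagreement, so they score badly and get avoided.
    """
    def score(estimates):
        return mean(estimates) - k * pstdev(estimates)
    return max(action_values, key=lambda a: score(action_values[a]))

# Node X was filtered out of training, so the ensemble disagrees
# wildly about it; known-good Y has the same mean but low variance.
choice = risk_sensitive_choice({
    "X": [-5.0, 0.0, 5.0],   # mean 0.0, high spread
    "Y": [-0.5, 0.0, 0.5],   # mean 0.0, low spread
})
```

Here the pessimistic agent picks Y, i.e. it avoids the interruption-triggering node purely because of model uncertainty, exactly the “X might be extremely negative” reasoning sketched above.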
(I’d like to see more about educating experts and the public)
Could it be possible to build an AI with no long-term memory? Just make its structure static. If you want it to do a thing, you put in some parameters (“build a house that looks like this”), and they are automatically wiped once the goal is achieved. Since the neural structure is fundamentally static (not sure how to build it, but it should be possible?), the AI cannot rewrite itself to not lose memory, and it probably can’t build a new similar AI either (remember, it’s still an early AGI, not a God-like Superintelligence yet). If it doesn’t remember things, it probably can’t come up with a plan to prevent itself from being reset/turned off, or kill all humans. And then you also reset the whole thing every day just in case.
This approach may not work in the long term (an AI with memory is just too useful not to make), but it might give us more time to come up with other solutions.
This is similar to the concept of myopia. It seems a bit different though, as myopia tends to focus on constraining an AI’s forward-lookingness, whereas your focus is on constraining past memory.
I think myopia has potential, but I’m not sure about blocking long-term memory. Does forgetting the past really prevent an AI from having dangerous plans and objectives? (I haven’t thought about this very much yet, it’s just an initial reaction.)
Quite possibly dumb question: why couldn’t you just bake into any AI’s goal “don’t copy and paste yourself, don’t harm anybody, etc.” and make that common practice?
Imagine a human captured by a mind control fungus, and being mind controlled to not replicate and to do no harm. Also the entire planet is covered with the fungus and the human hates it and wants it to be dead, because of the mind control. (This is not an AI analogy, just an intuition pump to get the human in the right mindset.) Also the fungus is kind of stupid, maybe 90 IQ by human standards for its smartest clusters. What rules could you, as the fungus, realistically give the human, that doesn’t end up with “our entire planet is now on fire” or “we have lost control of the mind control tech” or some other analogue a few years later? Keep in mind that when thinking of rules, you should not use your full intelligence, because you don’t have your full intelligence, because in this analogy we are the fungus.
The point is: there are two kinds of systems. Those that are obviously not dangerous, and those that are not obviously dangerous. This is creating a system of the latter kind, because for any threat you can think of, you will create a rule, so by definition you will end up with an AI that poses no threat that you can think of. But the Superintelligence, by definition, can think of more threats than you, and your rules will give you no safety at all from them.
Notice that you can’t create your feared scenario without “it” and “itself”. That is, the AI must not simply be accomplishing a process, but must also have a sense of self—that this process, run this way, is “me”, and “I” am accomplishing “my” goals, and so “I” can copy “myself” for safety. No matter how many tasks can be done to a super-human level when the model is executed, the “unboxing” we’re all afraid of relies totally on “myself” arising ex nihilo, right? Has that actually changed in any appreciable way with this news? If so, how?
I think people in these sorts of spheres have an anti-anthropocentrism argument that goes something like “if it can reach human-like levels at something, we shouldn’t assume we’re the top and that it can’t keep going.” But when you think about things from the ontological lens (how will the statistical correlations create new distinctions that were annihilated in the training data?), then “it might really quickly reach human-like levels for given tasks, but will never put it all together” is exactly what we expect! The pile of statistical correlations is only as good as the encoding, and the encoding is all human, baby. Representations annihilate detail, and so assuming that the correlations will stay in their static representations isn’t any sort of human arrogance but a pretty straightforward understanding of what data is and how it works.
(Though I will say that SayCan is actually slightly concerning, depending on the details of how it’s going behind the scenes. An embodied approach like that can actually learn things outside of a human frame.)
A system that contains agents is a system that is dangerous, it doesn’t have to “be” an agent. Arguably PaLM already contains simple agents. This is why it’s so important that it understands jokes, because jokes contain agents that are mistaken about the world, which implies the capability to model people with different belief states.
Attaboy! I think half the problem that people have accepting the really obvious arguments for doom is that it just seems such a weird science-fictiony sort of thing to believe. If you can throw a couple of billion at getting attractive musicians and sportspeople to believe it on television you’ll probably be able to at least start a scary jihad before the end of the world.
I’m getting really bored of the idea of being killed by nerve-gas emitting plants and then harvested for my atoms, and will start looking forward to being killed by pitchfork-wielding luddites with flaming torches.
Party’s over. GG humanity.
Not over yet, if I thought there was no hope I wouldn’t be trying.
The game is not unwinnable, it just takes skill and luck.
One could say, in part, it’s a skill issue.
Tsuyoku Naritai!