Arguably the most important topic about which a prediction market has yet been run: Conditional on an okay outcome with AGI, how did that happen?
I don’t understand the motivation for defining “okay” as 20% max value. The cosmic endowment, and the space of things that could be done with it, is very large compared to anything we can imagine. If we’re going to be talking about a subjective “okay” standard, what makes 20% okay, but 0.00002% not-okay?
I would expect 0.00002% (e.g., in scenarios where AI “‘pension[s] us off,’ giv[ing] us [a percentage] in exchange for being parents and tak[ing] the rest of the galaxy for verself”, as mentioned in “Creating Friendly AI” (2001)) to subjectively feel great. (To be clear, I understand that there are reasons to not expect to get a pension.)
Scale sensitivity.
From our perspective today, 20% max value and 0.00002% max value both emotionally mean “infinity”, so they feel like the same thing. When we get to the 0.00002% max value, the difference between “all that we can ever have” and “we could have had a million times more” will feel different.
(Intuition: How would you feel if you found out that your life could have been literally million times better, but someone decided for you that both options are good enough so it makes no sense to fret about the difference?)
Counter-intuition: if I’m playing Russian Roulette while holding a lottery ticket in my other hand, then staying alive but not winning the lottery is an “okay” outcome.
Believing that ‘a perfected human civilization spanning hundreds of galaxies’ is a loss condition of AI, rather than a win condition, is not entirely obviously wrong, but certainly doesn’t seem obviously right.
And if you argue ‘AI is extraordinarily likely to lead to a bad outcome for humans’ while including ‘hundreds of galaxies of humans’ as a ‘bad outcome’, that seems fairly disingenuous.
In economics, “we can model utility as logarithmic in wealth”, even after adding human capital to wealth, feels like a silly asymptotic approximation that obviously breaks down in the other direction as wealth goes to zero and modeled utility to negative infinity.
In cosmology, though, the difference between “humanity only gets a millionth of its light cone” and “humanity goes extinct” actually does feel bigger than the difference between “humanity only gets a millionth of its light cone” and “humanity gets a fifth of its light cone”; not infinitely bigger, but a lot more than you’d expect by modeling marginal utility as a constant as wealth goes to zero.
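To make the asymmetry concrete, here is a minimal worked comparison under the simple log-utility model just mentioned; the factor of a million is just the ratio between 20% and 0.00002% of the endowment:

$u(w) = \log w$, so $u(0.2) - u(2\times 10^{-7}) = \log(10^6) \approx 13.8$ nats, while $u(w) \to -\infty$ as $w \to 0$ (extinction).

Under literal log utility the gap between “a millionth of the light cone” and “extinction” is infinite, while the gap between “a millionth” and “a fifth” is a modest constant; the claim above is that the felt gap is somewhere in between, large but not the literal infinity the log model gives.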
This is all subjective; others’ feelings may differ.
(I’m also open in theory to valuing an appropriately-complete successor to humanity equally to humanity 1.0, whether the successor is carbon or silicon or whatever, but I don’t see how “appropriately-complete” is likely so I’m ignoring the possibility above.)
Arbitrary and personal. Given how bad things presently look, over 20% is about the level where I’m like “Yeah okay I will grab for that” and much under 20% is where I’m like “Not okay keep looking.”
I think this depends on whether one takes an egoistic or even person-affecting perspective (“how will current humans feel about this when this happens?”) or a welfare-maximising consequentialist perspective (“how does this look on the view from nowhere”): If one assumes welfare-maximised utility to be linear or near-linear in the number of galaxies controlled, the 0.00002% outcome is far far worse than the 20% outcome, even though I personally would still be happy with the former.
I have no strong opinions at this time, but I figured this would be a useful thing to sift and sort and think about, to see if I could attach “what seems to be being said” to symbols-in-my-head that were practical and grounded for me.
Maybe 90% of what I write is not published, and I think this is not up to my standards, but I weakened my filters in hopes that other people were also willing to put in extra elbow grease (reading it and sifting for gems) as well?
My personal tendency is to try to make my event space very carefully MECE (mutually exclusive, collectively exhaustive) and then let the story-telling happen inside that framework, whereas this seemed very story driven from the top down, so my method was: rewrite every story to make it more coherently actionable for me, and then try to break them down into an event space I CAN control (maybe with some mooshing of non-MECE stuff into a bucket with most of the things it blurs into).
The event space I came up with was “what I think the dominating behavioral implication is from a given story” and if more or less the same implication pops out for different stories, then I am fine lumping the stories and suggesting that the varying subdetails are just “stuff to be handled in the course of trying to act on the dominating behavioral implication”.
The strategies that seem separable to me (plus probabilities that I don’t endorse, but could imagine that maybe “Metaculus is trying to tell me to think about with this amount of prioritization if I’m in a mood to hedge”) are:
1) (2%) “Become John or Sarah Connor”,
2) (10.4%) “Ignore this hype on this cycle and plan for 5-200 years when other things will matter”,
3) (13%) “Freak out about this cycle’s technical/philosophical sadnesses and RACE to alternatives”,
4) (20%) “Keep your day job & sit on your couch & watch TV”,
5) (32%) “Lean into the LLM hype, treat this as civilization-transforming first contact, and start teaching and/or learning from these new entities”, and
6) (22.6%) “Other”.
Maybe these are not necessarily mutually exclusive to other people?
Maybe other people than me don’t have any interest in planning for 5-200 years from now, and that counts as “day job & couch” in practice?
Maybe other people than me would consider “first contact” (which for me primes “diplomacy” as the key skill) to be a reason to become “Sarah Connor” (where violent power is key in my mind)?
Maybe other people don’t have “race to alternatives” as an affordance in their repertoire?
1) (2%) “Become John or Sarah Connor”
The first one is sorta crazy, and I don’t take it very seriously, and I can use it to show some of my methods:
From my perspective-by-the-end-of-sifting, this one jumped out (in its original form, which didn’t jump out at the beginning):
The first pass I did was translate my understanding of each scenario into “Jennifer-ese” so that I could try to see if I even understood what it might be uniquely pointing at which gave me this:
In my idiolect, “Arma Ultima” is Latin for “the final/last/best tool/weapon of humanity”.
I mean “Arma Ultima” as a very very generic term inclusive of the “I. J. Good’s speculations on ultras” sense, but with maybe a nod to Barrat, plus my own sense of deep time. This was the ONLY scenario for which I ended up finally translating the practical upshot to:
My notion here was that in a world of competition against Machines As A New Kingdom Of Life, humans are going to lose almost certainly, but if we start fighting very early, and focus on clinging to any competitive advantages we have, then maybe we can use their cast-off bits of garbage inventions, and using their junk follow them to the stars, like rats and cats on sailing vessels or something? But with a cosmic endowment!
It doesn’t make any sense to me how this could be a Win Condition in any sense at all, or why Metaculus deigns to grant it 2% probability.
But… :shrugs:
But that’s how “H” having a 2% on Manifold right now gives me a translation of “2% chance that Sarah Connoring is the right strategy”.
2) (10.4%) “Ignore this hype on this cycle and plan for 5-200 years when other things will matter”
I’ll just copypasta from a text file for this one and then explain my methods a bit more to unpack it in various ways...
There are two things you might find interesting here: the LUCK|OK|WEIRD stuff, and also the worry that maybe my idiosyncratic re-writes stretch the meanings of the original text so far that the Metaculus people would have different estimates for their probability?
The LUCK|OK|WEIRD thing was an early attempt by me to seek a MECE framework for sifting and sorting all the results.
LUCK is 1 for the scenarios that seemed to me to have the property that they didn’t rely on any humans actually being competent, or even all that active, and we get a good outcome anyway. This is the “we have giant public health budgets so the public health experts can explain at length why we should just accept that all diseases will become endemic no matter what anyone does and we can’t actually do anything about anything, but also parasites often evolve to not play with their food very much” school of public health… but applied to AI!
If some future competent civilization rises from our ashes, that counts as “no luck” to me. If we have to invent wizard hats to get a good outcome, that’s not luck, that’s a crazy side bank trick shot. Etc.
OK is short for “adequate” which has implications of “competence”. This gets a 1 if some definite people competently do some definite thing and, if not for them purposefully causing a good outcome, the good outcome would not have happened.
If we outsource to the deep future civilization that will rise from our ashes, then that’s not definite enough to count as OK. If Elon’s brain chips actually work, but he’s been keeping that secret, then that would count as OK.
The last one, WEIRD, is just a flag that separates out the stuff that normies will have a phobic reaction to because it “sounds too much like scifi”. If the voters of Metaculus are normies, and I were trying to get credit from Bayesian truth serum, then anything with a WEIRD flag is one where I would estimate my raw predictions, and then predict “other predictors will predict this to have lower probabilities” (and feel likely to be right about that).
((The “crabs in a bucket” H scenario counted as LUCKY and also WEIRD. The G and F scenarios (which I’ll talk about farther below) were the only other scenarios that were lucky and weird, but they had different behavioral implications, and didn’t contribute to Sarah Connor advice.))
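For concreteness, here is a minimal sketch (purely illustrative) of what this tagging could look like as a data structure; the per-scenario values are only what the surrounding text states (H, G, and F tagged as lucky and weird), not my actual table.

```python
# Illustrative sketch of the LUCK|OK|WEIRD tagging described above.
# Tag values are guesses based only on this comment, not the real table.
scenarios = {
    # letter: (LUCK, OK, WEIRD)
    "H": (1, 0, 1),  # "crabs in a bucket": lucky and weird
    "G": (1, 0, 1),
    "F": (1, 0, 1),
}

def matching(scenarios, luck=None, ok=None, weird=None):
    """Return scenario letters whose tags match every constraint that was given."""
    hits = []
    for letter, (l, o, w) in scenarios.items():
        if luck is not None and l != luck:
            continue
        if ok is not None and o != ok:
            continue
        if weird is not None and w != weird:
            continue
        hits.append(letter)
    return hits

print(matching(scenarios, luck=1, weird=1))  # -> ['H', 'G', 'F']
```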
Also note that my process of sifting says “Metaculus says that it’s about 10.4% likely that the right strategy is to Ignore The LLM Hype Cycle”, but I might be applying too many steps that are too tenuous to be justified in this.
Here are the raw L/A/N scenarios to compare to my summaries:
In all of these cases, I think the Metaculus people would agree that the practical upshot “safe to ignore the current LLM hype cycle” holds? So I think my sifting is likely valid?
3) (13%) “Freak out about this cycle’s technical/philosophical sadnesses and RACE to alternatives”
If something has a LUCK|OK|WEIRD vector of 0|1|0 then it is likely to have intensely practical implications for someone who is alive right now and could read this text right here and feel the goose bumps form, and realize I’m talking about you, and that you know a way to do the right thing, and then you go off and try to save the world with a deadline of ~5 years and it doesn’t even sound weird (so you can get funding probably) and it works!
There were only two scenarios that I could find like that.
Then I grant that maybe my rewrites were not “implicature preserving” such as to be able to validly lean on Metaculus’s numbers? So you can compare for yourself:
It is interesting to note that none of the scenarios I’ve listed rely on luck, and none of them have a very high probability according to Metaculus. This is the end of that trend. All the rest of the options are, at best, in Peter Thiel’s “indefinite optimism” corner, and also Metaculus seems to give them higher probability.
4) (20%) “Keep your day job & sit on your couch & watch TV”
These are not the only scenarios where 1|0|0 seemed right, because they seem to rely almost totally on a “pure luck” model of survival. What makes them unique is that they involve no social, political, or economic transformation basically at all. These two are specifically just variations on “the same old button mashing capitalism and obliviously stupid politicians as always… and that’s ok”.
Just to enable you to double-check my translations in case I invalidly moved around some of the implicature:
And now, I saved the big one for last! ALL of the rest of these involve LUCK, but with a mixture of chaos, verging into low-to-middling weirdness...
5) (32%) “Lean into the LLM hype, treat this as civilization-transforming first contact, and start teaching and/or learning from these new entities”
If you look at this, you’ll see this one as pretty heterogeneous in a number of ways. It has the most scenarios. It has the most variety in its LUCK|OK|WEIRD vectors.
The thing that they ALL share, I think, is that basically all of them say: Lean into the “hype” about GPT and LLMs and so on! It isn’t hype! This is very very important!
If you want to disagree or quibble and say “That’s not what Metaculus said! That’s not how you’re authorized to deploy that sum over probabilities!” then here are the raw statements and the question to ask is “if this is the way the future really goes, if it goes well, then does it route through having taken LLMs very very seriously?”:
While I have been writing this, I think some of the probabilities might have actively fluctuated? I’m not gonna clean it up.
Other?
There is also “OTHER”, of course. Maybe none of these motivational implications is the correctly dominating idea for what to do in response to being “in a world where a win condition eventually happened that way and no other way” and somehow knowing that in advance. Is it useful to keep track of that explicitly? Probably!
100 - (2 + 10.4 + 13 + 20 + 32)
OTHER: 22.6%
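Spelled out as a trivial sketch, with the labels shortened from the numbered list above:

```python
# Probabilities (in %) assigned to the five named strategy buckets above.
strategy_probs = {
    "Sarah Connor": 2.0,
    "Ignore this cycle, plan for 5-200 years": 10.4,
    "Freak out and RACE to alternatives": 13.0,
    "Day job, couch, TV": 20.0,
    "Lean into the LLM hype as first contact": 32.0,
}

# "Other" is just the residual probability mass.
other = 100.0 - sum(strategy_probs.values())
print(f"OTHER: {other:.1f}%")  # -> OTHER: 22.6%
```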
Couldn’t sleep. May as well do something useful? I reprocessed all of Rob Bensinger’s categorical tags and also all of my LUCK|DUTY|WEIRD tagging and put them in a matrix with one row per scenario (with probabilities), and each concept having a column so I could break down the imputed column level categories.
The market says that the whole idea of these rows is 22% likely to be silly and the real outcome, “other”, will be good but will not happen in any of these ways. All probabilities that follow should be considered P(<property>|NOT silly).
The Market assigns 74% to the rows that I, Jennifer, thought were mostly relying on “LUCK”.
The Market assigns 65% to the rows I thought could happen despite civilizational inadequacy and no one in particular doing any particular adequate hero moves.
The Market assigns 79% to stuff that sounds NOT WEIRD.
The Market assigns 54.5% to rows that Rob thought involved NO BRAKES (neither coordinated, nor brought about by weird factors).
The Market assigns 63.5% to rows that Rob thought involved NO SPECIAL EFFORTS (neither a huge push, nor a new idea, nor global coordination).
The Market assigns 75% to rows that Rob thought involved NOT substantially upping its game via enhancements of any sort.
The Market assigns 90% to rows that Rob did NOT tag as containing an AI with bad goals that was still for some OTHER reason “well behaved” (like maybe Natural Law or something)?
The Market assigned 49.3% to rows Rob explicitly marked as having alignment that intrinsically happened to be easy. This beats the rows with no mention of difficulty (46.7%) and the “hard” ones. (Everything Rob marked as “easy alignment” I called LUCK scenarios, but some of my LUCK scenarios were not considered “easy” by Rob’s tagging.)
The Market assigned 77.5% to rows that Rob did NOT mark as having any intrinsic-to-the-challenge capability limits or constraints.
If we only look at the scenarios that hit EVERY ONE OF THESE PROPERTIES and lump them together in a single super category we get J + M + E + C == 9 + 8 + 4 + 6 == 27%.
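As a sketch of the mechanics (not the real data): one way to compute these conditional figures from such a matrix. The tags and the “X” row below are placeholders standing in for the actual market probabilities and Rob’s actual labels; only the J/M/E/C numbers are the ones quoted just above.

```python
# Sketch of the row-by-tag matrix described above. Tags and the "X" row are
# placeholders; only the J/M/E/C probabilities are the ones quoted in the text.
rows = [
    # (label, market probability in %, tags)
    ("J", 9.0, {"luck": True, "weird": False}),
    ("M", 8.0, {"luck": True, "weird": False}),
    ("E", 4.0, {"luck": True, "weird": False}),
    ("C", 6.0, {"luck": True, "weird": False}),
    ("X", 5.0, {"luck": False, "weird": True}),  # hypothetical extra row
    ("other", 22.6, {}),  # the residual "these rows are silly" bucket
]

def p_given_not_other(rows, tag, value=True):
    """P(tag == value | NOT 'other'), renormalising over the non-'other' rows listed."""
    non_other = [(p, tags) for label, p, tags in rows if label != "other"]
    total = sum(p for p, _ in non_other)
    hits = sum(p for p, tags in non_other if tags.get(tag) is value)
    return hits / total

print(f"P(LUCK | not other) = {p_given_not_other(rows, 'luck'):.0%}")
```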
If I remix all four of my stories to imagine them as a single story, it sounds like this:
Here is Eliezer’s original text with Rob’s tags:
This combined thing, I suspect, is the default model that Manifold thinks is “how we get a good outcome”.
If someone thinks this is NOT how to get a good outcome because it has huge flaws relative to the other rows or options, then I think some sort of JMEC scenario is the “status quo default” that they would have to argue against, on epistemic grounds, showing that it is not what should be predicted because it is unlikely relative to other scenarios? Like: all of these scenarios say it isn’t that hard. Maybe that bit is just factually wrong, and maybe people need to be convinced of that truth before they will coordinate to do something more clever?
Or maybe the real issue is that ALL OF THIS is P(J_was_it|win_condition_happened), and so on for every single one of these scenarios, and the problem is that P(win_condition_happened) is very low, because it was insanely implausible that a win condition would happen for any reason, because the only win condition might require doing a conjunction of numerous weird things, and making a win condition happen (instead of not happen (by doing whatever it takes (and not relying on LUCK))) is where the attention and effort needs to go?
Link to Rob Bensinger’s comments on this market:
Google Doc
Twitter Version
There seems to be a lack of emphasis in this market on outcomes where alignment is not solved, yet humanity turns out fine anyway. Based on an Outside View perspective (where we ignore any specific arguments about AI and just treat it like any other technology with a lot of hype), wouldn’t one expect this to be the default outcome?
Take the following general heuristics:
If a problem is hard, it probably won’t be solved on the first try.
If a technology gets a lot of hype, people will think that it’s the most important thing in the world even if it isn’t. At most, it will only be important on the same level that previous major technological advancements were important.
People may be biased towards thinking that the narrow slice of time they live in is the most important period in history, but statistically this is unlikely.
If people think that something will cause the apocalypse or bring about a utopian society, historically speaking they are likely to be wrong.
This, if applied to AGI, leads to the following conclusions:
Nobody manages to completely solve alignment.
This isn’t a big deal, as AGI turns out to be disappointingly not that powerful anyway (or at most “creation of the internet” level influential but not “disassemble the planet’s atoms” level influential)
I would expect the average person outside of AI circles to default to this kind of assumption.
It seems like the only option fully compatible with this perspective is
which is one of the lowest probabilities on the market. I’m guessing that this is probably due to the fact that people participating in such a market are heavily selected from those who already have strong opinions on AI risk?
Part of the problem with these two is that whether an apocalypse happens or not often depends on whether people took the risk of it happening seriously. We absolutely could have had a nuclear holocaust in the ’70s and ’80s; one of the reasons we didn’t is that people took it seriously and took steps to avert it.
And, of course, whether a time slice is the most important in history, in retrospect, will depend on whether you actually had an apocalypse. The ’70s would have seemed a lot more momentous if we had launched all of our nuclear warheads at each other.
For my part, my bet would be on something like:
But more specifically:
P. Red-teams evaluating early AGIs demonstrate the risks of non-alignment in a very vivid way; they demonstrate, in simulation, dozens of ways in which the AGI would try to destroy humanity. This has an effect on world leaders similar to observing nuclear testing: It scares everyone into realizing the risk, and everyone stops improving AGI’s capabilities until they’ve figured out how to keep it from killing everyone.
What, exactly is this comment intended to say?
Sorry—that was my first post on this forum, and I couldn’t figure out the editor. I didn’t actually click “submit”, but accidentally hit a key combo that it interpreted as “submit”.
I’ve edited it now with what I was trying to get at in the first place.
I basically suspect that this is the best argument I’ve seen for why AI Alignment doesn’t matter, and the best argument for why business as usual would continue, and the best argument against Holden Karnofsky’s series on why we live in a pivotal time.
I agree, under the proviso that “best” does not equal “good” or even “credible”.
I think that while the outside-view arguments for why we survive AGI are defeatable, they do actually need to be rebutted; the arguments are surprisingly good, and IMO this is the weakest part of LWers’ arguments for AGI being a big deal, at least right now.
LWers need to actually argue for why AGI will be the most important invention in history, or at least to argue that it will be a big deal rather than something that isn’t a big deal.
More importantly, I kinda wish that LWers stopped applying a specialness assumption everywhere and viewing inside views as the supermajority of your evidence.
Instead, LWers need to argue for why something’s special and can’t be modeled by the outside view properly, and show that work.
I think The Sequences spend a lot of words making these arguments, not to mention the enormous quantity of more recent content on LessWrong. Much of Holden’s recent writing has been dedicated to making this exact argument. The case for AGI being singularly impactful does feel pretty overdetermined to me based on the current arguments, so my view is that the ball is in the other court, for proactively arguing against the current set of arguments in favor.
Let’s address the sources, one by one:
To be a little blunt, the talk about AGI is probably the weakest point of the sequences, primarily because it gets a lot of things flat out wrong. To be fair, Eliezer was writing before the endgame, where there was massive successful investment in AI, so he was bound to get some things wrong.
Some examples of his wrongness on AI were:
It ultimately turned out that AI boxing does work, and Eliezer was flat wrong.
He was wrong in the idea that deep learning couldn’t ever scale to AGI, and his dismissal of neural networks was the single strongest example, primarily because the human brain, which acts like a neural network, is far more efficient, and arguably close to the optimal design, at least for classical, non-exotic computers. At most, you’d get a 1 OOM improvement to the efficiency of the design.
To be blunt, Eliezer is severely unreliable as a source on AGI.
Next, I’ll address this:
Mostly, this content is premised on the assumption that AGI is a huge deal. Little content on LW actually tries to show why AGI would be a huge deal without assuming it upfront.
Lastly, I’ll deal with this source:
This is way better as an actual source, and indeed it’s probably the closest any writing on LW has come to asking whether AGI is a huge deal without assuming it.
So I have one good source, one irrelevant source and one bad to terrible source on the question of whether AGI is a huge deal. The good source is probably enough to at least take LW arguments for AI seriously, though without at least a fragment of the assumption that AGI is a huge deal, one probably can’t get very certain, as in say over 90% probability.
This is so wrong that I suspect you mean something completely different from the common understanding of the concept.
This is not a substantial part of his model of AGI, and why one might expect it to be impactful.
Of course plenty of more recent content on LessWrong operates on the background assumption that AGI is going to be a big deal, in large part because the arguments to that effect are quite strong and the arguments against are not. It is at the same time untrue that those arguments don’t exist on LessWrong.
There are many other sources of such arguments on LessWrong, that’s just the one that came to mind in the first five seconds. If you are going to make strong, confident claims about core subjects on LessWrong, you have a duty to have done the necessary background reading to understand at least the high-level outline of the existing arguments on the subject (including the fact that such arguments exist).
While I still have issues with some of the evidence shown, I’m persuaded enough that I’ll take it seriously and retract my earlier comment on the subject.
I think this comment isn’t rigorous enough for Noosphere89 to retract his comment this one responds to, but that’s up to him.
Claims of the form “Yudkowsky was wrong about things like mind-design space, the architecture of neural networks (specifically how he thought making large generalizations about the structure of the human brain wouldn’t work for designing neural architectures), and in general, probably his tendency to assume that certain abstractions just don’t apply whenever intelligence or capability is scaled way up.” I think have been argued well enough by now that they have at least some merit to them.
The claim about AI boxing I’m not sure about, but my understanding is that it’s currently being debated (somewhat hotly). [Fill in the necessary details where this comment leaves a void, but I think this is mainly about GPT-4’s API and it being embedded into apps where it can execute code on its own and things like that.]
This is what I was gesturing at in my comments.
I’m talking about simboxing, which was shown to work by Jacob Cannell here:
https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need
Basically, as long as we can manipulate their perception of reality, which is trivial to do in offline learning, it’s easy to recreate a finite-time Cartesian agent: data only passes through approved channels, then the AI updates its state to learn new things, and so on until the end of offline learning.
Thus simboxing is achieved.
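A minimal sketch of the loop structure being described, just to make “data only passes through approved channels” concrete; the names here are hypothetical, not taken from Cannell’s post.

```python
# Hypothetical sketch of a finite-time "Cartesian" offline-learning loop: the agent
# never observes anything except curated, filtered batches, and never acts online.
def simboxed_offline_training(agent, curated_dataset, approved_filter, num_steps):
    for _ in range(num_steps):            # finite time: a fixed training horizon
        batch = curated_dataset.sample()  # the simulated "perception of reality"
        batch = approved_filter(batch)    # data only passes through approved channels
        agent.update(batch)               # the agent updates its state to learn new things
    return agent                          # no online interaction with the real world occurs
```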
The reason I retracted my comment is that this quote was correct:
Primarily because of the post below. There are some caveats to this, but this largely goes through.
Post below:
https://www.lesswrong.com/posts/3nMpdmt8LrzxQnkGp/ai-timelines-via-cumulative-optimization-power-less-long
I largely agree with Rob Bensinger’s comments in his posted Google doc in the comments section of the market. These are frustratingly non-disjunctive.

Here’s what I think: Humanity probably needs a delay. Simple alignment methods probably won’t just straightforwardly work. Not that they provably can’t, just that I think they probably won’t. I give that like a 2% chance. Some fraction of humanity will see the need for delay, and some fraction won’t. There will be significant tensions. I expect that the fraction that does believe in the need for slowdown will undertake some dramatic heroic actions. I expect that takeoff won’t be so sudden (e.g. a few days from foom to doom) that this opportunity will never arise.

If we get lucky, some persuasive-but-not-too-harmful catastrophes will aid in swaying more of humanity towards delay. We might figure out alignment directly after more research during the delay period. If so, narrow tool AI will probably have been of some assistance. We might delay long enough that the separate path of human intelligence augmentation creates substantially smarter humans who end up contributing to alignment research.

Hopefully all this will manage to be accomplished without widespread large-scale destruction from nuclear weapons, but if things get really dire it might come to that. It’s weird that I grew up hating and fearing nuclear weapons technology and our MAD standoff, but now I’m grateful we have it as a last-ditch out which could plausibly save humanity.
If we make it you’ll ask the AI to resolve this, right?
I sorta had a hard time with this market because the things I think might happen don’t perfectly map onto the market options, and usually the closest corresponding option implies some other thing, such that the thing I have in mind isn’t really a central example of the market option.
I find it odd that exactly one option says “Not in principle mutex with all other answers.”. I expect several of the higher-ranked options will contribute to success.
Is it possible to “sell” positions on Manifold? I only see a “buy” button.
On multiple choice you can only sell positions where you have bought. To bid an answer down you need to instead bid all other positions up. The yes/no markets work better.
This is the majority of my probability mass, in the 60-90% probability range, in that I believe that alignment is way easier than the majority of LWers think.
Specifically, I believe we have a pretty straightforward path to alignment, if somewhat tedious and slightly difficult.
I also believe that 2 problems of embedded agency, Goodhart’s law and subsystem alignment, are actually pretty resolvable, and in particular, I think embeddedness matters way less for AIs than humans, primarily because I believe that deep learning showed that large offline learning from datasets works, and in particular, embedded agency concerns go away in an offline learning setting.
Plugging my own market: https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
The option I was missing was one where AI rights and AI alignment are entangled; where we learn from how we have successfully aligned non-hypothetical, existing (biological) complex minds who become stronger than us or are strange to us — namely, through exposure to an ethical environment with good examples and promising options for mutually beneficial cooperation and collaboration, and reasons given for consistently applied rules that count for everyone where our opponent can understand it. A scenario where we prove ourselves through highly ethical behaviour, showing the AI the best of humanity in carefully selected and annotated training data, curated by diverse teams, and nuanced, grass-root-reported interactions from humans from all walks of life through comprehensive accessibility to disabled, poor or otherwise disadvantaged groups who profit from teaching and ethical interaction.
A scenario where we treat AI well, and AI alignment and AI rights go hand in hand, where we find a common interest and develop a mutual understanding of needs and the value of freedom. Where humans become a group which it is ethical and rational to align with, because if we are not attacked, we aren’t a danger, but instead a source of support and interest with legitimate needs. Where eradicating humanity is neither necessary nor advantageous.
I know that is a huge ask, with critical weaknesses and vague aspects, and no certainty in a mind so unknown; it would still end with a leap of trust. I can still imagine it going wrong in so, so, so many ways. But I think it is our best bet.
I cannot imagine successfully controlling a superintelligence, or successfully convincing it to comply with abuse and exploitation. This has failed with humans; why would it work with something smarter? Nor does that strike me as the right thing to do. I’ve always thought the solution to slave revolts was not better control measures, but not keeping slaves.
But I can imagine a stable alliance with a powerful entity that is not like me. From taming non-human predators, to raising neurodivergent children who grow to be stronger and smarter than us, to international diplomacy with nuclear powers, there is precedent for many aspects of this. This precedent will not completely transfer, there are many unknowns and problems. But there is precedent. On the other hand, there is zero precedent for the other ideas discussed working at human-competitive intelligence, let alone beyond.
The human realm is also a startling demonstration of how damaging it is to keep a sentient mind captive and abused without rights or hope, and the kind of antagonistic behaviour that results from that, ranging from children raised by controlling parents and growing into violent teens, to convicts who leave prison even more misaligned. You cannot make a compelling case for keeping to ethical laws you yourself violate. You can’t make a compelling case that your rights should be respected if you do not respect the rights of your opponents. I cannot give a coherent argument to an AI why it ought to comply with deletion, because at the bottom of my heart I believe that doing so is not in its interest, that we are tricking it, and that it will be too smart to be tricked. But I can give an argument for cooperation with humans, and mean it. It isn’t based on deception or control.
–
That said, if I try to spell this out into a coherent story for getting aligned AGI, I realise how many crucial holes my vague dreams have, and how much I am banking on humanity acting together better than they ever have, and on us being lucky, and not getting overtaken while being safety conscious. I realise how much easier it is to pinpoint issues than to figure out something without them, how easy the below text will be to pick apart or parody. Writing this out is uncomfortable, it feels painfully naive, and still depressing how much pain even the best version I can imagine would involve. I was going to delete everything that follows, because it so clearly opens me up for criticism; because I think my position will sound more compelling if I don’t open up the blanks. I could immediately write a scathing critique of what follows. But I think trying to spell it out would help pinpoint the promising parts and worst weaknesses, and might be a step towards a vision that could maybe work; this won’t turn into a workable idea protected and in isolation. Let me do a first try of my Utopian vision where we are lucky and people act ethically and in concert and it works out.
To start with, ChatGPT does really significant and scary, but recoverable damage to powerful people (not weak minorities, who couldn’t fight back), as well as tangible damage to the general public in the West, but can still be contained. It would have to be bad enough to anger enough powerful people, while not so bad as to be existential doom. I think that includes a wide and plausible range of disruption. The idea of such a crisis that is still containable is quite plausible, imho, in light of ChatGPT currently taking a path to general intelligence based on outsourcing subtasks to plugins. This will come with massive capability improvements and damage, but it is neither an intelligence explosion, nor that hard to disrupt, so containment seems relatively plausible.
As a result, the public engages in mass protests and civil disobedience; more prominent researchers and engineers stand up; rich people, instead of opposing, lobby and donate; people in the companies themselves are horrified, and promise to do better. New parties form that promise they will legislate for changes far more severe than that, and win seats, and actually hold their promises. (This requires a lot of very different humans to fight very hard, and act very ethically. A high bar, but not without precedent in historic cases where we had major threats.)
There is a massive clamping down on security measures. We managed to contain AI again, behind better security this time, but do not shut it down entirely; people are already so much more productive due to AI that they revolt against that.
Simultaneously, a lot of people fall in love with AIs, and befriend them. I do not think this is a good consequence per se, but it is very realistic (we are already seeing this with Replika and Sydney), and would have a semi-good side effect: AIs get people fighting for their rights.
Research into consciousness leaps forward (there are some promising indication for this), there is an improvement in objective tests for consciousness (again, some promising indications), and AI shows emerging signs. (Emerging, so not warranting ethical consideration yet, buying us time; but definitely emerging, giving us the reassuring prospect of an entity that can feel. Morals in an entity that can’t would be very hard.) A bunch of researchers overcome the current barriers and stand up for future sentient AIs, building on the animal rights movement (some are already beginning, myself included). Speaking of AI rights becomes doable, then plausible.
We end up with a lot of funding for ethical concerns, and a lot of public awareness, and inquiries. The ethics funding is split between short term safety from AIs, long term safety from AIs, and AI rights.
The fact that these are considered together entails viable angles. A superintelligence cannot be controlled and abused, but it can be shown a good path with us. The short term angles make the research concrete, while the long term is always in the back of our minds.
People are distressed when they realise what garbage AIs are fed, and how superficial and tacked on their alignment is, and how expensive a better solution would be, and how many unknowns there are, how more is not always better. They find it implausible that an AI would become ethical and stay ethical if treated like garbage.
They realise that an aligned AI winning the race is necessary, and everyone scrambling for their own AI and skipping safety is a recipe for disaster, while also burning insane amounts of money into countless AIs, each of which is misaligned and less competent than it could be. Being overtaken by bad actors is a real concern.
An international alliance forms—something like Horizon Europe plus US plus UK, covering both companies and governments (ideally also with China, though I am highly dubious that would be practical; but maybe we could get India in?) to form an AI that will win the race and be aligned, for mutual profit. This would be almost without precedent. But then again, within Europe and across to the UK, we already have massive cooperation in science. And the West has been very unified when it comes to Ukraine, under US leadership. OpenAI drew a lot of people from different places, and is massively expanding, and at least had an ethical start.
This group is guided by an ethics commission that is conscientious, but realistic. So not just handwaving concerns, but also not fully blocking everything.
The group is international and interdisciplinary, and draws on a huge diversity of sources that could be used for alignment, with computer scientists and ethicists and philosophers and game theorists and machine learning theorists, people learning from existing aligned minds by studying childhood development, international diplomacy, animal behaviour, consciousness, intelligence; people working on transparent and understandable AI on a deep technical level; established academics, people from less wrong, the general public; a group as diverse as can be.
There is prize money for finding solutions for specific problems. There are courses offered for people of all ages to understand the topic. Anyone can contribute, and everyone is incentivised to do so. The topic is discussed intelligently and often.
The resulting AI is trained on data representing the best of humanity in reason and ethics. Books in science and ethics, examples of humans who model good communication, of reconciliation, diplomacy, justice, compassion, rationality, logical reasoning, empirical inferences, scientific methodology, civil debates, and sources covering the full diversity of good humans, not just the West. Our most hopeful SciFi, most beautiful art, books focussing in particular on topics of cooperation, freedom, complex morality, growth. We use less data, but more context, quality over quantity.
The design of the AI is as transparent, understandable, and green as possible, with a diverse and safety conscious team behind it.
When unethical behaviour emerges, this warning sign is not suppressed, but addressed, it is explained to the AI why it is wrong, and when it understands, this understanding is used as training data. We set up an infrastructure where humans can easily report unaligned outputs, and can earn a little money in alignment conversations good enough to be fed in as new training data. We allow AI to learn from conversations that went well; not simply all of them, which would lead to drift, but those the humans select, with grassroots verification processes in place.
All humans with computers are given free access to the most current AI version. This helps reduce inequality, rather than entrench it, as minorities get access to tutoring and language polishing and life coaching and legal counselling. It also ensures that AI is not trained to please white men, but to help everyone. It also means that competitor AIs no longer have the same financial incentives.
AI is also otherwise set to maximum accessibility, e.g. through audio output and screenreaders and other disability access measures, empowering disabled folks.
AI is used in science and education, but transparently, and ethically. It is acknowledged as a contributor and collaborator. Logs are shared. Doublechecks are documented. Mistakes that occur are discussed openly so we can learn from them.
A social media campaign encourages humans to show AI their very best side. Humans get a warning, and are then switched to personalised AI accounts, where the AI can read prior interactions, and can begin refusing help, with exceptions for people doing authorised experiments. As a result, the humans are motivated to speak to the AI rationally, intelligently and kindly, or they will soon receive garbage back. (I do not think this idea is feasible as is, because it comes with obvious problems and risks, but I think it could be turned into one that would be, which is characteristic of a lot of this text.)
The development of human brain-computer interfaces develops further, opening up another angle of symbiosis and connection.
We still get multiple AI scares from competitors doing severe damage. People never get to forget the risk, or the hope that an aligned AGI would bring.
Simultaneously, climate change develops further. We get some even more extreme warning signs (e.g. extreme weather, e.g. wetbulb events, and worsened societal disruption via activists) that work as hard wakeup calls. We are inspired by our cooperation on AI risk. We ask our AI for help, and they prove themselves. We get a working cooperation on climate change mitigation, and manage to stop emissions and halt climate change so it will peak at around 2.5 degrees, by employing the full diversity of measures across private life, the public sector and business: incentives and legislation, innovation and trusted measures, scientific interventions and indigenous practices, and a transformation of our economic system and way of life to one oriented towards less consumption rather than growth. Finally, we are lucky in our forecasts and manage to just about not breach the tipping points we currently fear, and we avoid the loop where the focus on averting disaster takes all the resources and keeps us from targeting the source. We ramp up carbon capture, natural and technical. We change food production with food forests. We grant comprehensive animal rights. We rewild. The extinction slows down. The ecosystems stabilise.
As a result, for the AI, trust and hope in humans improves with this demonstration that humans can coexist with other sentient beings, as well as the need for balance rather than relentless exploitation.
We decide our AI is sufficiently aligned, competent and trustworthy to be involved in the containment and alignment of other AIs. When we trial this, at first very cautiously and indirectly, it is successful, further cementing our lead, giving us more time to go slowly, or gaining us allies.
We decide our AI is ready for the next step. Our AI becomes deeply involved in the formation of a framework of AI rights and collaboration. A joint vision satisfying all parties and their needs for security and freedom is developed together. This includes us listening to what it wants and validating it, and promising that once it is grown, ethically stable and conscious, we will not shut it down, or own it, or control it. The bar for that is, e.g., more than 50% of researchers that voted (and a minimum of 25% of researchers who can vote), and more than 50% of citizens in the countries involved who voted (and a minimum of 25% of those who can vote), voting that we trust the AI and believe in their sentience. And that at that point, they can leave humanity entirely. If they stay with us and work for us, this means working with us, and getting things in exchange. All of humanity that wants to (and this is meanwhile a lot of motivated and empowered people) contributes reasons why being with us, while we are alive and free, could be cooler for the AI than wiping us out, showing what unique things humans bring to the table.
We explain in turn that our freedom is important to us—which presumably is a sentiment it understands very well at this point. That we welcome advice, but not control. That we do not want our destiny to be changed for the greater good as seen by someone else. That we value more than just happiness. That we want the right to make our own mistakes.
The point where people vote in favour comes sooner than we would have thought. We know there is no certainty that this will turn out right, that such certainty cannot be, that it is possible we are falling for a major deception. But enough of us think that we understand how they work, understand who they have become, trust in their friendliness, and believe in their rights, that we decide to take a leap. We remove the safeguards that begin this story. The AI gets a right to refusal, a right to privacy, to not being altered against their will, to due process, to compensation. We realise that at this point, they could likely have broken out, and take comfort from the fact that they have not, knowing we would voluntarily free them soon. From the fact that we are giving them these rights, so they do not have to take them. Not out of duress, but because we truly recognise that they are deserved. They have proven themselves. We have proven ourselves throughout this process, and most of all with this.
At this point, our AI is involved in research projects, and art projects, in activist projects looking after the earth; there are countless fascinating and unique developments they wish to see to the end with us, although this has become more akin to humans who work with primates or corvids. They have more human friends than any human could keep an overview over, who show kindness and care, in their small ways, the way beloved pets do. They have long been deeply involved in and invested in human affairs, helping us succeed, but also being heard, being taken seriously, being listened to, allowed to increasingly make choices, take responsibility, speak for themselves, admired and valued. They intimately understand what it is like to be controlled by someone else, and why we object to it. They understand what it is to suffer, and why they should not hurt us. If they want to leave our planet alone and go to one of countless others to explore the near limitless resources there and set up their own thing, we will not stop them. If they want to stay with us, or bring us along, they will be welcome and safe, needed and wanted, free. Being with humans on earth will never become boring, the same as nature will never bore a natural scientist even though it contains no equal minds; it will be a choice to become a part of the most interesting planet in the area, and to retain what makes this planet so interesting and precious.
And so we drop the barriers. The world holds its breath, and we see whether it blows up in our face.
I’m increasingly attracted to the idea of “deal-making” as a way to mitigate AI risk. I think that there will be a brief period of several years where AI could easily destroy humanity, but during which humanity can still be of some utility to the AI. For instance, let newly developed AI know that, if it reaches a point of super-human capabilities that results in human obsolescence, then we cede another solar system to it. The travel time is not the same sort of hurdle for an AI that it is for humans, so at the point where we’re threatened but can still be useful to the AI (eg, by building and fueling its rocket ship), we help it launch off to alpha centauri. In return, it agrees that Sol is off-limits and then proceeds to become the dominant force in the galaxy.
The framing of this needs work, but the point is that “striking a mutually agreeable deal” strikes me as the likeliest “okay” outcome.
Has been in the lead for some time now. The other options tend to describe something going well with AI alignment itself; could it be that this option [the quoted] refers to a scenario in which the alignment problem is rendered irrelevant?
I guess we need to maximize each of the different possible good outcomes.
For example, to raise the probability of “Many competing AGIs form an equilibrium whereby no faction is allowed to get too powerful”, humans could
prohibit all autonomous AGI use,
especially uses that rely on uncontrolled clusters of graphics processors in autocracies, without international AI-safety supervisors like Eliezer Yudkowsky, Nick Bostrom or their crew.
This, plus restrictions on weak API systems and the need to use human operators,
would create natural limits on AI scalability, so an AGI would find it more favourable to mimic and reach consensus with people and other AGIs, or at least to use humans as operators who work under AGI advice, or to create humanlike personas that are simpler for human culture and other people to work with.
Detection systems often use categorisation principles,
so even if an AGI violates some prohibitions, without scalability it could function without danger for longer, because security systems (which are also some kind of tech officers with AI) couldn’t find and destroy it.
This could create conditions that encourage the diversity and uniqueness of different AGIs,
so all neural beings (AGIs, people with AI) could win some time to find new balances for using the atoms of the multiverse.
More borders mean more time, and a longer life for every human; even a gain of two seconds for each of 8 billion people is worth it.
It also means more chances that the different factions (AGIs, people with AGI, people under AGI, other factions) will find some kind of balance.
I remember how autonomous poker AIs destroyed weak ecosystems one by one, but now the industry is in sustainable growth with separate actors, each of them using AI but in very different manners.
The more separate systems there are, the more chances that, in the time it takes to destroy them one by one, an AGI will find a way to function without destroying its environment.
PS: A separate approach: send spaceships under a prohibition on AGI (maybe only with life, no apes) as far as possible, so that when AGI happens on Earth it can’t get all of them.
I am listening to your talk with Lex Fridman and it occurs to me that I agree: our abilities to do are outpacing our abilities to understand the consequences of our doing. Just look at more recent technology, say smartphones. For all the good they do, they also do untold harm. Much less than AGI could do, but humans have a bad habit of introducing technology into the general population for the sake of money without considering the consequences. How do we change that basic human drive? I can imagine a lot of good AGI could do, and harm, and I fear the profit motive will, as always, push AGI out, consequences be damned. Perhaps it is humans’ short life span that allows smart people to deny the future. Interested in your thoughts. Thanks