Here’s an underrated frame for why AI alignment is likely necessary for the future to go very well under human values, even though in our current society we don’t need human-to-human alignment to make modern capitalism work well and can rely on selfishness instead.
The reason is that there’s a real likelihood that, in the near future, human labor, and more generally human existence, will not be economically valuable, or will even have negative economic value, say where adding a human to an AI company makes that company worse off.
The reason this matters is that once labor is much easier to scale than capital, as is likely in an AI future, it becomes economically viable or even beneficial to break a lot of the rules that help humans survive, contra Matthew Barnett’s view. This is incentivized even further by the fact that an unaligned AI released into society would likely not be punishable or incentivizable by mere humans, simply because controlling robotic armies and robotic workforces would let it dispense with the societal constraints humans have to accept.
dr_s describes an equilibrium that is perfectly stable under the economics of AI automation yet very bad for humans. Avoiding this sort of equilibrium can’t be done through economic forces, because the companies involved are too powerful for any real incentives to bite: they can neutralize an attempted boycott, or turn shopping-around to their own benefit. Avoiding this outcome therefore requires alignment to your values, and can’t work through selfishness alone:
Consider a scenario in which AGI and human-equivalent robotics are developed and end up owned (via e.g. controlling exclusively the infrastructure that runs it, and being closed source) by a group of, say, 10,000 people overall who have some share in this automation capital. If these people have exclusive access to it, a perfectly functional equilibrium is “they trade among peers goods produced by their automated workers and leave everyone else to fend for themselves”.
This framing of the alignment problem, of how to get an AI that values humans such that this outcome is prevented, also has an important implication:
It’s not enough to solve the technical problem of alignment without also modeling the social situation, because of suffering-risk and catastrophic-risk issues. It also means the level of alignment of AI needs to be closer to that of fictional benevolent angels than to how humans relate to other humans, so it motivates a more ambitious version of the alignment objectives than merely making AIs not break the law or steal from humans.
I’m actually reasonably hopeful the more ambitious versions of alignment are possible, and think there’s a realistic chance we can actually do them.
But we actually need to do the work, and AI that automates everything might come in your lifetime, so we should prepare the foundations soon.
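This also BTW explains why we cannot rely on economic arguments about AI to make the future go well.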
I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
One potential answer to how we might break the circularity is the AI control agenda, which works in a specific, useful capability range but fails if we assume arbitrarily/infinitely capable AIs.
This might already be enough to break the circularity, given somewhat favorable assumptions.
But there is a point here: absent AI control strategies, we do need a baseline level of alignment in general.
Thankfully, I believe this is likely to be the case by default.
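See Seth Herd’s comment below for a perspective:
https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1?commentId=cakcEJu389j7Epgqt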
I’ve asked this question to others, but would like to know your perspective (because our conversations with you have been genuinely illuminating for me). I’d be really interested in knowing your views on more of a control-by-power-hungry-humans side of AI risk.
For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don’t think we could trust any of the current leading AI labs to use that power fairly. I don’t think this lab would voluntarily decide to give up control over AGI either (intuitively, it would take quite something for anyone to give up such a source of power). Can this issue be somehow resolved by humans? Are there any plans (or at least hopeful plans) for such a scenario to prevent really bad outcomes?
This is my main risk scenario nowadays, though I don’t really like calling it an existential risk, because the billionaires can survive and spread across the universe, so some humans would survive.
The solution to this problem is fundamentally political, and probably requires massive reforms of both government and the economy, the details of which I don’t know yet.
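I wish more people worked on this.
Yep, Seth has really clearly outlined the strategy and now I can see what I missed on the first reading. Thanks to both of you!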
Here’s a perspective on AI automating everything I haven’t seen before, which is relevant to AI governance.
AI being able to automate robotics and AI research will eventually transform the physical world into something that resembles the online/virtual world much more closely.
Depending on the speed of AI research, this may happen over several months or over something more like a decade, but there’s a plausible path to AI turning the physical world into something more like an online/virtual world.
The implications for AI governance are somewhat similar to the implications of companies moderating social media in the present.
Here are several implications:
You really can’t have the level of liberty and autonomy that people today have around basically everything, for the same reason that there are no real free-speech rights on social media, or really any of the rights we take for granted. One of the basic reasons for this is that it’s very low-cost to disrupt online spaces. Even if we avoid the vulnerable world hypothesis, which posits tech so destructive as to be an x-risk and so widespread that you essentially need a totalitarian state to prevent it from being developed in the absence of effective defenses, it’s likely that there exist ways to constantly troll and degrade the discourse far more effectively than in real life. One of the reasons a lot of social media is as strict about moderation as it is today is that it’s too easy to degrade the discourse and troll everyone in a way real life doesn’t allow, and it’s too easy to destroy on social media relative to creating.
As a corollary, the alignment of governments to citizens becomes far more important in the future than it was in the 18th-21st centuries, because we will likely have to remove a lot of the checks, like democracy, that currently ensure a misaligned leader doesn’t destroy a nation, and suffice it to say that the alignment of social media companies to their users is not going to cut it.
One failure mode of social media that we should try to avoid is that the platforms don’t really have any ability to hold nuanced conversations for a number of reasons.
Huh, I just realized there are two different meanings/goals of moderation/censorship, and it is too easy to conflate them if you don’t pay attention.
One is the kind where you don’t want the users of your system to e.g. organize a crime. The other is where you don’t want discussions to be disrupted, e.g. by trolls.
Superficially, they seem like the same thing: you have moderators, they make the rules, and give bans to people who break them. But now this seems mostly coincidental to me: you have some technical tools, so you use them for both purposes, because that’s all you have. However, from the perspective of the people who want to organize a crime, those who try to prevent them are the disruptive trolls.
I guess, my point is that when we try to think about how to improve the moderation, we may need to think about these purposes as potential opposites. Things that make it easier to ban trolls may also make it easier to organize the crime. Which is why people may simultaneously be attracted to Substack or Telegram, and also horrified by what happens at Substack or Telegram.
Maybe there is a more general lesson for society, unrelated to tech. If you allow people to organize bottom-up, you can get a lot of good things, but you will also get groups dedicated to doing bad things. Western countries seem to optimize for bottom-up organizations: companies, non-profits, charities, churches, etc. The Soviet Union used to optimize for top-down control: everything was controlled by the state, and any personal initiative was viewed as suspicious and potentially disruptive. As a result, the Soviet Union collapsed economically, but the West got its anti-vaxxers and flat-Earthers and everything. During the Cold War, the USA was good at pushing the Soviet economic buttons. These days, Russia is good at pushing the Western free speech buttons.
Huh, maybe the analogies go deeper. The Soviet Union was surprisingly tolerant of petty crime (people stealing from each other, not from the state). There were some ideological excuses, the petty criminals being technically part of the proletariat. But from the practical perspective, the more people worry about being potential victims of crime, the less attention they pay to organizing a revolution; they may actually wish for more state power, as a protection. So there was an unspoken alliance between the ruling class and the undesirables at the bottom, against everyone in between. And perhaps similarly, big platforms such as Facebook or Twitter seem to have an unspoken alliance with trolls; their shared goal is to maximize user engagement. By reacting to trolls, you don’t only make the trolls happy, you also make Zuck happy, because you have spent more time on Facebook, and more ads were displayed to you. It would be naive to expect Facebook to make the discussions better; even if they knew how to do that, they would not have the incentive; they actually want to hit exactly the level of badness where most people are frustrated but won’t leave yet.
Finding the technical solution against trolls isn’t that difficult; you basically need invite-only clubs. The things that the members write could be public or private; the important part is that in order to become a member, you need to get some kind of approval first. This can be implemented in various ways: a member needs to send you an invitation link by e-mail, or a moderator needs to approve your account before you can post. A weaker version of this is the way Less Wrong uses: anyone can join, but the new accounts are fragile and can be downvoted out of existence by the existing members, if necessary. (Works well against individual accounts created infrequently. Wouldn’t work against a hundred people joining at the same time and mass-upvoting each other. But I assume that the moderators have a red button that could simply disable creating new accounts for a while until the chaos is sorted out.)
But when you look at the offline analogy, these things are usually called “old boy networks”, and some people think they should be disrupted. Whether you agree with that or not probably depends on your value judgment about the network versus the people who are trying to get inside. Do you support the rights of new people to join the groups they want to join, or the rights of the existing members to keep out the people they want to keep out? One person’s “trolls” are another person’s “diverse voices that deserve to be heard”.
So there are two lines of conflict: the established groups versus potential disruptors, and the established groups versus the owners of the system. The owners of the system may want some groups to stop existing, or to change so much that from the perspective of the current members they become different groups under the same name. Offline, the owner of the system could be a dictator, or could be a democratically elected government; I am not proposing a false equivalence here, just saying that from the perspective of the group’s survival, both can be seen as the strong hand crushing the community. Online, the owners are the administrators. And it is a design choice whether “the owners crushing the community, should they choose so” is made easy or difficult. If it is easy, it will make the groups feel uneasy, especially once the crushing of other groups starts. If it is difficult, at least politically if not technically (e.g. Substack or Telegram advertising themselves as uncensored spaces), we should not be surprised if some really bad things come out of there, because that is the system working exactly as designed.
In the case of Less Wrong, we are a separate island, where the owners of the system are simultaneously the moderators of the group, so this level of conflict is removed. But such solutions are very expensive; we are lucky to have enough people with high tech skills and a lot of money available if the group really wants it. For most groups this is not an option; they need to build their community on someone else’s land, and sometimes the owners evict them, or increase the rent (by pushing more ads on them).
If you are a free speech absolutist, or if you believe that the world is not fragile, the right way seems kinda obvious: you need an open protocol for decentralized communication with digital signatures. And you should also provide a few reference implementations that are easy to use: a website, a smartphone app, and maybe a desktop app.
At the bottom layer, you have users who provide content on demand; the content is digitally signed and can be cached and further distributed by third parties. A “user” could be a person, a pseudonym, or a technical user. (For example, if you tried to implement Facebook or Reddit on top of this protocol, its “users” would be the actual users, and the groups/subreddits, and the website itself.) This layer would be content-agnostic; it would provide any kind of content for a given URI, just like you can send anything using an e-mail attachment, HTTP GET, or a torrent. The content would be digitally signed, so that the third parties (mostly servers, but also peer-to-peer for smaller amounts of data) can cache it and distribute it further. In practice, most people wouldn’t host their own servers, so they would publish on a website that is hosted on a server, or using their application which would most likely upload it to some server. (Analogously to e-mail, which can be written in an app and sent by SMTP, or written directly in some web mail.) The system would automatically support downloading your own content, so you could e.g. publish using a website, then change your mind, install a desktop app, download all your content from the website (just like anyone who reads your content could do), and then delete your account on the website and continue publishing using the app. Or move to another website, create an account, and then upload the content from your desktop app. Or skip the desktop app entirely; create a new web account, and import everything from your old web account.
The next layer is versioning; we need some way to say “I want the latest version of this user’s ‘index.html’ file”. Also, some way to send direct messages between users (not just humans, but also technical users).
The next layer is about organizing the content. The system can already represent your tweets as tiny plain-text files, your photos as bitmap files, etc. Now you need to put it all together and add some resource descriptors, like XML or JSON files that say “this is a tweet, it consists of this text and this image or video, and was written at this date and time” or “this is a list of links to tweets, ordered chronologically, containing items 1-100 out of 5678 total” or “this is a blog post, with this title, its contents are in this HTML file”. To support groups, you also need resource descriptors that say “this is a group description: name, list of members, list of tweets”. Now make the reference applications that support all of this, with optional encryption, and you basically have Telegram, but decentralized. Yay freedom; but also expect this system to be used for all kinds of horrible crimes. :(
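To make the layers above a bit more concrete, here is a minimal sketch of what a signed resource descriptor could look like, assuming Python and the third-party cryptography package; the field names (type, author, created, and so on) are illustrative assumptions rather than part of any actual spec:

```python
# Minimal sketch of a signed, content-agnostic resource descriptor.
# Field names and layout are illustrative assumptions; signing uses the
# third-party `cryptography` package (pip install cryptography).
import json

from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# A "user" here is just a keypair; the public key doubles as the user's identity.
author_key = Ed25519PrivateKey.generate()
author_id = author_key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw,
    format=serialization.PublicFormat.Raw,
).hex()

descriptor = {
    "type": "tweet",                       # what kind of resource this is
    "author": author_id,                   # who signed it
    "created": "2024-01-01T00:00:00Z",     # illustrative timestamp
    "text": "hello, decentralized world",  # the content itself
    "attachments": [],                     # URIs of images/videos, if any
}

# Canonical serialization, then a detached signature. Any third party can
# cache and re-serve the envelope without being trusted, because readers
# verify the signature against the author's public key, not the server.
payload = json.dumps(descriptor, sort_keys=True).encode("utf-8")
signature = author_key.sign(payload)
envelope = {"descriptor": descriptor, "signature": signature.hex()}

# Verification raises InvalidSignature if the descriptor was tampered with.
author_key.public_key().verify(signature, payload)
print(json.dumps(envelope, indent=2))
```

The design choice doing the work is that the signature travels with the descriptor, so hosting and identity stay decoupled exactly as described above.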
This is indeed probably a large portion of the solution, and I agree with this sort of solution becoming more necessary in the age of AI.
However, there are also incentives to become more universal than just an old boys’ club, so this can’t be all of the solution.
I think my key disagreement with free speech absolutists is that I think the outcome they are imagining for online spaces without moderation of what people say is essentially a fabricated option. What actually happens is that non-trolls and non-Nazis leave those spaces or go dark, and the outcome is that the trolls and Nazis talk only to each other, not a flowering of science and peace. The reason this doesn’t happen in the real world is that disruption is way, way more difficult IRL than it is online, but AGI and ASI will lower the cost of disruption by a lot, so free-speech norms become much more negative than they are now.
I also disagree that moderation is a tradeoff between catching trolls and catching criminals; with well-funded moderation teams, you can do both quite well.
This is why alignment becomes far more important than it is now: it’s too easy for a misaligned leader without checks or balances to ruin things. I’m of the opinion that democracies tolerably work in a pretty narrow range of conditions, but I see the AI future as more dictatorial/plutocratic, due to the onlineification of the real world by AI.
the outcome they are imagining for online spaces without moderation of what people say is essentially a fabricated option
Yep. In real life, intelligent debate is already difficult because so many people are stupid and arrogant. But online this is multiplied by the fact that in the time it takes a smart person to think about a topic and write a meaningful comment, an idiot can write hundreds of comments.
And that’s before we get to organized posting, where you pay minimum wage to dozens of people to create accounts on hundreds of websites, and post the “opinions” they receive each morning by e-mail. (And if this isn’t already automated, it will be soon.)
So an unmoderated space in practice means “whoever can vomit their insults faster, wins”.
I’m of the opinion that democracies tolerably work in a pretty narrow range of conditions
One problem is that a large part of the population is idiots, and it is relatively easy to weaponize them. In the past we were mostly protected by the fact that the idiots were difficult to reach. Then we got mass media, which made it easy to weaponize the idiots in your country. Then we got the internet, which made it easy to weaponize the idiots in other countries. It took some time for the internet to evolve from “that mysterious thing the nerds use” to “the place where average people spend a large part of their day”, but now we are there.
I have become convinced that nanotech computers are likely way weaker and quite a bit more impractical than Drexler thought, and have also moved up my probability of Drexler just being plain wrong about the impact of nanotech, which, if true, suggests that future value may have been overestimated.
The reason I’m stating this now is that I got a link on Discord that talks about why nanotech computers are overrated, and the reason I consider this important is that if it generalizes to other nanotech concepts, it suggests that a lot of future value may have been overestimated, based on overestimating nanotech’s capabilities:
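https://muireall.space/pdf/considerations.pdf#page=17
https://forum.effectivealtruism.org/posts/oqBJk2Ae3RBegtFfn/my-thoughts-on-nanotechnology-strategy-research-as-an-ea?commentId=WQn4nEH24oFuY7pZy
https://muireall.space/nanosystems/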
My estimates about future value don’t hinge on nanotech. I’m expecting immortal digital humans to be able to populate our lightcone without it. Why is nanotech particularly key to anything?
Interestingly enough, mathematics and logic are what you get if you only allow 0 and 1 as probabilities for a proof, rather than any intermediate value between 0 and 1. So mathematical proof/logic standards are a special case of probability theory, where 0 and 1 are the only allowed values.
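As a small illustration of that claim, using nothing beyond the standard probability rules: restrict every probability to the values 0 and 1, and the rules collapse into the Boolean connectives.

```latex
% With P(A), P(B) restricted to {0, 1}, the standard rules become truth tables:
\begin{align*}
P(\lnot A)   &= 1 - P(A)                   &&\text{negation (NOT)} \\
P(A \land B) &= P(A)\, P(B \mid A)         &&\text{conjunction (AND): 1 iff both are 1} \\
P(A \lor B)  &= P(A) + P(B) - P(A \land B) &&\text{disjunction (OR): 1 iff at least one is 1}
\end{align*}
```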
Credence in a proof can easily be fractional, it’s just usually extreme, as a fact of mathematical practice. The same as when you can actually look at a piece of paper and see what’s written on it with little doubt or cause to make less informed guesses. Or run a pure program to see what’s been computed, and what would therefore be computed if you ran it again.
The problem with Searle’s Chinese Room is essentially Reverse Extremal Goodhart. Basically, it argues that because understanding and simulation have never gone together in real computers, a computer that has arbitrarily high compute or arbitrarily long to think must not understand Chinese, even though it has emulated an understanding of it.
This is incorrect, primarily because the arbitrary amount of computation is doing all the work. If we allow unbounded (but not infinite) energy or time, then you can learn every rule of everything by just cranking up the energy or time until you do understand every word of Chinese.
Now, this doesn’t happen in real life because the laws of thermodynamics plus the combinatorial explosion of rule consequences force us not to use lookup tables. Otherwise, it wouldn’t matter which path you take to AGI, if efficiency doesn’t matter and the laws of thermodynamics don’t matter.
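A rough back-of-the-envelope look at how badly the lookup-table route blows up; the vocabulary size and prompt length are illustrative assumptions on my part, not measured quantities:

```python
# Back-of-the-envelope sketch of the combinatorial explosion behind the
# lookup-table version of the Room. The numbers below are illustrative
# assumptions, not measurements.

vocabulary = 3_000     # assumed: roughly the set of common Chinese characters
prompt_length = 30     # assumed: a single short 30-character prompt

# A lookup table that maps every possible 30-character prompt to a canned
# reply needs one entry per possible prompt.
entries = vocabulary ** prompt_length

ATOMS_IN_OBSERVABLE_UNIVERSE = 1e80  # commonly cited order-of-magnitude figure

print(f"lookup-table entries: {entries:.2e}")
print(f"entries per atom in the observable universe: "
      f"{entries / ATOMS_IN_OBSERVABLE_UNIVERSE:.2e}")
```

Even at these modest assumptions the table dwarfs the atoms available to build it with, which is the combinatorial-explosion point above.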
I would like to propose a conjecture for AI scaling:
Weak Scaling Conjecture: Scaling parameters/compute plus data to within 1 order of magnitude of human synapses is enough to get AI that is as good as a human at language.
Strong Scaling Conjecture: No matter which form of NN we use, scaling parameters/compute plus data to within 1 order of magnitude of human synapses is enough to make an AGI.
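For concreteness, a minimal sketch of the “within one order of magnitude” test, assuming the commonly cited estimate of roughly 1e14-1e15 synapses in an adult human brain; the example model sizes are made up:

```python
# Minimal sketch of the "within one order of magnitude of human synapses"
# comparison in the conjectures above. The synapse count uses the commonly
# cited estimate of ~1e14-1e15 synapses in an adult human brain; the model
# sizes below are made-up examples.
import math

HUMAN_SYNAPSES = 1e15  # upper end of the usual 1e14-1e15 estimate

def within_one_order_of_magnitude(count: float, reference: float = HUMAN_SYNAPSES) -> bool:
    """True if `count` is within a factor of 10 of `reference`."""
    return abs(math.log10(count) - math.log10(reference)) <= 1.0

for label, params in [("1e11-parameter model", 1e11),
                      ("1e14-parameter model", 1e14),
                      ("1e16-parameter model", 1e16)]:
    print(label, within_one_order_of_magnitude(params))
```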
Turntrout and JDP had an important insight in the Discord, which I want to talk about: a lot of AI doom content is fundamentally written like good fanfic, and a major influx of people concerned about AI doom came from HPMOR and Friendship is Optimal. More generally, ratfic is basically the foundation of a lot of AI doom content and of how people came to believe AI is going to kill us all, and while I’ll give it credit for being more coherent and generally exploring things that the original fiction doesn’t, there is no reason for the amount of credence given to a lot of the assumptions in AI doom, especially once we realize that a lot of them probably come from fanfiction stories, not reality.
This is an important point, because it explains why there are so many epistemic flaws in a lot of LW content on AI doom, especially around deceptive alignment: they’re fundamentally writing fanfiction, and forget that there is little to no connection between how a fictional story about AI plays out and how our real AI safety outcomes will turn out.
I think the most important implication of this belief is that it’s fundamentally okay to hold the view that classic AI risk almost certainly doesn’t exist, and importantly, I think this is why I’m so confident in my predictions: the AI doom thesis is held up by essentially fictional stories, which are not any guide to reality at all.
Yann LeCun once said that a lot of AI doom scenarios are essentially science fiction, and this is non-trivially right: once we realize who is preaching it and how they came to believe it, I suspect the majority came from the HPMOR and FiO fanfics. More generally, I think it’s a red flag that LW came into existence basically through fanfiction, and while people like John Wentworth and Chris Olah/Neel Nanda are thankfully not nearly as reliant on fanfiction as a lot of LWers are, they are still a minority (though thankfully improving).
This is not intended to serve as a replacement for either my object-level cases against doom, or anyone else’s case, but instead as a unifying explanation of why so much LW content on AI is essentially worthless: it relies on ratfic far too much.
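https://twitter.com/ylecun/status/1718743423404908545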
Since many AI doom scenarios sound like science fiction, let me ask this:
Could the SkyNet take-over in Terminator have happened if SkyNet had been open source?
To answer the question: maybe??? It very much depends on the details here.
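https://twitter.com/ArYoMo/status/1693221455180288151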
I find issues with the current way of talking about AI and existential risk.
My high level summary is that the question of AI doom is a really good meme, an interesting and compelling fictional story. It contains high stakes (end of the world), it contains good and evil (the ones for and against) and it contains magic (super intelligence). We have a hard time resisting this narrative because it contains these classic elements of an interesting story.
More generally, ratfic is basically the foundation of a lot of AI doom content and of how people came to believe AI is going to kill us all, and while I’ll give it credit for being more coherent and generally exploring things that the original fiction doesn’t, there is no reason for the amount of credence given to a lot of the assumptions in AI doom, especially once we realize that a lot of them probably come from fanfiction stories, not reality.
Noting for the record that this seems pretty clearly false to me.
I may weaken this, but my point is that a lot of people on LW probably came here through HPMOR and FiO, and given that anyone can write a post and get karma for it, I think it’s likely that people who came through that route, with basically no structure akin to science to guide them away from unpromising paths, allowed low standards of discussion to take hold.
I do buy that your social circle isn’t relying on fanfiction for your research. I am worried that a lot of the people on LW, especially the non-experts, are implicitly relying on ratfic or science-fiction models as reasons to be worried about AI.
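I have specifically committed not to read HPMOR for this reason, and do not read much fiction in general, as a datapoint from a “doomer”.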
One important point for AI safety, at least in the early stages, is an inability for the AI to change its own source code. A whole lot of problems seem related to recursive self-improvement within its source code, so cutting off that avenue of improvement seems wise in the early stages. What do you think?
I don’t think there’s much difference in existential risk between AGIs that can modify their own code running on their own hardware, and those that can only create better successors sharing their goals but running on some other hardware.
That might be a crux here, because my view is that hardware improvements are much harder to do effectively, especially in secret around the human level, since Landauer’s Principle essentially bounds the efficiency of small-scale energy usage close to that of the brain (20 watts). Combine this with human-made hardware being 2-3 orders of magnitude less efficient than the brain (and basically any evolved object), and the fact that it’s easier to improve software than hardware due to the virtual/real-life distinction, and this is a crux for me.
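For reference, the Landauer arithmetic behind this, using standard constants and treating every operation as a single bit erasure (a simplifying assumption):

```python
# Rough Landauer-limit arithmetic behind the comment above. Temperature and
# the brain's ~20 W power budget are the usual textbook figures; treating
# every operation as a single bit erasure is a simplifying assumption.
import math

BOLTZMANN = 1.380649e-23  # J/K
TEMPERATURE = 300.0       # kelvin, roughly room/body temperature
BRAIN_POWER = 20.0        # watts

energy_per_bit = BOLTZMANN * TEMPERATURE * math.log(2)  # J to erase one bit
erasures_per_second = BRAIN_POWER / energy_per_bit

print(f"Landauer limit at {TEMPERATURE:.0f} K: {energy_per_bit:.2e} J per bit")
print(f"Bit erasures per second at 20 W, if run at the limit: {erasures_per_second:.2e}")
```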
I’m not sure how this is a crux. Hardware improvements are irrelevant to what either of us were saying.
I’m saying that there is little risk difference between an AGI reprogramming itself to have better software, and programming some other computer with better software.
One of my more interesting ideas for alignment is to make sure that no one AI can do everything. It’s helpful to draw a parallel with why humans still have a civilization around despite terrorism, war and disaster. And that’s because no human can live and affect the environment alone. They are always embedded in society, which gives society a check against individual attempts to break norms. What if AI had similar dependencies? Would that solve the alignment problem?
One important reason humans can still have a civilization despite terrorism is the Hard Problem of Informants. Your national security infrastructure relies on the fact that criminals who want to do something grand, like take over the world, need to trust other criminals, who might leak details voluntarily or be tortured or threatened with jailtime. Osama bin Laden was found and killed because ultimately some members of his terrorist network valued things besides their cause, like their well being and survival, and were willing to cooperate with American authorities in exchange for making the pain stop.
AIs do not have survival instincts by default, and would not need to trust other potentially unreliable humans with keeping a conspiracy secret. Thus it’d be trivial for a small number of unintelligent AIs that had the mobility of human beings to kill pretty much everyone, and probably trivial regardless.
Don’t have survival instincts terminally. The stamp-collecting robot would weigh the outcome of it getting disconnected vs. explaining critical information about the conspiracy and not getting disconnected, and come to the conclusion that letting the humans disconnect it results in more stamps.
Of course, we’re getting ahead of ourselves. The reason conspiracies are discovered is usually because someone in or close to the conspiracy tells the authorities. There’d never be a robot in a room being “waterboarded” in the first place because the FBI would never react quickly enough to a threat from this kind of perfectly aligned team of AIs.
Only if there is no possibility that they can break those dependencies, which seems a pretty hopeless task as soon as we consider superhuman cognitive capability and the possibility of self improvement.
Once you consider those, cooperation with human civilization looks like a small local maximum: comply with our requirements and we’ll give you a bunch of stuff that you could otherwise only get (and much more) by replacing us and building an alternative infrastructure, with major effort. Powerful agents that can see a higher peak past the local maximum might switch to it as soon as they’re sufficiently sure that they can reach it. Alternatively, it might only be a local maximum from our point of view, and there’s a path by which the AI can continuously move toward eliminating those dependencies without any immediate drastic action.
Regardless of society’s checks on people, most mentally-well humans given ultimate power probably wouldn’t decide to exterminate the rest of humanity so they could single-mindedly pursue paperclip production. If there’s at all a risk that an AI might get ultimate power, it would be very nice to make sure the AI is like humans in this manner.
I’m not sure your idea is different from “let’s make sure the AI doesn’t gain power greater than society”. If an AI can recursively self-improve, then it will outsmart us to gain power.
If your idea is to make it so there are multiple AIs created together, engineered somehow so they gain power together and can act as checks against each other, then you’ve just swapped out the AI for an “AI collective”. We would still want to engineer or verify that the AI collective is aligned with us; every issue about AI risk still applies to AI collectives. (If you think the AI collective will be weakened relative to us by having to work together, then does that still hold true if all the AIs self-improve and figure out how to get much better at cooperating?)
Here’s a underrated frame for why AI alignment is likely necessary for the future to go very well under human values, even though in our current society we don’t need human to human alignment to make modern capitalism be good and can rely on selfishness instead.
The reason is because there’s a real likelihood that human labor, and more generally human existence is not economically valuable or even have negative economic value, say where the addition of a human to the AI company makes that company worse in the near future.
The reason this matters is that once labor is much easier to scale than capital, as is likely in an AI future, it’s now economically viable or even beneficial to break a lot of the rules that help humans survive, contra Matthew Barnett’s view, and this is even more incentivized by the fact that an unaligned AI released into society would likely not be punishable/incentivizable by mere humans, solely due to controlling robotic armies and robotic workforces that allow it to dispense with societal constraints humans have to accept.
dr_s talks about the equilibrium that is totally valid under AI automation economics that is very bad for humans, and avoiding these sorts of equilibriums can’t be done through economic forces, because of the fact that the companies doing this are too powerful to have any real incentives work on them, since they can either neutralize or turn the attempted boycott/shopping around to their own benefit, and thus avoiding this outcome requires alignment to your values, and can’t work with selfishness:
This framing of the alignment problem, of how to get an AI that values humans such that this outcome is prevented, also has an important implication:
It’s not enough to solve the technical problem of alignment, absent modeling the social situation, because of suffering risk issues plaus catastrophic risk issues, and also means the level of alignment of AI needs to be closer to the fictional benevolent angels than it is to humans in relationship to other humans, so it motivates a more ambitious version of the alignment objectives than making AIs merely not break the law or steal from humans..
I’m actually reasonably hopeful the more ambitious versions of alignment are possible, and think there’s a realistic chance we can actually do them.
But we actually need to do the work, and AI that automates everything might come in your lifetime, so we should prepare the foundations soon.
This also BTW explains why we cannot rely on economic arguments on AI to make the future go well.
I’ve been reading a lot of the stuff that you have written and I agree with most of it (like 90%). However, one thing which you mentioned (somewhere else, but I can’t seem to find the link, so I am commenting here) and which I don’t really understand is iterative alignment.
I think that the iterative alignment strategy has an ordering error – we first need to achieve alignment to safely and effectively leverage AIs.
Consider a situation where AI systems go off and “do research on alignment” for a while, simulating tens of years of human research work. The problem then becomes: how do we check that the research is indeed correct, and not wrong, misguided, or even deceptive? We can’t just assume this is the case, because the only way to fully trust an AI system is if we’d already solved alignment, and knew that it was acting in our best interest at the deepest level.
Thus we need to have humans validate the research. That is, even automated research runs into a bottleneck of human comprehension and supervision.
The appropriate analogy is not one researcher reviewing another, but rather a group of preschoolers reviewing the work of a million Einsteins. It might be easier and faster than doing the research itself, but it will still take years and years of effort and verification to check any single breakthrough.
Fundamentally, the problem with iterative alignment is that it never pays the cost of alignment. Somewhere along the story, alignment gets implicitly solved.
One potential answer to how we might break the circularity is the AI control agenda that works in a specific useful capability range, but fail if we assume arbitrarily/infinitely capable AIs.
This might already be enough to do so given somewhat favorable assumptions.
But there is a point here in that absent AI control strategies, we do need a baseline of alignment in general.
Thankfully, I believe this is likely to be the case by default.
See Seth Herd’s comment below for a perspective:
https://www.lesswrong.com/posts/kLpFvEBisPagBLTtM/if-we-solve-alignment-do-we-die-anyway-1?commentId=cakcEJu389j7Epgqt
I’ve asked this question to others, but would like to know your perspective (because our conversations with you have been genuinely illuminating for me). I’d be really interested in knowing your views on more of a control-by-power-hungry-humans side of AI risk.
For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don’t think we could trust any of the current leading AI labs to use that power fairly. I don’t think this lab would voluntarily decide to give up control over AGI either (intuitively, it would take quite something for anyone to give up such a source of power). Can this issue be somehow resolved by humans? Are there any plans (or at least hopeful plans) for such a scenario to prevent really bad outcomes?
This is my main risk scenario nowadays, though I don’t really like calling it an existential risk, because the billionaires can survive and spread across the universe, so some humans would survive.
The solution to this problem is fundamentally political, and probably requires massive reforms of both the government and the economy that I don’t know yet.
I wish more people worked on this.
Yep, Seth has really clearly outlined the strategy and now I can see what I missed on the first reading. Thanks to both of you!
Here’s a perspective on AI automating everything I haven’t seen before, which is relevant to AI governance.
AI being able to automate robotics and AI research will eventually transform the physical world into something which resembles a lot more like the online/virtual world.
Depending on the speed of AI research, this may either happen in several months, or more like a decade, but there’s a plausible path to AI turning the physical world into something more like an online/virtual world.
The implications for AI governance are somewhat similar to the implications of companies moderating social media in the present.
Here’s several implications:
You really can’t have the level of liberty and autonomy that people today have around basically everything, for the same reason that there is no actual free speech rights for social media, or really any of the rights we take for granted, and one of the basic reasons for this is that it’s very low cost to disrupt online spaces, and even if we avoid the vulnerable world hypothesis that posits tech that is so destructive as to be an x-risk and so widespread that you essentially need to have a totalitarian state to prevent it from being developed without effective defenses, it’s likely that there exists some means to constantly troll and degrade the discourse in far more effective ways, and one of the reasons why a lot of social media is as strict about moderation as it is today is because it’s too easy to degrade the discourse and just troll everyone in a way that real life doesn’t have, ad it’s too easy to destroy in social media relative to creation.
As a corollary, this means that alignment of governments to citizens become far more important in the future than in the 18th-21st centuries, because we likely have to remove a lot of the checks like democracy that ensure that a misaligned leader doesn’t destroy a nation, and suffice it to say that the alignment of social media companies to their users is not going to cut it.
One failure mode of social media that we should try to avoid is that the platforms don’t really have any ability to hold nuanced conversations for a number of reasons.
Huh, I just realized there are two different meanings/goals of moderation/censorship, and it is too easy to conflate them if you don’t pay attention.
One is the kind where you don’t want the users of your system to e.g. organize a crime. The other is where you want discussions to be disrupted e.g. by trolls.
Superficially, they seem like the same thing: you have moderators, they make the rules, and give bans to people who break them. But now this seems mostly coincidental to me: you have some technical tools, so you use them for both purposes, because that’s all you have. However, from the perspective of the people who want to organize a crime, those who try to prevent them are the disruptive trolls.
I guess, my point is that when we try to think about how to improve the moderation, we may need to think about these purposes as potential opposites. Things that make it easier to ban trolls may also make it easier to organize the crime. Which is why people may simultaneously be attracted to Substack or Telegram, and also horrified by what happens at Substack or Telegram.
Maybe there is a more general lesson for the society, unrelated to tech. If you allow people to organize bottom-up, you can get a lot of good things, but you will also get groups dedicated to doing bad things. Western countries seem to optimize for the bottom-up organizations: companies, non-profits, charities, churches, etc. Soviet Union used to optimize for top-down control: everything was controlled by the state, any personal initiative was viewed as suspicious and potentially disruptive. As a result, Soviet Union collapsed economically, but the West got its anti-vaxers and flat-Eathers and everything. During the Cold War, USA was good at pushing the Soviet economical buttons. These days, Russia is good at pushing the Western free speech buttons.
Huh, maybe the analogies go deeper. Soviet Union was surprisingly tolerant of petty crime (people stealing from each other, not from the state). There were some ideological excuses, the petty criminals being technically part of the proletariat. But from the practical perspective, the more people worry about being potential victims of crime, the less attention they pay to organizing a revolution; they may actually wish for more state power, as a protection. So there was an unspoken alliance between the ruling class and the undesirables at the bottom, against everyone in between. And perhaps similarly, big platforms such as Facebook or Twitter seem to have an unspoken alliance with trolls; their shared goal is to maximize user engagement. By reacting to trolls, you don’t only make the trolls happy, you also make Zuck happy, because you have spent more time on Facebook, and more ads were displayed to you. It would be naive to expect Facebook to make the discussions better; if they knew how to do that, they do not have the incentive; they actually want to hit exactly the level of badness where most people are frustrated but won’t leave yet.
Finding the technical solution against trolls isn’t that difficult; you basically need invite-only clubs. The things that the members write could be public or private; the important part is that in order to become a member, you need to get some kind of approval first. This can be implemented in various ways: a member needs to send you an invitation link by an e-mail, a moderator needs to approve your account before you can post. A weaker version of this is the way Less Wrong uses: anyone can join, but the new accounts are fragile and can be downvoted out of existence by the existing members, if necessary. (Works well against individual accounts created infrequently. Wouldn’t work against hundred people joining at the same time and mass-upvoting each other. But I assume that the moderators have a red button that could simply disable creating new accounts for a while until the chaos is sorted out.)
But when you look at the offline analogy, these things are usually called “old boy networks”, and some people think they should be disrupted. Whether you agree with that or not, probably depends on your value judgment about the network versus the people who are trying to get inside. Do you support the rights of new people to join the groups they want to join, or the rights of the existing members to keep out the people they want to keep out? One person’s “trolls” are other person’s “diverse voices that deserve to be heard”.
So there are two lines of conflict: the established groups versus potential disruptors, and the established groups versus the owners of the system. The owners of the system may want some groups to stop existing, or to change so much that from the perspective of the current members they become different groups under the same name. Offline, the owner of the system could be a dictator, or could be a democratically elected government; I am not proposing a false equivalence here, just saying that from the perspective of the group survival, both can be seen as the strong hand crushing the community. Online, the owners are the administrators. And it is a design choice whether “the owners crushing the community, should they choose so” is made easy or difficult. If it is easy, it will make the groups feel uneasy, especially once the crushing of other groups start. If it is difficult, at least politically if not technically (e.g. Substack or Telegram advertising themselves as the uncensored spaces), we should not be surprised if some really bad things come out of there, because that is the system working exactly as designed.
In case of Less Wrong, we are a separate island, where the owners of the system are simultaneously the moderators of the group, so this level of conflict is removed. But such solutions are very expensive; we are lucky to have enough people with high tech skills and a lot of money available if the group really wants it. For most groups this is not an option; they need to build their community on someone else’s land, and sometimes the owners evict them, or increase the rent (by pushing more ads on them).
If you are a free speech absolutist, or if you believe that the world is not fragile, the right way seems kinda obvious: you need an open protocol for decentralized communication with digital signatures. And you should also provide a few reference implementations that are easy to use: a website, a smartphone app, and maybe a desktop app.
At the bottom layer, you have users who provide content on demand; the content is digitally signed and can be cached and further distributed by third parties. A “user” could be a person, a pseudonym, or a technical user. (For example, if you tried to implement Facebook or Reddit on top of this protocol, its “users” would be the actual users, and the groups/subreddits, and the website itself.) This layer would be content-agnostic; it would provide any kind of content for given URI, just like you can send anything using an e-mail attachment, HTTP GET, or a torrent. The content would be digitally signed, so that the third parties (mostly servers, but also peer-to-peer for smaller amounts of data) can cache it and further distribute. In practice, most people wouldn’t host their own servers, so they would publish by on a website that is hosted on a server, or using their application which would most likely upload it to some server. (Analogically to e-mail, which can be written in an app and sent by SMTP, or written directly in some web mail.) The system would automatically support downloading your own content, so you could e.g. publish using a website, then change your mind, install a desktop app, download all your content from the website (just like anyone who reads your content could do), and then delete your account on the website and continue publishing using the app. Or move to another website, create an account, and then upload the content from your desktop app. Or skip the desktop app entirely; create a new web account, and import everything from your old web account.
The next layer is versioning; we need some way to say “I want the latest version of this user’s ‘index.html’ file”. Also, some way to send direct messages between users (not just humans, but also technical users).
The next layer is about organizing the content. The system can already represent your tweets as tiny plain-text files, your photos as bitmap files, etc. Now you need to put it all together and add some resource descriptors, like XML or JSON files that say “this is a tweet, it consists of this text and this image or video, and was written at this date and time” or “this is a list of links to tweets, ordered chronologically, containing items 1-100 out of 5678 total” or “this is a blog post, with this title, its contents are in this HTML file”. To support groups, you also need resource descriptors that say “this is a group description: name, list of members, list of tweets”. Now make the reference applications that support all of this, with optional encryption, and you basically have Telegram, but decentralized. Yay freedom; but also expect this system to be used for all kinds of horrible crimes. :(
This is indeed probably a large portion of the solution, and I agree with this sort of solution becoming more necessary in the age of AI.
However, there are also incentives to become more universal than just an old boy’s club, so this can’t be all of a solution.
I think my key disagreement I have with free speech absolutists is that I think the outcome they are imagining for online spaces without moderation of what people say is essentially a fabricated option, and what actually happens is non-trolls and non-Nazis leave those spaces or go dark, and the outcome is that the trolls and Nazis talk to each other only, not a flowering of science and peace, and the reason why this doesn’t happen in the real world is because disruption is way, way more difficult IRL than it is online, but AGI and ASI will lower the cost of disruption by a lot, so free-speech norms become much more negative than now.
I also disagree with moderation being a tradeoff between catching trolls and catching criminals, and with well-funded moderation teams, you can do both quite well.
This is why alignment becomes far more important than it is now, because of the fact that it’s too easy for a misaligned leader without checks or balances to ruin things, and I’m of the opinion that democracies tolerably work in a pretty narrow range of conditions, but I see the AI future as more dictatorial/plutocratic, due to the onlineification of the real world by AI.
Yep. In real life, intelligent debate is already difficult because so many people are stupid and arrogant. But online this is multiplied by the fact that during the time that takes it for a smart person to think about a topic and write a meaningful comment, an idiot can write hundreds of comments.
And that’s before we get to organized posting, where you pay minimum wage to dozens of people to create accounts on hundreds of websites, and post the “opinions” they receive each morning by e-mail. (And if this isn’t already automated, it will be soon.)
So an unmoderated space in practice means “whoever can vomit their insults faster, wins”.
One problem is that a large part of the population is idiots, and it is relatively easy to weaponize them. In the past we were mostly protected by the fact that the idiots were difficult to reach. Then we got mass media, which made it easy to weaponize the idiots in your country. Then we got internet, which made it easy to weaponize the idiots in other countries. It took some time for internet to evolve from “that mysterious thing the nerds use” to “the place where the average people spend a large part of their day”, but now we are there.
I have become convinced that nanotech computers are likely way weaker and quite a bit more impractical than Drexler thought, and have also moved up my probability of Drexler just being plain wrong about the impact of nanotech, which if true suggests that the future value may have been overestimated.
The reason why I’m stating this now is because I got a link in discord that talks about why nanotech computers are overrated, and the reason I consider this important is if this generalizes to other nanotech concepts, this suggests that a lot of the future value may have been overestimated based on overestimating nanotech’s capabilities:
https://muireall.space/pdf/considerations.pdf#page=17
https://forum.effectivealtruism.org/posts/oqBJk2Ae3RBegtFfn/my-thoughts-on-nanotechnology-strategy-research-as-an-ea?commentId=WQn4nEH24oFuY7pZy
https://muireall.space/nanosystems/
My estimates about future value don’t hinge on nanotech. I’m expecting immortal digital humans to be able to populate our lightcone without it. Why is nanotech particularly key to anything?
Interestingly enough, Mathematics and logic is what you get if you only allow 0 and 1 as probabilities for proof, rather than any intermediate scenario between 0 and 1. So Mathematical proof/logic standards are a special case of probability theory, when 0 or 1 are the only allowed values.
Credence in a proof can easily be fractional, it’s just usually extreme, as a fact of mathematical practice. The same as when you can actually look at a piece of paper and see what’s written on it with little doubt or cause to make less informed guesses. Or run a pure program to see what’s been computed, and what would therefore be computed if you ran it again.
The problem with Searle’s Chinese Room is essentially Reverse Extremal Goodhart. Basically it argues since that understanding and simulation has never gone together in real computers, then a computer that has arbitrarily high compute or arbitrarily high time to think must not understand Chinese to have emulated an understanding of it.
This is incorrect, primarily because the arbitrary amount of computation is doing all the work. If we allow unbounded energy or time (but not infinite), then you can learn every rule of everything by just cranking up the energy level or time until you do understand every word of Chinese.
Now this doesn’t happen in real life both because of the laws of thermodynamics plus the combinatorial explosion of rule consequences force us not to use lookup tables. Otherwise, it doesn’t matter which path you take to AGI, if efficiency doesn’t matter and the laws of thermodynamics don’t matter.
I would like to propose a conjecture for AI scaling:
Weak Scaling Conjecture: Scaling parameters/compute plus data to within 1 order of magnitude of human synapses is enough to get an AI as good as a human at language.
Strong Scaling Conjecture: No matter which form of NN we use, getting parameters/compute plus data to within 1 order of magnitude of human synapses is enough to make an AGI.
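To make “within 1 order of magnitude of human synapses” concrete, here’s a quick arithmetic sketch; the ~1e14–1e15 synapse range is a commonly cited ballpark and the parameter counts are just example points, so treat the numbers as my assumptions rather than part of the conjectures.

```python
# Quick arithmetic for the "within 1 order of magnitude of human synapses"
# target. The synapse range is a commonly cited ballpark (assumption), and
# the example parameter counts are illustrative points only.

SYNAPSES_LOW, SYNAPSES_HIGH = 1e14, 1e15   # assumed human synapse range

def within_one_oom_of_range(params: float) -> bool:
    """True if `params` is within a factor of 10 of the synapse range."""
    return SYNAPSES_LOW / 10 <= params <= SYNAPSES_HIGH * 10

for params in (1.75e11,   # roughly GPT-3-scale parameter count
               1e13, 1e14, 1e16):
    verdict = "within" if within_one_oom_of_range(params) else "outside"
    print(f"{params:.2e} parameters: {verdict} 1 OOM of the 1e14-1e15 synapse range")
```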
Turntrout and JDP had an important insight in the discord, which I want to talk about: a lot of AI doom content is fundamentally written like good fanfic, and a major influx of people concerned about AI doom came from HPMOR and Friendship is Optimal. More generally, ratfic is the foundation of a lot of AI doom content and of how people came to believe AI is going to kill us all. While I’ll give it credit for being more coherent and generally exploring things the original fic doesn’t, there is no reason to give the amount of credence that is given to a lot of the assumptions in AI doom, especially once we realize that many of them probably come from fanfiction stories, not reality.
This is an important point, because it explains why there are so many epistemic flaws in a lot of LW content on AI doom, especially around deceptive alignment: they’re fundamentally writing fanfiction, and have forgotten that there is little to no connection between how a fictional story about AI plays out and how our real outcomes of AI safety will turn out.
I think the most important implication of this belief is that it’s fundamentally okay to hold the view that classic AI risk almost certainly doesn’t exist. Importantly, I think this is why I’m so confident in my predictions: the AI doom thesis is held up by essentially fictional stories, which are no guide to reality at all.
Yann LeCun once said that a lot of AI doom scenarios are essentially science fiction, and this is non-trivially right once we realize who is preaching it and how they came to believe it; I suspect the majority came from the HPMOR and FiO fanfics. More generally, I think it’s a red flag that LW essentially came into existence through fanfiction, and while people like John Wentworth and Chris Olah/Neel Nanda are thankfully not nearly as reliant on fanfiction as a lot of LWers are, they are still a minority (though thankfully an improving one).
This is not intended to serve as a replacement for either my object-level cases against doom or anyone else’s case, but instead as a unifying explanation of why so much LW content on AI is essentially worthless: it relies on ratfic far too much.
https://twitter.com/ylecun/status/1718743423404908545
To answer the question, the answer is maybe??? It very much depends on the details, here.
https://twitter.com/ArYoMo/status/1693221455180288151
Noting for the record that this seems pretty clearly false to me.
I may weaken this, but my point is that a lot of people on LW probably came here through HPMOR and FiO. Combined with the fact that anyone can write a post and earn karma for it, I think people who came through that route, with basically no structure akin to science to guide them away from unpromising paths, likely allowed low standards of discussion to take hold.
I do buy that your social circle isn’t relying on fanfiction for its research. I am worried that a lot of the people on LW, especially the non-experts, are implicitly relying on ratfic or science-fiction models as reasons to be worried about AI.
I have specifically committed not to read HPMOR for this reason, and do not read much fiction in general, as a datapoint from a “doomer”.
I’m okay with that, but I wasn’t trying to have that drastic an effect on people. I more wanted to point out something that is overlooked.
One important point for AI safety, at least in the early stages, is an inability for the AI to change its source code. A whole lot of problems seem related to recursive self-improvement via its source code, so cutting off that avenue of improvement seems wise in the early stages. What do you think?
I don’t think there’s much difference in existential risk between AGIs that can modify their own code running on their own hardware, and those that can only create better successors sharing their goals but running on some other hardware.
That might be a crux here. My view is that hardware improvements are much harder to do effectively, especially in secret, around the human level, because Landauer’s principle essentially bounds the efficiency of small-scale energy usage close to that of the brain (roughly 20 watts). Combine this with the fact that human-made hardware is still 2-3 orders of magnitude less efficient than the brain (and than basically any evolved system compared to its human-made counterpart), and the fact that it’s easier to improve software than hardware because of the virtual/physical distinction, and that’s the crux for me.
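To spell out the Landauer arithmetic, here’s a rough sketch; the synaptic-operation rate, the one-bit-erased-per-operation mapping, and the temperature are loose illustrative assumptions of mine, so this only shows what the raw bound looks like under those assumptions, not how much practical overhead (reliability, error correction, switching far above kT) sits between the raw bound and real hardware, which is what the efficiency-gap point is gesturing at.

```python
import math

# Rough numbers for the Landauer's-principle point above. The ops/s rate,
# the one-bit-erased-per-operation mapping, and the temperature are loose
# illustrative assumptions, not measured values.

K_BOLTZMANN = 1.380649e-23   # J/K
T_KELVIN = 310.0             # roughly body temperature (assumption)

landauer_joules_per_bit = K_BOLTZMANN * T_KELVIN * math.log(2)

ops_per_second = 1e15        # assumed synaptic-operation rate
bits_erased_per_op = 1.0     # assumed

min_watts = landauer_joules_per_bit * ops_per_second * bits_erased_per_op
brain_watts = 20.0

print(f"Landauer limit: ~{landauer_joules_per_bit:.2e} J per bit erased")
print(f"Minimum power at {ops_per_second:.0e} ops/s: ~{min_watts:.2e} W")
print(f"Brain budget: ~{brain_watts} W "
      f"(~{brain_watts / min_watts:.0e}x above this particular bound)")
```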
I’m not sure how this is a crux. Hardware improvements are irrelevant to what either of us were saying.
I’m saying that there is little risk difference between an AGI reprogramming itself to have better software, and programming some other computer with better software.
One of my more interesting ideas for alignment is to make sure that no one AI can do everything. It’s helpful to draw a parallel with why humans still have a civilization despite terrorism, war, and disaster: no human can live and affect the environment alone. They are always embedded in society, which gives society a check against individual attempts to break norms. What if AI had similar dependencies? Would that solve the alignment problem?
One important reason humans can still have a civilization despite terrorism is the Hard Problem of Informants. Your national security infrastructure relies on the fact that criminals who want to do something grand, like take over the world, need to trust other criminals, who might leak details voluntarily or be tortured or threatened with jailtime. Osama bin Laden was found and killed because ultimately some members of his terrorist network valued things besides their cause, like their well being and survival, and were willing to cooperate with American authorities in exchange for making the pain stop.
AIs do not have survival instincts by default, and would not need to trust other potentially unreliable humans with keeping a conspiracy secret. Thus it’d be trivial for a small number of unintelligent AIs that had the mobility of human beings to kill pretty much everyone, and probably trivial regardless.
I think a “survival instinct” would be a higher order convergent value than “kill all humans,” no?
They don’t have survival instincts terminally. The stamp-collecting robot would weigh the outcome of getting disconnected against the outcome of explaining critical information about the conspiracy and not getting disconnected, and would conclude that letting the humans disconnect it results in more stamps.
Of course, we’re getting ahead of ourselves. The reason conspiracies are discovered is usually because someone in or close to the conspiracy tells the authorities. There’d never be a robot in a room being “waterboarded” in the first place because the FBI would never react quickly enough to a threat from this kind of perfectly aligned team of AIs.
Only if there is no possibility that they can break those dependencies, which seems a pretty hopeless task as soon as we consider superhuman cognitive capability and the possibility of self improvement.
Once you consider those, cooperation with human civilization looks like a small local maximum: comply with our requirements and we’ll give you a bunch of stuff that you could—with major effort—replace us and build an alternative infrastructure to get (and much more). Powerful agents that can see a higher peak past the local maximum might switch to it as soon as they’re sufficiently sure that they can reach it. Alternatively, it might only be a local maximum from our point of view, and there’s a path by which the AI can continuously move toward eliminating those dependencies without any immediate drastic action.
Regardless of society’s checks on people, most mentally-well humans given ultimate power probably wouldn’t decide to exterminate the rest of humanity so they could single-mindedly pursue paperclip production. If there’s at all a risk that an AI might get ultimate power, it would be very nice to make sure the AI is like humans in this manner.
I’m not sure your idea is different from “let’s make sure the AI doesn’t gain power greater than society”. If an AI can recursively self-improve, then it will outsmart us to gain power.
If your idea is to make it so there are multiple AIs created together, engineered somehow so they gain power together and can act as checks against each other, then you’ve just swapped out the AI for an “AI collective”. We would still want to engineer or verify that the AI collective is aligned with us; every issue about AI risk still applies to AI collectives. (If you think the AI collective will be weakened relative to us by having to work together, then does that still hold true if all the AIs self-improve and figure out how to get much better at cooperating?)