Allegory On AI Risk, Game Theory, and Mithril
“Thorin, I can’t accept your generous job offer because, honestly, I think that your company might destroy Middle Earth.”
“Bifur, I can tell that you’re one of those ‘the Balrog is real, evil, and near’ folks who thinks that in the next few decades Mithril miners will dig deep enough to wake the Balrog, causing him to rise and destroy Middle Earth. Let’s say for the sake of argument that you’re right. You must know that lots of people disagree with you. Some don’t believe in the Balrog, others think that anything that powerful will inevitably be good, and more think we are hundreds or even thousands of years away from being able to disturb any possible Balrog. These other dwarves are not going to stop mining, especially given the value of Mithril. If you’re right about the Balrog we are doomed regardless of what you do, so why not have a high-paying career as a Mithril miner and enjoy yourself while you can?”
“But Thorin, if everyone thought that way we would be doomed!”
“Exactly, so make the most of what little remains of your life.”
“Thorin, what if I could somehow convince everyone that I’m right about the Balrog?”
“You can’t because, as the wise Sinclair said, ‘It is difficult to get a dwarf to understand something, when his salary depends upon his not understanding it!’ But even if you could, it still wouldn’t matter. Each individual miner would correctly realize that his mining alone is extraordinarily unlikely to be what wakes the Balrog, and so he would find it in his self-interest to keep mining. And knowing that others are going to continue extracting Mithril means it really doesn’t matter whether you mine: if we are close to disturbing the Balrog, he will be awoken regardless.”
“But dwarves can’t be that selfish, can they?”
“Actually, altruism could doom us as well. Given Mithril’s enormous military value, many cities rightly fear that without new supplies they will be at the mercy of cities that get more of this metal, especially as it’s known that the deeper Mithril is found, the greater its powers. Leaders who care about their citizens’ safety and freedom will keep mining Mithril. If we are soon all going to die, altruistic leaders will want to make sure their people die while still free citizens of Middle Earth.”
“But couldn’t we all coordinate to stop mining? This would be in our collective interest.”
“No, dwarves would cheat, rightly realizing that their mining just a little more Mithril is highly unlikely, by itself, to do anything to the Balrog. And the more you expect others to cheat, the less your own cheating matters to whether the Balrog gets us, if your assumptions about the Balrog are correct.”
“OK, but won’t the rich dwarves step in and eventually stop the mining? They surely don’t want to get eaten by the Balrog.”
“Actually, they have just started an open Mithril mining initiative which will find and then freely disseminate new and improved Mithril mining technology. These dwarves earned their wealth through Mithril, they love Mithril, and while some of them can theoretically understand how Mithril mining might be bad, they can’t emotionally accept that their life’s work, the acts that have given them enormous success and status, might significantly hasten our annihilation.”
“Won’t the dwarven kings save us? After all, their primary job is to protect their realms from monsters.”
“Ha! They are more likely to subsidize Mithril mining than to stop it. Their military machines need Mithril, and any king who prevented his people from getting new Mithril just to stop some hypothetical Balrog from rising would be laughed out of office. The common dwarf simply doesn’t have the expertise to evaluate the legitimacy of the Balrog claims and so, rightly from his viewpoint at least, would use the absurdity heuristic to dismiss any Balrog worries. Plus, remember that the kings compete with each other for the loyalty of dwarves, and even if a few kings came to believe in the dangers posed by the Balrog, they would realize that if they tried to impose costs on their people, they would be outcompeted by fellow kings who didn’t try to restrict Mithril mining. Bifur, the best you can hope for with the kings is that they don’t do too much to accelerate Mithril mining.”
“Well, at least if I don’t do any mining it will take a bit longer for miners to wake the Balrog.”
“No Bifur, you obviously have never considered the economics of mining. You see, if you don’t take this job someone else will. Companies such as ours hire the optimal number of Mithril miners to maximize our profits and this number won’t change if you turn down our offer.”
“But it takes a long time to train a miner. If I refuse to work for you, you might have to wait a bit before hiring someone else.”
“Bifur, what job will you likely take if you don’t mine Mithril?”
“Gold mining.”
“Mining gold and mining Mithril require similar skills. If you get a job working for a gold mining company, that firm will hire one fewer dwarf than it otherwise would, and that dwarf’s time will be freed up to mine Mithril. If you consider the marginal impact of your actions, you will see that working for us really doesn’t hasten the end of the world, even under your Balrog assumptions.”
“OK, but I still don’t want to play any part in the destruction of the world, so I refuse to work for you even if this won’t do anything to delay when the Balrog destroys us.”
“Bifur, focus on the marginal consequences of your actions and don’t let your moral-purity concerns cause you to make the situation worse. We’ve established that your turning down the job will do nothing to delay the Balrog. It will, however, cause you to earn a lower income. You could have donated that income to the needy, or even used it to hire a wizard to work on an admittedly long-shot Balrog-control spell. Mining Mithril is both in your self-interest and what’s best for Middle Earth.”
I like this. I wonder what would happen if you post this on some LOTR fanfiction subreddit or something, with a link to a discussion of AI risk at the end.
“But, Bifur, the prophecies are not that clear. It’s possible the Balrog will annihilate us, but it’s also possible he will eradicate poverty, build us dwarf-arcs to colonize other planets, and grant us immortality. Our previous mining efforts have produced some localized catastrophes, but the overall effect has been fantastically positive, so it’s reasonable to believe continued mining will produce even more positive outcomes.”
“Yes, but while I’d pay a million diamonds for immortality, I’d pay a thousand million to save the dwarven race. The two paths are wildly disproportionate.”
“How many diamonds do you have?”
This seems like a better metaphor for fossil fuel extraction and climate change than AI risk.
My son said the same thing when he proofread it for me.
The same game theory would seem to apply equally well in both cases. In what way does it work better with climate change?
An interesting metaphor, given how the balrog basically went back to sleep after eating the local (and only the local) dwarves. And after some clumsy hobbitses managed to wake him up again, he was safely disposed of by a professional. In no case did the balrog threaten the entire existence of Middle-Earth.
In the first draft of The Lord of the Rings, the Balrog ate the hobbits and destroyed Middle Earth. Tolkien considered this ending unsatisfactory, if realistic, and wisely decided to revise it.
“You keep using that word, I do not think it means what you think it means”
I think Houshalter thinks it means “given the premises, is this a way things are likely to turn out?”. It might be true that “balrog eats hobbits, destroys Middle-earth” is a realistic outcome given everything up to the release of the balrog as premise.
So you are using the word in the sense that a balrog “realistically” can be killed only by a very specific magic sword, or, say, Ilúvatar “realistically” decides that all this is too much and puts his foot down (with an audible splat!)? X-)
I’m not using the word at all in this thread, so far as I can recall. FWIW neither of those seems super-realistic to me given Tolkien’s premises.
Well, yes, by “you” I meant “all you people” :-D
I think the appropriate word in the context is “plausible”.
Making a small step towards seriousness, yes, Ilúvatar suddenly taking interest in Middle Earth isn’t terribly plausible, but super-specificity has its place in Tolkien’s world: the only way Sauron can be defeated is by dropping some magical jewelry into a very specific place.
That was a Shallow Balrog. Everyone knows a Balrog’s strength and hunger increase as you dig deeper, and the dwarves are starting to dig pretty deep to get the mithril out.
Yeah, you know why Deep Balrogs are so rare? Every time someone manages to find and wake one and he climbs out of the pit and starts to eat Middle-Earth, a certain all-seeing Eye goes “MY WORLD! SPLAT!” and there is one less Deep Balrog around.
I think this may have started to be less useful as an analogy for AI safety now.
Because it didn’t go the way you liked?
Er, no. Because we don’t (so far as I know) have any reason to expect that if we somehow produce a problematically powerful AI anything like an “all-seeing Eye” will splat it.
(Why on earth would you think my reason for saying what I said was “because it didn’t go the way [I] liked”? It seems a pointlessly uncharitable, as well as improbable, explanation.)
Because there are plenty of all-seeing eye superpowers in this world. Not everyone is convinced that the very real, very powerful security regimes around the world would be suddenly left inept when the opponent is a computer instead of a human being.
My comment didn’t contribute any less than yours to the discussion, which is rather the point. The validity of an allegory depends on the accuracy of the setup and rules, not the outcome. You seemed happy to engage until it was pointed out that the outcome was not what you expected.
Those “very real, very powerful security regimes around the world” are surprisingly inept at handling a few million people trying to migrate to other countries, and similarly inept at handling the crime waves and the political fallout generated by it.
And if you underestimate how much of a threat a mere “computer” could be, read the “Friendship is Optimal” stories.
I’ve read the sequences on friendliness here and find them completely unconvincing, with a lack of evidence and a one-sided view of the problem. I’m not about to start generalizing from fictional evidence.
I’m not sure I agree with your assessment of the examples that you give. There are billions of people who would like to live in first-world countries but don’t. I think immigration controls have been particularly effective if only a few million people are crossing borders illegally in a world of 7 billion. And most of the immigration issues being faced by the world today, such as the Syrian refugees, are about asylum-seekers who are in fact being permitted entry, just in larger numbers than the secondary systems were designed to support. Also, the failure modes are different. If you let the wrong person in, what happens? Statistically speaking, nothing of great consequence.
Crime waves? We are currently at one of the lowest periods of violence per capita. I think the powers that be have been doing quite a good job actually.
Oh, I see. OK then.
My impression was that it was generally agreed that superintelligences sufficiently visible and slow-growing to be squelched by governments and the like aren’t much of a threat; the balrog-like (hypothetical) ones are the ones that emerge too quickly and powerfully to be so easily stopped. So the threats you have in mind aren’t in the “balrog” category at all, for me.
My first comment in the balrog discussion was the one you took exception to. The point at which you say I stopped being “happy to engage” is the point at which I started engaging. The picture you’re trying to paint is literally the exact opposite of the truth.
I don’t think that’s the case. A superintelligence doesn’t have to be balrog like to advance to the point where it’s too big to fail and thus not easily regulated by the government.
EY et al. focus more on the threat of a superintelligence that can improve itself quickly and gain a lot of power in a short amount of time, but that’s not the only concerning scenario.
When a bank like HSBC can launder drug and terrorist money without any of its officials going to prison for it, the amount of control that a government could exert on a big company run by a complex AI might also be quite limited.
When the superintelligence becomes good enough at making money and buying politicians, it doesn’t have to worry so much about government action, and has enough time to grow slowly.
How much does Putin cost? Or the Chinese Politbureau?
You have at least two options: either buy Putin, or hire someone to replace him, whichever is cheaper. It’s not like Putin single-handedly rules his country; he relies on his army, police, secret services, etc. All these institutions probably have many people who would enjoy replacing Putin at the top of the pyramid. Throw in some extra money (“if you are going to replace Putin, here are a few extra billions to bribe whoever needs to be bribed to help you with the coup”).
I am not familiar with the internal structure of the Chinese Politbureau, but I would guess this one is easier. There are probably competing factions, so you will support the one more friendly to you.
But there is always the option to ignore both Putin and the Chinese Politbureau, and upload yourself to a computer center built in some other country.
Correct, and yet Putin rules with hardly a challenge to his supremacy.
Money is not very useful when you’re dead.
If you are looking at an AGI that manages investments at a company like Goldman Sachs in an effective way, it doesn’t even need to know directly how to buy politicians. If it makes a lot of money for Goldman Sachs, there are other people at Goldman who can do the job of buying politicians.
When Ray Dalio of Bridgewater Associates wants to build an AI that can replace him after he retires, it’s not clear whether any government can effectively regulate it.
Ah, now we are at the crux of the issue. That is not generally agreed upon, at least not outside of the Yudkowsky-Bostrom echo chamber. You’ll find plenty of hard-takeoff skeptics even here on LessWrong, let alone in broader AI circles, where hard-takeoff scenarios are given little credence.
I think you have misunderstood me. I was not intending to say that hard-takeoff scenarios are likely (for what it’s worth, I don’t think they are) but that they are what was being analogized to balrogs here.
(Of course a slow-takeoff controllable-by-governments superintelligence can still pose a threat—e.g., some are worried about technological unemployment, or about those who own the AI(s) ending up having almost all the world’s resources. But these are different, not very balrog-like, kinds of threat.)
Only on LW: disputes about ways in which an AI is like (or unlike) a balrog X-D
Well, we’ve had a basilisk already. Apparently we’re slowly crawling backwards through alphabetical order. Next up, perhaps, Bahamut or Azathoth.
Azathoth, check.
Is there a directory of the gods and monsters somewhere? If not, I think I’ll start one.
I dunno :-) Didn’t we just have a discussion about controlling the (Beta) AI by putting the fear of God (Alpha AI) into it?
Oh, but that’s an entirely different proposition—that’s about the Deep Balrogs believing that an all-seeing Eye will splat them if they try to eat Middle-earth. (Also, I didn’t get the impression that the “fear of God” proposal was regarded as terribly convincing by most readers...)
Well, they are Maiar and so definitely should have a clue.
This Thorin guy sounds pretty clever. Too bad he followed his own logic straight to his demise, but hey he stuck to his guns! Or pickaxe, as it were.
His argument attempting to prevent Bifur from trying to convince fellow Dwarves against mining into the Balrog’s lair sounds like a variation on the baggage carousel problem (this is the first vaguely relevant link I stumbled across, don’t take it as a definitive explanation)
Basically, everyone wants resource X, which drives a self-interested behavior that collectively lowers everyone’s overall success rate, while the solution that maximizes total success goes directly against each person’s self-interest. The result is an equilibrium where everyone works sub-optimally.
In this variation, the action of Thorin’s operation achieving resource M moves everyone slightly closer to negative consequence B. So the goal is no longer to maximize resource collection, but to minimize it. Doing so goes against everyone’s self-interest, resulting in etc. That is what Thorin is so eloquently trying to prevent Bifur from doing.
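A minimal toy sketch of that incentive structure in Python (the numbers and the shape of the risk curve are my own illustrative assumptions, not anything from the post): each dwarf compares a private mining wage against the tiny marginal increase in a shared catastrophe risk, so mining is individually rational no matter what the others do, even though everyone mining is collectively far worse than universal restraint.

```python
# Toy model of the "public bad" structure described above. All numbers and
# the p_balrog() function are illustrative assumptions, not claims from the post.

N = 10_000        # number of dwarves
WAGE = 1.0        # private payoff to a dwarf who mines
LOSS = 10_000.0   # loss to EACH dwarf if the Balrog wakes

def p_balrog(miners: int) -> float:
    """Assumed probability the Balrog wakes, rising with the number of miners."""
    return min(1.0, miners / (2 * N))

def expected_payoff(i_mine: bool, others_mining: int) -> float:
    """Expected payoff to one dwarf, given his choice and how many others mine."""
    total = others_mining + (1 if i_mine else 0)
    return (WAGE if i_mine else 0.0) - LOSS * p_balrog(total)

for others in (0, N // 2, N - 1):
    gain = expected_payoff(True, others) - expected_payoff(False, others)
    print(f"others mining = {others:5d}, private gain from mining = {gain:+.2f}")

# Mining is individually worthwhile no matter what the others do (+0.50 each
# time), yet the all-mine equilibrium is far worse than no one mining:
print("everyone mines:", expected_payoff(True, N - 1))   # about -4999
print("no one mines:  ", expected_payoff(False, 0))      # 0.0
```

Under these assumptions the private gain from mining never flips sign, which is exactly the equilibrium Thorin is describing and that Bifur would have to dislodge.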
There are a couple ways Bifur can approach this.
He could do it through logical discourse: Thorin is in error when he claims that any individual miner’s contribution is too small to matter, because that claim assumes unearthing the Balrog is a matter of incrementally filling a loading bar, where each Dwarf’s contribution is minuscule. That’s the naive way to imagine the situation, since you see in your mind the tunnel boring ever closer to the monster. But given that we can’t know the location or depth of the Balrog, each miner’s strike is actually more like a dice roll. Even if the die has a great many faces, recontextualizing the danger in this manner will surely cause some dwarves to update their risk-reward assessment of mining Mithril. A campaign of this nature will at least lower the number of dwarves willing to join Thorin’s operation, although it doesn’t address the “Balrog isn’t real” or “Balrog isn’t evil” groups.
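To put rough, invented numbers on that dice-roll framing (the per-strike probability and work rates below are mine, not Bifur’s): if every strike is an independent chance of breaching the Balrog’s lair, the risk from any single dwarf stays tiny while the industry-wide risk compounds toward near certainty.

```python
# Illustrative only: per-strike probability and work rates are assumptions.
p_per_strike = 1e-9                  # assumed chance a single strike wakes the Balrog
strikes_per_dwarf_per_year = 1e6     # assumed work rate of one miner
dwarves = 10_000

def cumulative_risk(n_strikes: float) -> float:
    """Probability that at least one of n independent strikes wakes the Balrog."""
    return 1.0 - (1.0 - p_per_strike) ** n_strikes

print(f"one dwarf, one year:    {cumulative_risk(strikes_per_dwarf_per_year):.4%}")
print(f"all dwarves, ten years: {cumulative_risk(strikes_per_dwarf_per_year * dwarves * 10):.4%}")
# Roughly 0.1% for the lone dwarf, but essentially 100% for the whole industry
# over a decade, which is why the individual and collective views come apart.
```

The exact numbers don’t matter; the point is that “my strike almost certainly won’t do it” and “our strikes almost certainly will” can both be true at once.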
Alternatively, he could try to normalize new moral behavior. People are willing to work against their self-interest if doing so demonstrates a socially accepted/enforced moral behavior. If he were a sly one, he could sidestep the divisive Balrog issue altogether and simply spread the notion that wearing or displaying Mithril is sinful within the context of Dwarven values; e.g., maybe it’s too pragmatic and not extravagant enough for a proper ostentatious Dwarven sensibility. That could shut down Thorin’s whole operation without ever addressing the Balrog.
But Bifur probably sees the practical value of Mithril beyond its economic worth. As Thorin says, it’s vital for the war effort—completely shutting down all Mithril mining may not be the best plan if it results in a number of Dwarf casualties similar to or greater than what he estimates a Balrog could do. So a more appetizing plan might be to combine the manipulation of logic and social norms. He could perform a professional survey of the mining systems. Based on whatever accepted arbitrary standards of divining danger the Dwarves agree to (again, assuming the location of the Balrog is literally unknowable before unearthing it due to magic), Bifur could identify mining zones of ever increasing danger within whatever tolerances he’s comfortable with. He could then shop these around to various mining operations as helpful safety guidelines until he has a decent lobby behind him in order to persuade the various kings to ratify his surveys into official measuring standards. Dwarves are still free to keep mining deeper if they wish, but now with a socially accepted understanding that heading into each zone ups their risk relative to their potential reward, naturally preventing a majority of Dwarves from wanting to do so. Those who believe the Balrog doesn’t exist or is far away would be confronted with Bifur’s readily available surveys, putting them on the defensive. There would still be opposition from those who see the Balrog as “not evil”, but the momentum behind Bifur’s social movement should be enough to shout them down. This result would allow Thorin’s operation to continue to supply the realm with life-saving Mithril, while at least decreasing the danger of a Balrog attack for as long as Bifur’s standards are recognized.
Finally, Bifur could try to use evidence-based research and honestly performed geological surveys, but even in the real world, where locating the Balrog beforehand is technologically possible, that tends to be a weaker tactic than social manipulation. Only other experts will be able to parse them, his opponents will have emotional arguments that will give them the upper hand, and Thorin’s baggage-carousel logic would remain unchallenged.
Bifur should tell everyone that he is going to try to wake the Balrog, and dig directly towards it, openly advertising his intent to be the first to wake it. Spreading rumors that he intends to yoke the Balrog to a plow, and that he alone has a specific plan for accomplishing this would be helpful too.
The action taken to control the insanity of that one crazy dwarf might prevent the catastrophe outright.
That action is likely to be involuntary commitment to a mental hospital. A clear hint that only crazies worry about balrogs.
Interesting. Should MIRI announce they have become negative utilitarians who think the universe contains more suffering than happiness, and so they intend to create a paperclip maximizer?
Oog and Og are sitting in the forest. Oog says, ‘Man, someone could build a fire and totally destroy this whole place; we need to devote a lot more energy to stove research before that happens.’ Og says, ‘Sure, fire, OK, that’s sci-fi stuff. I’m going to go gather some berries while you waste your time with heat flow calculations and the stove safety problem.’
Oog doesn’t like being brushed off, so he decides to try a different tactic. Og returns to see Oog waving a brightly burning torch around in the air, dangerously close to the dry leaves and twigs of their shelter. Og’s reaction is far less skeptical than before: ‘Oog! You will kill us all, stop waving that thing around until we have at least a pit dug!’
So yes, the best thing you can do to popularize the cause of AI safety could be building an obviously unsafe AI and doing something demonstrably dangerous with it.
Totally unrealistic.
Thorin was never in a position to hire Mithril miners. He had sufficient capital for only a very brief time before dying at the Battle of Five Armies.
In other words, if you set up the allegory so as to force a particular conclusion, that proves that that’s the proper conclusion in real life, because we all know that the allegory must be correct.
I’m a teacher (in real life). I set up the allegory to communicate a set of my beliefs about AI.
I think this is more useful as a piece that fleshes out the arguments; a philosophical dialogue.
I don’t believe for one moment that using a Balrog analogy actually makes people understand the argument when they otherwise wouldn’t.
It is a fallacy to think of AI risk as like Balrogs because someone has written a plausible-sounding story comparing it to Balrogs. And that seems to be the main effect of the Balrog analogy.
I disagree, I think there is value in analogies when used carefully.
Yes, I also agree with this; you have to be careful of implicitly using fiction as evidence.
You assume that balrogs can only be stopped by unmined bedrock. Since the chance of a given balrog being stopped by bedrock but not by the combined efforts of the dwarves is minuscule compared to the chance of a weak one that can be stopped by mithril-clad soldiers or a strong one that can dig through mere stone, the best defense against balrogs is to mine and guard the mines well.
I have the feeling you still don’t agree with Thorin. Why not?
No, I do agree with Thorin.
If you really believe in this allegory, you should try to intervene before people choose what research field to specialize in. You are not going to convince people to give up their careers in AI after they’ve invested years in training. But if you get to people before they commit to advanced training, it should be pretty easy to divert their career trajectory. There are tons of good options for smart idealistic young people who have just finished their undergraduate degrees.
I would like to see a single real-life example where this worked.
(“single” not as in “a single person”, but as in “a single field to avoid”)
I’ve been hired by a rich dwarf to fortify his castle, and I found that instead of using the world-endangering Mithril you get about 80% of the strength with an alloy that contains a very small amount of Mithril and large amounts of other metals that do not require deep excavation (we patented the alloy as Mothril(TM)). While the strength is a little less, it’s much cheaper and you can make it up in volume. If people need an even stronger metal, I think they should be working on Mothril++.
So I think the economic explanation for Mithril is somewhat weaker than the dwarves’ desire to keep digging deeper. They really need to be re-educated (I hope wizards are also working on that).
I assign a higher probability to a group of very dedicated wizards succeeding; it’s worth re-doing the above decision analysis with those assumptions.
Then there is still the problem of how much time we leave the wizards, and which Mithril mining approaches we should pursue (risky vs. safe).