Compelling Villains and Coherent Values

Epistemic status: This essay attempts to communicate intuitions strongly shaped by Joe Carlsmith’s writing, particularly “On Green.”

Warning: very minor spoilers, mostly for “The Dark Knight”.

Some people have a sense of weight about them. They know exactly what they want and can explain why it is Right. Presented with a moral dilemma, they resolve conflicts between local values in a consistent way by appealing to higher values—they can tell you today whether they would actually pull the lever in the trolley problem and why they (hopefully) wouldn’t harvest the organs from one patient to save five—or why they would be compelled to. Most of these people seem to be deeply religious—the real-world example I have in mind is an intelligent Catholic friend who once had his bike stolen and explained to me that it was a shame because he would have happily given it away if someone needed it. I know him well enough to say it was true. Ask him what he wants out of life and he gives a well-reasoned, specific answer justified in terms of a hierarchy of values from family and charity to (his idea of) “serving God.” I think of this as moral[1] coherence.

The paragon of this class is some sort of warrior monk defending the gates of an abbey against a horde of barbarians, and afterwards burying each of them with an hours-long prayer. The opposite is perhaps the Joker, whom we will come back to.

Mostly, I don’t get this sense of weight from nerds or rationalists—I think it’s not a natural part of nerd culture. Nerds are too clever and contrarian; we mostly can’t even agree with ourselves. Also, we tend to “throw off the shackles” of religion and often conventional morality, which are both systems carefully engineered for moral coherence, though modern morality is much more cosmopolitan. However, I do get a certain sense of moral coherence off of Eliezer (Scott seems more complicated).

Some of the most compelling villains are morally coherent. They have some consistent, authentic beliefs that set them on a collision course with the protagonist and in the best cases bring out the moral conflicts within the protagonist, often by posing some kind of ultimatum (for instance Jason Bourne, who has a complicated relationship with authority and nationalism, usually fights hardcore patriots). We tend to like a little moral confusion in our protagonists; perhaps it makes them relatable. Think of the cold determination of Thanos facing down the internally divided Avengers, or Colonel Miles Quaritch mercilessly slaughtering aliens in Avatar (the movie was a bit basic but I think he was an excellent, formidable antagonist). In the worst cases, such as Voldemort[2] or Sauron, the villain is just pure evil, but in the better cases they are pure ideological extremists with their own compelling worldview. Sometimes the protagonist has two or more conflicting priorities, almost like an internal society trying to negotiate their actions, and the villain embodies one of the components and forces the protagonist to acknowledge and reconcile it.

But I think the greatest hero/villain dynamics reverse this pattern. The hero embodies a strict moral code and the villain is… insane. The clearest well-known example is Batman versus the Joker. I am thinking particularly of their portrayals in “The Dark Knight.” Batman is a Kantian who refuses to kill and whose entire being is devoted to a quest for justice/vengeance. He’s superhumanly disciplined with a spartan training regimen (and therefore jacked, which seems to be a common sign of moral coherence in both fiction and reality). He doesn’t seem to spend any time struggling with moral dilemmas[3], though his actions often seem morally ambiguous to others. The Joker, on the other hand, is dangerous because of his intelligence, charisma, and raw unpredictability, but doesn’t seem to have any fixed objective at all, except possibly to cause chaos. He is in a sense so morally incoherent that he no longer faces internal conflict; he is just a force of nature. Their struggle is fascinating to watch not because there seems to be any chance of either party questioning themselves, but because there doesn’t. It’s like an unstoppable force meeting an immovable object. The suspense is mostly over who’s going to end up inside the blast radius[4].

Think of the Dark Tower’s gunslinger Roland Deschain versus the man in black, Walter O’Dim. Roland is consistently characterized as simple-minded, not particularly intelligent, driven by one overriding goal. Walter is clever and full of secrets, but he’s his own worst enemy everywhere he appears in Stephen King’s universes. My reading is that, behind the curtain, even he doesn’t really know what his schemes are for.

Similarly, Lucifer is often portrayed as a paragon of moral incoherence for its own sake, almost constructing a kind of meta-coherence, yet constantly undercutting himself with overly clever schemes.

I think it’s worth noting here that coherent morals are quite separable from coherent beliefs. Decision theorists have worked out a lot of the math of coherent belief: there is a thorough argument that any beliefs satisfying the Cox axioms for consistency can be represented and updated according to probability theory. This doesn’t tell us how to correct our incoherent beliefs, but in practice I think Bayesian updating tends to smooth out inconsistencies automatically[5]. On the side of action, coherent preferences can be represented as utility functions satisfying the von Neumann-Morgenstern axioms of rationality. But humans don’t have coherent preferences, and we don’t seem to have a mathematical description of how to correct this; the situation seems less convenient for “cohering” preferences[6] than for cohering beliefs. Such a description may be required for a solution to the alignment problem: corrigibility, for instance, is essentially (meant to be) a choice of algorithm for resolving incoherent preferences.
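To make “incoherent preferences” concrete, here is a minimal sketch in code of the classic money-pump argument (my illustration, not anything from the essay; the items and fee are hypothetical): an agent whose preferences form a cycle can be traded in circles for a fee, which is exactly the exploitability that a von Neumann-Morgenstern utility function rules out.

```python
# A minimal sketch of the money-pump argument: an agent whose preferences
# form a cycle A > B > C > A will pay a small fee for each "upgrade" and
# can be walked in circles indefinitely.

# Hypothetical preference cycle: each key is strictly preferred to its value.
PREFERS_OVER = {"A": "B", "B": "C", "C": "A"}

def money_pump(start_item: str, fee: float, rounds: int) -> float:
    """Repeatedly offer the agent the item it prefers to its current one,
    charging `fee` per swap. Returns the total amount extracted."""
    item, extracted = start_item, 0.0
    for _ in range(rounds):
        # Find the item the agent strictly prefers to what it holds now.
        item = next(k for k, v in PREFERS_OVER.items() if v == item)
        extracted += fee  # the agent happily pays for each "improvement"
    return extracted

# After three swaps the agent holds its original item but is three fees poorer.
print(money_pump("A", fee=1.0, rounds=3))  # -> 3.0
```

No analogous trap can be set for an agent maximizing a fixed utility function, which is one way of stating why coherence of preferences matters.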

I wonder if rationalists, having basically worked out coherence of belief but not of action, are particularly prone to power-seeking. When you’re not sure what you want but you feel exceptionally competent, it makes sense to increase your optionality so you can get whatever you happen to want tomorrow. This may seem counterintuitive, since a person with very coherent values should pursue power instrumentally. But when you have a specific goal there are often more direct ways to obtain it than first power-seeking, such as various forms of self-sacrifice. For instance, you might serve as a soldier to protect your country instead of forming your own mercenary band (= law firm, in the modern day), donate your kidney instead of training for a marathon, or volunteer at a food bank instead of grinding out leetcode problems after work. A reliable gear in a larger machine might be less agentic but more useful than a scheming Machiavellian. Goals like solving the alignment problem are unusually strong justifications for cultivating personal excellence, but there are a lot of direct ways to save lives that just aren’t as glorious. I am not convinced that EA-style earning to give always or even usually works out to higher expected value for most people, since I tend to think that doing good now compounds faster than career capital.

Also, it is interesting to consider whether there are deep reasons that coherent belief seems to be anti-correlated with coherent morals. For instance, internal divisions are the opposite of moral coherence, but they should increase self-doubt, which is useful for reasoning under uncertainty. A Bayesian cultivates lightness, but a warrior monk has weight. Can these two opposing and perhaps contradictory natures be united to create some kind of unstoppable Kwisatz Haderach? Romantically, perhaps, this is how I like to imagine Miyamoto Musashi, but I don’t know of anyone fit to inherit this mantle today.

  1. ^

    Substitute “utility” for “moral” if you like—this intellectualization of vocabulary also took place historically in decision theory.

  2. ^

    Various supporting villains are more complex than the Dark Lord. Lucius Malfoy is a true believer who feels a little more realistic.

  3. ^

    For that, you would need to see either the earlier or the later movie in the trilogy—I watched TDK first (out of order), which probably colored my impression of this Batman significantly.

  4. ^

    The Doctor and the Master have a similar relationship. Though, on second thought, perhaps there is just a hint of dread that even our most morally coherent heroes will stare into the void of a mad nemesis’s broken mind and be changed—and that if even they can be corrupted, maybe there is no fundamental difference between good and evil.

  5. ^

    See Blackwell and Dubins’ results on merging of opinions. Technically this only explains how people with different priors that still obey probability theory can come to agreement. But at least one doesn’t need to get the max entropy calculations exactly right—some sorts of errors can be smoothed out by Bayesian updating.
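    A quick numerical illustration of the flavor of this (my own toy sketch, not the Blackwell-Dubins theorem itself; the bias and sample size are arbitrary): two Bayesians with wildly different Beta priors on a coin’s bias watch the same flips and end up nearly agreeing.

    ```python
    # Toy demonstration of priors washing out under Bayesian updating: two
    # agents with very different Beta priors observe the same coin flips.
    import random

    random.seed(0)
    TRUE_BIAS = 0.7
    flips = [random.random() < TRUE_BIAS for _ in range(1000)]
    heads = sum(flips)
    tails = len(flips) - heads

    def posterior_mean(a: float, b: float) -> float:
        """Posterior mean of the coin's bias under a Beta(a, b) prior."""
        return (a + heads) / (a + b + heads + tails)

    # A skeptic (prior mean ~0.05) and an enthusiast (prior mean ~0.95)
    # land within a couple of percentage points of each other.
    print(posterior_mean(1, 20))   # skeptical prior, roughly 0.69
    print(posterior_mean(20, 1))   # credulous prior, roughly 0.71
    ```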

  6. ^

    This is my intuition because the distinction seems similar to the choice of UTM for the universal distribution in belief and action. For pure prediction it only matters up to a constant, but in history-based reinforcement learning choosing the wrong UTM ruins convergence guarantees.
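    As a rough formalization of “only up to a constant” (my gloss, using the standard dominance bound for universal priors): for any two universal machines $U$ and $V$ there is a constant $c_{UV} > 0$, independent of the data, such that

    $$M_U(x) \;\ge\; c_{UV} \, M_V(x) \quad \text{for every finite string } x,$$

    so switching UTMs costs at most $-\log_2 c_{UV}$ bits of cumulative log-loss on any sequence. Nothing analogous bounds the reward lost by an agent acting under the wrong universal prior.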