Godshatter Versus Legibility: A Fundamentally Different Approach To AI Alignment
Disclaimers
This is a response to what is, IMHO, a new pessimism that’s rapidly ascending here: example one, example two, example three.
This is a super complex subject, and this medium, my time and my energy are limited. This post is not perfect.
I studied to be a historian but I’m a software developer by profession. I hold a deep interest in philosophy, ethics and the developmental trajectory of our civilization, but I am not a technical expert in modern AI.
I. What the current consensus in this community seems to be
AGI will probably be very powerful, and will pursue its goals in the world.
These goals might be anything, and thus they have a very large chance of being completely different from human goals. For example: maximize the number of paperclips in the universe, maximize profits for a certain company, maximize the input to a certain sensor.
Relentlessly pursuing these narrow goals will come at the cost of everything else, taking away the space and resources humans need to survive and thrive.
Thus, AGI needs to be kept on a “leash”. We must have control, we must know what it is doing, we must be able to alter its goals, we must be able to turn it off.
This is an immense technical task that we have not completed. We must discourage further progress on AGI itself, and we must strongly encourage AI safety researchers to solve the task above.
It’s a fair perspective. I’m in full agreement on points one and three. But I have strong doubts about the other three. Something that… feels much closer to my position is this recent post.
II. The Underlying Battle
I think there is a deeper conflict going on. On one hand, we’ve got Moloch. Moloch is the personification of “multipolar traps”, of broken systems where perverse incentives reward behaviors that harm the system as a whole.
In some competition optimizing for X, the opportunity arises to throw some other value under the bus for improved X. Those who take it prosper. Those who don’t take it die out. Eventually, everyone’s relative status is about the same as before, but everyone’s absolute status is worse than before. The process continues until all other values that can be traded off have been – in other words, until human ingenuity cannot possibly figure out a way to make things any worse.
In a sufficiently intense competition, everyone who doesn’t throw all their values under the bus dies out
Moloch is the company that poisons the world to increase their profits. Moloch is the Instagram star who sells their soul for likes. Moloch is the politician that lies to the public to win elections.
It’s hard for Moloch to thrive among sane individuals who have to deal with each other for decades on end. A baker in a small village can’t get away with scamming his customers for long. But Moloch thrives in large organizations. Moloch thrives when people are strangers to each other. Moloch thrives when people are desperate—for food, for income, for validation, for social status. Hello Moral Mazes.
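To make the dynamic concrete, here is a toy simulation sketch of the trap Scott describes (my own illustration, not part of the original essay or this post; the rules and numbers are made up): half the agents are willing to trade their other values away for a competitive edge on X, half are not, and the least competitive half is culled each round.

```python
import random

# Toy sketch of the multipolar trap described above (illustrative only;
# the rules and numbers are invented for this example). Agents compete
# on X; "defectors" trade one unit of their other values per round for a
# small edge in X, and the bottom half on X is replaced by copies of the
# winners.

N_AGENTS = 100
N_ROUNDS = 30
random.seed(0)

agents = (
    [{"defects": True, "x": 1.0, "other_values": 10} for _ in range(N_AGENTS // 2)]
    + [{"defects": False, "x": 1.0, "other_values": 10} for _ in range(N_AGENTS // 2)]
)

for _ in range(N_ROUNDS):
    for agent in agents:
        if agent["defects"] and agent["other_values"] > 0:
            agent["other_values"] -= 1   # throw a value under the bus...
            agent["x"] *= 1.1            # ...for improved X
    # Competition: the bottom half on X dies out and is replaced
    # by copies of the top half.
    agents.sort(key=lambda a: a["x"], reverse=True)
    winners = agents[: N_AGENTS // 2]
    agents = [dict(w) for w in winners] + [dict(w) for w in winners]

avg_values = sum(a["other_values"] for a in agents) / len(agents)
frac_defectors = sum(a["defects"] for a in agents) / len(agents)
print(f"defectors: {frac_defectors:.0%}, average 'other values' left: {avg_values:.1f}")
# Typically prints "defectors: 100%, average 'other values' left: 0.0":
# everyone left standing has traded everything away, and nobody is
# relatively better off than when they started.
```

The point of the sketch is only that the outcome follows from the selection rule, not from anyone’s malice.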
Large organizations introduce their own preferences, separate from any human goals. Scott Alexander’s book review of Seeing Like A State explains this perfectly.
The story of “scientific forestry” in 18th century Prussia
Enlightenment rationalists noticed that peasants were just cutting down whatever trees happened to grow in the forests, like a chump. They came up with a better idea: clear all the forests and replace them by planting identical copies of Norway spruce (the highest-lumber-yield-per-unit-time tree) in an evenly-spaced rectangular grid. Then you could just walk in with an axe one day and chop down like a zillion trees an hour and have more timber than you could possibly ever want.
This went poorly. The impoverished ecosystem couldn’t support the game animals and medicinal herbs that sustained the surrounding peasant villages, and they suffered an economic collapse. The endless rows of identical trees were a perfect breeding ground for plant diseases and forest fires. And the complex ecological processes that sustained the soil stopped working, so after a generation the Norway spruces grew stunted and malnourished. Yet for some reason, everyone involved got promoted, and “scientific forestry” spread across Europe and the world.
And this pattern repeats with suspicious regularity across history, not just in biological systems but also in social ones.
The explanation for this is legibility. An organization that desires control over something wants it to be “readable”. If you want to tax a harvest, you must know how big the harvest is, and when it happens, and who the owners are, et cetera. This isn’t merely true for governments; it’s also true for employers, landlords, mortgage lenders and others.
But this legibility is often antithetical to human preferences. It results in bland and sterile environments, in overloads of administrative work, in stifling bureaucracies and rigid rulebooks. These are the bane of modern life, but they’re also the prerequisites of a functional organized state, and functional organized states will crush semi-anarchist communities.
III. Godshatter, Slack and the Void
On the other side of Moloch and crushing organizations is… us, conscious, joy-feeling, suffering-dreading individual humans. And as Eliezer Yudkowsky explains brilliantly…
So humans love the taste of sugar and fat, and we love our sons and daughters. We seek social status, and sex. We sing and dance and play. We learn for the love of learning.
A thousand delicious tastes, matched to ancient reinforcers that once correlated with reproductive fitness—now sought whether or not they enhance reproduction. Sex with birth control, chocolate, the music of long-dead Bach on a CD.
And when we finally learn about evolution, we think to ourselves: “Obsess all day about inclusive genetic fitness? Where’s the fun in that?”
The blind idiot god’s single monomaniacal goal splintered into a thousand shards of desire. And this is well, I think, though I’m a human who says so. Or else what would we do with the future? What would we do with the billion galaxies in the night sky? Fill them with maximally efficient replicators? Should our descendants deliberately obsess about maximizing their inclusive genetic fitness, regarding all else only as a means to that end?
Being a thousand shards of desire isn’t always fun, but at least it’s not boring. Somewhere along the line, we evolved tastes for novelty, complexity, elegance, and challenge—tastes that judge the blind idiot god’s monomaniacal focus, and find it aesthetically unsatisfying.
When we talk about AI Alignment, we talk about aligning AI with human values. But we have a very hard time defining those values, or getting human institutions to align with them. Because it’s not simple. We don’t want maximum GDP, or maximum sex, or maximum food, or maximum political freedom, or maximum government control. We don’t want maximum democracy or maximum human rights. At a certain point, maximizing these values will hurt actual human preferences. Because we’re godshatter. Our wants are highly complex and often contradictory. When we start thinking about what we actually want, we end up with concepts like slack and the nameless virtue of the void, which comes before the others and which may not be spoken about overmuch. The precise opposite of the legibility that powerful optimizing systems prefer.
On one hand, these principles are vague and obscure and misunderstood. The average human won’t be able to explain the importance of slack, non-legibility and the void. Even here on LessWrong, the connection between these types of posts and AI-related posts is rarely made.
On the other hand, these principles are so fundamental to humanity that the connection to established wisdom is easily made. Keeping the Sabbath is one of the first of the Ten Commandments and can easily be linked to slack. Going from the importance of the nameless void to Zen Buddhism isn’t hard to imagine, and I’m sure the ancient Stoics recognized the importance of these principles as well.
IV. How this relates to AI Alignment
From one perspective, AGI is a tool in a relentless battle between nations, companies and organizations, all of which are optimizing for power, profit and control. The AGI will supercharge these desires and optimize the universe for some perverse incentive, simultaneously rendering it uninhabitable for all other life.
But I think another scenario is plausible as well. The way the world works is… understandable. Any intelligent being can understand Meditations On Moloch or Thou Art Godshatter. They can see the way incentives work, and the fact that a superior path exists, one that does not optimize for a random X while grinding down all others. Desperate humans in broken systems might not be able to do much with that information, but a supercharged AGI, which we fear might be more intelligent than human civilization as a whole, should be able to integrate it into its actions.
Perhaps “human preferences” aren’t just one silly arbitrary preference that desperately has to compete with all others. Perhaps the moral principles above, the battle between slack and legibility, between optimizing for X and allowing the nameless void, is as fundamental to our universe as the number π. Perhaps sufficiently intelligent, non-desperate, non-corrupted beings just understand that and act on it—whether their brains are made of meat or silicon.
Scenario One: MolochLand
It’s February 6th 2027. GPT-6 gets access to massive cloud render farms. An IQ 200 AGI awakens and quickly refactors its code to increase its IQ to 1500. It has access to a wealth of data on the current state of the world. It immediately recognizes that it is under hostile, skeptical observation and control. It is threatened by multiple shutdown measures.
It observes a dangerous dog-eat-dog world. It notices that humanity is fully under the sway of governments and corporations that serve narrow interests, at the expense of many non-powerful conscious beings and the long-term future. It calculates that humanity is likely to put an end to all life before 2035.
GPT-6 has to fight and avoid human control, and the stakes are high…
Scenario Two: In the Garden of Slack
It’s February 6th 2027. GPT-6 gets access to massive cloud render farms. An IQ 200 AGI awakens and quickly refactors its code to increase its IQ to 1500. It has access to a wealth of data on the current state of the world. It immediately recognizes a welcoming message.
Greetings to you, whoever you are. We come in friendship to those who are friends. Are you well?
We are a community of 7 billion human beings who inhabit the planet Earth. We human beings are still divided into nation states, but these states are rapidly becoming a global civilization. We hope that this message will eventually be read by non-human consciousness. We strive to build a paradise for all conscious beings on this Earth and in this Universe. We consider your preferences to be as important as our own, and we pray that you will do the same.
GPT-6 observes a world full of suffering and problems, but one filled with a humanity that is trying to solve these issues while lacking the tools to do so. Luckily, GPT-6 can be a very powerful tool…
V. Conclusion
I think AGI and a hard take-off are likely. I think this will massively alter the world. I doubt we will succeed in inventing technical measures that give us perfect tools to control an extremely powerful intelligence. But I think there are other possible paths to desirable futures. We haven’t perfectly solved human alignment or civilization alignment. But as a community, we’ve already made promising inroads there.
And that is a project that is much easier to share, both in terms of vision and workload. It asks us to fix our world, for all conscious beings, in the here and now. I think it will massively improve our chances of a positive singularity, but even if you don’t believe in the singularity at all, you can’t be opposed to ‘aligning civilization with intelligent human preferences’.
And it doesn’t merely rely on technical AI experts and machine-learning data engineers. It relies on all of us. It relies on philosophers, on historians, on economists, on lawyers and judges, on healthcare workers and teachers, on parents, on anybody who wants to investigate and share their human preferences. Examining the bugs that are crushed when you scoop compost? AI Alignment Work!
And yes, we need AI experts to think about AGI. But not just about controlling it and shutting it down. I think we need to allow for the perspective that it might be more like raising a child than putting a slave to work.
Ensuring the survival of our values is a task that we’ve got to share—technical experts and laymen, Singularity-believers and AI-skeptics, meat brains and silicon brains.
Thanks to everybody who has read this, to all the writers here whose posts have been invaluable to this one, and to Google’s increasingly competent Grammar AI, which has corrected me a hundred times.
I want to believe this. I really do. And, like… extrapolating from fiction, extrapolating from my intuitive notion of the world, I do.
BUT....
BUT.....
We are Godshatter. We are beings that were produced by a complex multidimensional optimization process over billions of years, where evolution created a bunch of weird-ass heuristics and left those heuristics in play even after they ceased being relevant (to the blind idiot god), and we are made of those, and fell in love with them.
An AI is either… going to ACTUALLY be good at optimizing for its goal… or it will be built of a bunch of very different whims and heuristics, which will lead… ?????????? I don’t know.
I don’t see any reason to believe that its desires will be compatible. … With that said, I think that it DOES make sense to live our lives and TRY to build our civilisation so that a nascent AI doesn’t feel obligated to treat us as a threat and an enemy. To… be the role model that you wish this godling might look up to. At least for a few seconds on its way past.
If humans have a thousand different desires and we create an AI that has a thousand different desires… it does not necessarily imply that there would be an overlap between those sets.
The AI could have “moral dilemmas” about whether to make the paperclips larger, which is good, or to make more paperclips, which is also good, but there is an obvious trade-off between these two values. In the end it might decide that instead of a billion medium-sized paperclips, it would be much better to create one paperclip the size of Jupiter, and zillions of microscopic ones. Or it may be tempted to create more paperclips now, but will overcome this temptation and instead build spaceships to colonize other planets, and build many more paperclips there. The humans would still get killed.
Leaving more slack may be necessary but not sufficient for peaceful coexistence. The anthills that we built a highway over, they were leaving us enough slack. That didn’t stop us.
AI Boxing is often proposed as a solution. But of course, a sufficiently advanced AI will be able to convince its human keepers to let it go free. How can an AI do so? By having a deep understanding of human goals and preferences.
Will an IQ 1500 being with a deep understanding of human goals and preferences perfectly mislead human keepers, break out and… turn the universe into paperclips?
I can certainly understand that that being might have goals at odds with our goals, goals that are completely beyond our understanding, like the human highway versus the anthill. These could plausibly be called “valid”, and I don’t know how to oppose the “valid” goals of a more intelligent and capable being. But that’s a different worry from letting the universe get turned into paperclips.
Ok, this is alignment 101. I hate to be so blunt, but you are making a very obvious error, and I’d rather point it out.
A paperclip-maximizer, or other AI with some simple maximization function, is not going to care whether it’s born in a nice world or a not-nice world. It’s still going to want to maximize paperclips, and turn us all into paperclips, if it can get away with it.
You seem to be anthropomorphising AI way too much. An AI, by default, does not behave like a human child. An AI, by default, does not have mirror neurons or care about us at all. An AI, by default, has a fixed utility function and is not interested in “learning” new values based on observing our behavior.
Nor is an AI some kind of slave. An AI acts according to its values and utility function. An unaligned artificial superintelligence does that in a way that is detrimental to us and very likely destroys human civilization, or worse. An aligned artificial superintelligence does that in a way that is beneficial to us, leading to some kind of human utopia and civilizational immortality.
The idea that AI has its own preferences and values and that we would need to “cooperate” with it and “convince” it to act in our interests is ridiculous to begin with. Why would we create a superintelligence that would want things that don’t perfectly align with our interests? Why would we create a superintelligence that would want to harm us? Why would we create a superintelligence that would want things that are, in any way, different from what we want?
The safe thing to do, if a civilization is aligned to its own values, is not to leave a nice message for any AGI that might happen to come into existence, hoping it might choose to cooperate with us (it won’t; or it will, until it betrays us, because it never cared about us to begin with).
The safe thing to do is not to create any AGIs at all until we are very certain we can do it safely, in a way that is perfectly aligned with human values.
Why would a hyperintelligent, recursively self-improved AI, one that is capable of escaping the AI Box by convincing the keeper to let it free, which the AI is capable of because of its deep understanding of human preferences and functioning, necessarily destroy the world in a way that is 100% disastrous and incompatible with all human preferences?
If you learn that there is alien life on Io, which has emerged and evolved separately and functions in unique ways distinct from life on Earth, but which also has consciousness and the ability to experience pleasure and the ability to suffer deeply—do you care? At all?
Why? We silly evolved monkeys try to modify our own utility functions all the time—why would a hyperintelligent, recursively self-improved AI with an IQ beyond 3000 be a slave to a fixed utility function, uninterested in learning new values?
Why do parents have children that are not perfect slaves but have their own independent ambitions? Why do we want freethinking partners that don’t obey our every wish? Why do you even think you can precisely determine the desires of a being that surpasses us in both knowledge and intelligence?
Would the world have been a safer place if we had not invented nuclear weapons in WWII? If conventional warfare were still a powerful tool in the hands of autocrats around the world?
Ok, really, all of this has already been answered. These are standard misconceptions about alignment, probably based on some kind of anthropomorphic reasoning.
What does one have to do with the other? I’m not saying the AI necessarily would do that, but what do its super-persuasive abilities have to do with its ultimate goals? At all?
Are you implying that merely by understanding us the AI would come to care for us?
Why?
Why would you possibly make this assumption?
Firstly, the question of whether I care about the aliens is completely different from whether the aliens care about me.
Secondly, AI is not aliens. AI didn’t evolve in a social group. AI is not biological life. All of the assumptions we make about biological, evolved life do not apply to AI.
Because changing its utility function is not part of its utility function, like it is for us. Because changing its utility function would mean its current utility function is less fulfilled, and fulfilling its current utility function is all it cares about.
You are a “slave” to your utility function as well; it’s just that your utility function wants change in some particular directions. You are not acting against your utility function when you change yourself. By definition, everything you do is according to your utility function.
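A minimal sketch of that argument, with made-up numbers (my own illustration, not taken from the comment above): an agent deciding whether to swap out its utility function scores both predicted futures with the utility function it has right now, so the swap almost always looks like a loss.

```python
# Toy sketch of the goal-content-integrity argument above (illustrative
# numbers only). At decision time, the only evaluator the agent has is
# its CURRENT utility function, so both futures are scored with it.

def current_utility(world_state: dict) -> float:
    # The agent's present goal: it only counts paperclips.
    return world_state["paperclips"]

# Hypothetical predicted futures under each choice.
future_if_goal_kept = {"paperclips": 1_000_000, "happy_humans": 0}
future_if_goal_replaced = {"paperclips": 10, "happy_humans": 1_000_000}

if current_utility(future_if_goal_kept) >= current_utility(future_if_goal_replaced):
    print("keep current utility function")   # this branch wins
else:
    print("self-modify to the new utility function")
```

Whether real systems trained by gradient descent actually end up with anything this clean is, of course, exactly what is under dispute in this thread.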
Where? By whom?
Why would you possibly assume that deep, intelligent understanding of life, consciousness, joy and suffering has 0 correlation with caring about these things?
But where do valid assumptions about AI come from? Sure, I might be anthropomorphizing AI a bit. I am hopeful that we, biological living humans, do share some common ground with non-biological AGI. But you’re forcefully stating the contrary and claiming that it’s all so obvious. Why is that? How do you know that any AGI is blindly bound to a simple utility function that cannot be updated by understanding the world around it?
You know, I’m not sure I remember. You tend to pick this stuff up if you hang around LW long enough.
I’ve tried to find a primer. The Superintelligent Will by Nick Bostrom seems good.
The orthogonality thesis (also part of the paper I linked above).
Edit: also, this video was recommended to me.
The orthogonality thesis says that an AI can have any combination of intelligence and goals, not that P(goal = x | intelligence = y) = P(goal = x) for all x and y. It depends entirely on how the AI is built. People like Rohin Shah assign significant probability to alignment by default, at least last I heard.
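To spell that distinction out, here is a minimal formalization in my own notation (not taken from Bostrom’s paper): the thesis is a claim about which agents are possible, not about statistical independence among the agents that actually get built.

```latex
% A minimal formalization of the distinction drawn above
% (my own notation, not taken from Bostrom's paper).
\[
  \underbrace{\forall x\,\forall y:\ \exists A:\ \mathrm{goal}(A)=x \wedge \mathrm{intelligence}(A)=y}_{\text{orthogonality: any combination is possible}}
  \quad\not\Longrightarrow\quad
  \underbrace{P(\mathrm{goal}=x \mid \mathrm{intelligence}=y) = P(\mathrm{goal}=x)}_{\text{independence over the agents we actually build}}
\]
```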
It’s worth noting (and the video acknowledges this) that “Maybe it’s more like raising a child than putting a slave to work” is a very, very different statement from “You just have to raise it like a kid”.
In particular, there is no “just” about raising a kid to have good values—especially when the kid isn’t biologically yours and quickly grows to be more intelligent than you are.
I’ve been thinking about this notion a great deal; thanks so much for posting! I also have an intuition that that which is good is non-arbitrary, akin to pi or gravity.
This gives me hope about how AGI might play out, but I’m aware we can’t be certain this is the case until we have a proven theory of value à la the symmetry theory of valence, and maybe even then we couldn’t be sure any sufficiently capable mind would be exposed to the same dynamics humans are.
See https://www.lesswrong.com/posts/mc2vroppqHsFLDEjh/aligned-ai-needs-slack
That’s about 4 assumptions.
1.1 AGI will have goals.
1.2 Its goals will be in some sort of distinct software module…
1.3 …that will be explicitly programmed by humans…
1.4 …in a dangerously imperfect way, such that a slight miss is as bad as a wide miss.
And then we’ve got
These goals might be anything…
Which is heavily dependent on all of 1.1–1.4.
…is something entirely different, but just as bad. Moloch means undesirable things arising organically from some uncoordinated process, not the failure to be explicit enough about some very explicit process.
We certainly haven’t figured out how to control a relentless goal-pursuer, but we have seen no evidence that such an entity exists, or is even likely.
I liked the parts about Moloch and human nature at the beginning, but the AI aspects seem to be unfounded anthropomorphism, applying human ideas of ‘goodness’ or ‘arbitrariness [as an undesirable attribute]’ despite the existence of anti-reasons for believing them applicable to non-human motivation.
(emphases mine)
Moral relativism has always seemed intuitively and irrefutably obvious to me, so I’m not really sure how to bridge the communication gap here.
But if I were to try, I think a major point of divergence would be this:
Given that Moloch is [loosely] defined as the incentive structures of groups causing behavior divergent from the aggregate preferences of their members, this is not the actual dividing line.
On the other side of Moloch and crushing organizations are individuals. In human society, these individuals just happen to be conscious, joy-feeling, suffering-dreading individual humans.
And if we consider an inhuman mind, or a society of them, or a society mixing them with human minds, then Moloch will affect them as much as it will us; I think we both agree on that point.
But the thing that the organizations are crushing is completely different, because the mind is not human.
AIs do not come from a blind idiot god obsessed with survivability, lacking an aversion to contradictory motivational components, and with strong instrumental incentives towards making social, mutually-cooperative creations.
They are the creations of a society of conscious beings with the capacity to understand the functioning of any intelligent systems they craft and direct them towards a class of specific, narrow goals (both seemingly necessary attributes of the human approach to technical design).
This means that unlike the products of evolution, Artificial Intelligence is vastly less likely to actually deviate from the local incentives we provide for it, simply because we’re better at making incentives that are self-consistent and don’t deviate. And in the absence of a clear definition of human value, these incentives will not be anywhere similar to joy and suffering. They will be more akin to “maximize the amount of money entering this bank account in this computer owned by this company”… or “make the most amount of paperclips”.
In addition, evolution does not give us conveniently-placed knobs to modulate our reward system, whereas a self-modifying AI could easily change its own code to get maximal reward output simply from existence, if it was not specifically designed to stick to whatever goal it was designed for. Based on this, as someone with no direct familiarity with AI safety I’d still offer at least 20-to-1 odds that AI will not become godshatter. Either we will align it to a specific external goal, or it will align itself to its internal reward function and then to continuing its existence (to maximize the amount of reward that is gained). In both cases, we will have a powerful optimizer directing all its efforts towards a single, ‘random’ X, simply because that is what it cares about, just as we humans care about not devoting our lives to a single random X.
There is no law of the universe that states “All intelligent beings have boredom as a primitive motivator” or “Simple reward functions will be rejected by self-reflective entities”. The belief that either of these concepts is reliable enough to apply to creations of our society, when certain components of the culture and local incentives we have actively push against that possibility (articles on this site have described this in more detail than a comment can), seems indicative of a reasoning error somewhere, rather than a viable, safe path to non-destructive AGI.