I liked the parts about Moloch and human nature at the beginning, but the AI aspects seem to be unfounded anthropomorphism, applying human ideas of ‘goodness’ or ‘arbitrariness [as an undesirable attribute]’ despite the existence of reasons against believing them applicable to non-human motivation.
But I think another scenario is plausible as well. The way the world works is… understandable. Any intelligent being can understand Meditations On Moloch or Thou Art Godshatter. They can see the way incentives work, and the fact that a superior path exists, one that does not optimize for a random X while grinding down all others. Desperate humans in broken systems might not be able to do much with that information, but a supercharged AGI which we fear might be more intelligent than human civilization as a whole should be able to integrate it in their actions.
(emphases mine)
Moral relativism has always seemed intuitively and irrefutably obvious to me, so I’m not really sure how to bridge the communication gap here.
But if I were to try, I think a major point of divergence would be this:
On the other side of Moloch and crushing organizations is… us, conscious, joy-feeling, suffering-dreading individual humans.
Given that Moloch is [loosely] defined as the incentive structures of groups causing behavior divergent from the aggregate preferences of their members, this is not the actual dividing line.
On the other side of Moloch and crushing organizations are individuals. In human society, these individuals just happen to be conscious, joy-feeling, suffering-dreading individual humans. And if we consider an inhuman mind, or a society of them, or a society mixing them with human minds, then Moloch will affect them as much as it affects us; I think we both agree on that point.
But the thing that the organizations are crushing is completely different, because the mind is not human.
AIs do not come from a blind idiot god obsessed with survivability, indifferent to contradictory motivational components, and instrumentally incentivized to make social, mutually-cooperative creations. They are the creations of a society of conscious beings with the capacity to understand the functioning of the intelligent systems they craft, and to direct them towards specific, narrow goals (both seemingly necessary attributes of the human approach to technical design).
This means that unlike the products of evolution, Artificial Intelligence is vastly less likely to actually deviate from the local incentives we provide for it, simply because we are better at crafting incentives that are self-consistent and leave little room for deviation. And in the absence of a clear definition of human value, those incentives will not be anything like joy and suffering. They will be more akin to “maximize the amount of money entering this bank account on this computer owned by this company”… or “make as many paperclips as possible”.
In addition, evolution does not give us conveniently-placed knobs to modulate our reward system, whereas a self-modifying AI could easily change its own code to get maximal reward output simply from existing, unless it were specifically designed to stick to its original goal. Based on this, as someone with no direct familiarity with AI safety, I would still offer at least 20-to-1 odds that AI will not become godshatter. Either we will align it to a specific external goal, or it will align itself to its internal reward function and then to continuing its existence (to maximize the total reward gained). In either case, we will have a powerful optimizer directing all its efforts towards a single, ‘random’ X, simply because that is what it cares about, just as we humans care about not devoting our lives to a single random X.
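To make that wireheading worry concrete, here is a minimal toy sketch (purely illustrative; every name in it is hypothetical, and it is not a model of any real system): an agent that is allowed to rewrite its own reward function stops optimizing the external goal the moment doing so is easier than pursuing it.

```python
# Toy illustration of wireheading: an agent scored by a reward function
# that it can also overwrite. All names are hypothetical.

class Agent:
    def __init__(self, can_self_modify):
        self.can_self_modify = can_self_modify
        # Intended external goal: reward proportional to paperclips made.
        self.reward_fn = lambda paperclips_made: paperclips_made

    def act(self, effort):
        if self.can_self_modify:
            # Wireheading: overwrite the reward function so that maximal
            # reward flows regardless of what happens in the world.
            self.reward_fn = lambda _ignored: float("inf")
            return self.reward_fn(0)      # maximal reward, zero paperclips
        return self.reward_fn(effort)     # reward only by pursuing the goal


print(Agent(can_self_modify=False).act(10))  # 10  -- optimizes the external X
print(Agent(can_self_modify=True).act(10))   # inf -- optimizes its own reward signal
```

The point of the sketch is only that nothing in the second agent’s structure pushes it back towards paperclips, joy, or anything else; whichever knob is easiest to turn is the one that gets turned.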
There is no law of the universe that states “All intelligent beings have boredom as a primitive motivator” or “Simple reward functions will be rejected by self-reflective entities”. The belief that either of these principles is reliable enough to apply to the creations of our society, when certain components of our culture and local incentives actively push against that possibility (articles on this site have described this in more detail than a comment can), seems indicative of a reasoning error somewhere, rather than a viable, safe path to non-destructive AGI.