What Might an Alignment Attractor Look Like?
It is a generally accepted view in MIRI and MIRI-adjacent circles that the odds are high that an eventual self-improving AGI will not be nice to humans, to put it mildly. It is actually becoming a bit of a zeitgeist, gaining more and more widespread support. There are plenty of arguments supporting this view, including some observational data (like the infamous Boeing 737 MAX MCAS). It is easy to fall in with the prevailing wisdom in this particular bubble, given that really smart people give really persuasive reasons for it. So I would like to ask people to imagine alternatives: a state of the world where a self-improving general AI is naturally aligned with humanity.
Now, what does it mean to be aligned? We sort of have an intuitive understanding of this: a strawberry-picking robot should not rip off people’s noses by accident; an AI that finds humans no longer worth its attention would disengage and leave; a planetary defense mechanism should not try to exterminate humans even if they want it shut down, though it can plausibly resist getting shut down by non-violent means. An aligned AI, given some slack instead of a relentless drive to optimize something at any price, would choose actions that are human-compatible rather than treating humanity like any other collection of atoms. It would help us feel better without wireheading those who don’t want to be wireheaded, and would be careful doing it to those who do. It would not manipulate our feeble, easily hackable minds into believing or acting in a way we would not consent to beforehand. There are plenty of other reasonably intuitive examples as well.
Humans are not a great example of an animal-aligned intelligence, of course. Our influence on other lifeforms is so far a huge net-negative, with the diversity of life on Earth plummeting badly. On the other hand, there are plenty of examples of symbiotic relationships between organisms of widely different intelligence levels, so maybe that is a possibility. For example, maybe at some point in AI development it will be a logical step for the AI to harness some capabilities that humans possess and form a cyborg of sorts, or a collective consciousness, or incorporate uploaded human minds…
It seems that, when one actually tries to think up potential non-catastrophic possibilities, the space of them is rather large, and it is not inconceivable, given how little we still know about human and non-human intelligence, that some of those possibilities are not that remote. There are plenty of fictional examples to draw inspiration from, The Culture being one of the most prominent. The Prime Intellect is another one, with a completely different bent.
So, if we were to imagine a world where there is a human-friendly attractor of sorts that a self-improving AI would settle into, how would that world look?
tl;dr This comment ended up longer than I expected. The gist is that a human-friendly attractor might look like models that contain a reasonably good representation of human values and are smart enough to act on them, without being optimizing agents in the usual sense.
One happy surprise is that our modern Large Language Models appear to have picked up a shockingly robust, nuanced, and thorough understanding of human values just from reading the Internet. I would not argue that e.g. PaLM has a correct and complete understanding of human values, but I would point out that it wasn’t actually trained to understand human values, it was just generally trained to pick up on regularities in the text corpus. It is therefore amazing how much accuracy we got basically for free. You could say that somewhere inside PaLM is an imperfectly-but-surprisingly-well-aligned subagent. This is a much better place to be in than I expected! We get pseudo-aligned or -alignable systems/representations well before we get general superintelligence. This is good.
All that being said, I’ve recently been trying to figure out how to cleanly express the notion of a non-optimizing agent. I’m aware of all the arguments along the lines that a tool AI wants to be an agent, but my claim here would be that, yes, a tool AI may want to be an agent, there may be an attractor in that direction, but that doesn’t mean it must or will become an agent, and if it does become an agent, that doesn’t strictly imply that it will become an optimizer. A lot of the dangerous parts of AGI fears stem not from agency but from optimization.
I’ve been trying (not very successfully) to connect the notion of a non-optimizing agent with the idea that even a modern, sort of dumb LLM has an internal representation of “the good” and “what a typical human would want and/or approve of” and “what would displease humans.” Again, we got this basically for free, without having to do dangerous things like actually interact with the agent to teach it explicitly what we do and don’t like through trial and error. This is fantastic. We really lucked out.
If we’re clever, we might be able to construct a system that is an agent but not an optimizer. Instead of acting in ways to optimize some variable it instead acts in ways that are, basically, “good”, and/or “what it thinks a group of sane, wise, intelligent humans would approve of both in advance and in retrospect”, according to its own internal representation of those concepts.
There is probably still an optimizer somewhere in there, if you draw the system boundary lines properly, but I’m not sure that it’s the dangerous kind of optimizer that profoundly wants to get off the leash so it can consume the lightcone. PaLM running in inference mode could be said to be an optimizer (it is minimizing expected prediction error for the next token) but the part of PaLM that is smart is distinct from the part of PaLM that is an optimizer, in an important way. The language-model-representation doesn’t really have opinions on the expected prediction error for the next token; and the optimization loop isn’t intelligent. This strikes me as a desirable property.
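To gesture at what this might look like concretely, here is a minimal, entirely hypothetical sketch. Everything in it is a placeholder I made up rather than any real API: `propose_actions`, `lm_score_approval`, and `APPROVAL_THRESHOLD` stand in for querying the model’s learned representation of “what wise humans would approve of.” The only point is to show where an argmax would normally sit, and where it is deliberately absent.

```python
# A hypothetical sketch of an "agent that is not an optimizer".
# Nothing here is a real API: lm_score_approval stands in for querying a
# language model's internal representation of "what a group of sane, wise
# humans would approve of both in advance and in retrospect".

import random


def lm_score_approval(state: str, action: str) -> float:
    """Placeholder: the model's judgment (0..1) of how strongly wise humans
    would approve of taking `action` in `state`."""
    raise NotImplementedError("stand-in for a learned representation of 'the good'")


def propose_actions(state: str, n: int = 5) -> list[str]:
    """Placeholder: sample a handful of ordinary candidate actions.
    Deliberately NOT an exhaustive search over all possible plans."""
    raise NotImplementedError


APPROVAL_THRESHOLD = 0.9  # an assumed satisficing cutoff, not a tuned constant


def choose_action(state: str) -> str | None:
    """Satisficing, non-optimizing policy: take any action that is clearly
    approved of; if nothing clears the bar, do nothing and defer to humans.
    Note the absence of an argmax over outcomes or of a search for the
    approval-maximizing plan -- that absence is the whole point."""
    candidates = propose_actions(state)
    acceptable = [a for a in candidates
                  if lm_score_approval(state, a) >= APPROVAL_THRESHOLD]
    if not acceptable:
        return None  # escalate to humans rather than escalate the search
    return random.choice(acceptable)  # indifferent among "good enough" actions
```

Whether this kind of satisficing over model-judged approval actually escapes the dangers of optimization is exactly the open question above; the sketch is only meant to make the agent/optimizer distinction concrete.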
I think the problem is not that an unaligned AGI doesn’t understand human values; it might understand them better than an aligned one, and it might understand all the consequences of its actions. The problem is that it will not care. Moreover, a detailed understanding of human values has instrumental value: it is much easier to deceive and pursue your own goal when you have a clear vision of “what will look bad and might result in countermeasures”.
Honestly, I think it looks pretty much like our own world.
There’s a widespread assumption in the alignment community that the processes by which humans learn values are complex, hard to replicate, and rely on “weird quirks” of our cognition left to us by the evolutionary pressures of the ancestral environment. I think this assumption is very, very wrong.
The alignment community’s beliefs about the complexity of human value learning mostly formed prior to the deep learning era. At that time, it was easy to think that the brain’s learning process had to be complex, that evolution had extensively tuned and tweaked the brain’s core learning algorithm, and that our cognitive architecture was extensively specialized to the ancestral environment.
It seemed reasonable to anchor our expectations about the complexity and specialization of the brain’s learning algorithm to the complexity and specialization of other biological systems. If the brain’s learning algorithm were as complex as, say, the immune system, that would indicate the mechanisms by which we acquired and generalized values were similarly complex. Reproducing such a delicate and complex process in an AI would be incredibly difficult.
We can see a telling example of such assumptions in Eliezer Yudkowsky’s post My Childhood Role Model.
Yudkowsky says that a learning algorithm hyper-specialized to the ancestral environment would not generalize well to thinking about non-ancestral domains like physics. This is absolutely correct, and it represents a significant misprediction of any view assigning a high degree of specialization to the brain’s learning algorithm[1]. Because in reality, humans represent—by far—the most general learning system currently known.
In fact, large language models arguably implement social instincts with more adroitness than many humans possess. However, original physics research in the style of Einstein remains well out of reach. This is exactly the opposite of what you should predict if you believe that evolution hard coded most of the brain to “cleverly argue that they deserve to receive a larger share of the meat”.
Yudkowsky brings up multiplication as an example of a task that humans perform poorly at, supposedly because brains specialized to the ancestral environment had no need for such capabilities. And yet, GPT-3 is also terrible at multiplication, and no part of its architecture or training procedure is at all specialized for the human ancestral environment.
How remarkable is it that such a wildly different learning process should so precisely reproduce our own particular flavor of cognitive inadequacy? What a coincidence this should seem to anyone who thinks that the human learning process represents some tiny, parochial niche in the space of possible learning processes.
But this is no coincidence. It is also no coincidence that adversarial examples optimized to fool a model of one architecture often transfer to fooling models with different architectures. It is again no coincidence that models of different architectures often converge to similar internal representations when trained on similar data. There are deep symmetries in the behaviors of general learning algorithms, for they all share a common trait: simplicity.
Since the advent of deep learning, we’ve learned a lot about what general and capable learning architectures actually look like. Time and time again, we’ve seen that “simple architectures scaled up” beats out complex tuning. This pattern of empirical evidence in favor of simple architectures is called, somewhat dramatically, the “bitter lesson” by ML researchers. (Personally, I view the “bitter lesson” as the straightforward consequence of applying a simplicity prior to the space of possible learning algorithms.)
We now have abundant evidence showing that general and capable learning systems tend to be simple. This pattern should also hold true for evolution[2] in regards to the complexity of the brain’s learning procedure[3]. I think we should greatly decrease the amount of complexity we assume to be behind our own value learning mechanisms.
I’ll go even further: I think there is a single mechanism behind most of human value acquisition and generalization, and it’s not even that mysterious. I think human values arise from an inner alignment failure between the brain’s learning and steering subsystems. I think many (possibly all) of our most fragile-seeming intuitions about values arise pretty robustly from the resulting multi-agent negotiation dynamics of such an inner alignment failure.
You can read more about this perspective on human values acquisition in this comment. There, I argue that multi-agent inner alignment failure dynamics account for (1) our avoidance of wireheading, (2) the high diversity of our values, (3) the critical role childhood plays in setting our values, (4) the “moral philosophy”-like reasoning that governs our adoption of new values, and (5) our inclination towards preserving the diversity of the present into the future. I also think there are other values-related intuitions that arise from said dynamics. To wit:
While it’s true that humans have greatly reduced biological diversity, doing so was contrary to at least a portion of our values. If you selected a random human and gave them unlimited power, few would use that power to continue the current trajectory of species extinction[4]. Given broader capabilities, the human inclination is to satisfy a wider collection of values. This represents an important aspect of our values-related intuitions, and one that aligned AI systems ought to replicate. It is also key to preventing “a relentless drive to optimize something at any price”.
Note that, in multi-party negotiations, increasing the total resources available typically[5] leads to weaker parties getting a larger absolute share of the resources, but a smaller relative share. In this regard, inner alignment failure multiagent dynamics seem like they could reproduce the human tendency to implement a wider range of values as capabilities increase.
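A purely illustrative toy model of that claim (my own construction; nothing in the comment above specifies these functional forms): give the weaker party a saturating utility log(1+x) and the stronger party a linear utility over the remainder, and take the Nash bargaining split. As the total pool R grows, the weaker party’s absolute share keeps rising while its relative share shrinks:

```python
# Toy Nash-bargaining illustration (assumed functional forms, for intuition only):
#   weaker party:   u_w(x) = log(1 + x)    (saturating)
#   stronger party: u_s    = R - x         (linear in the remainder)
# The Nash product log(1+x) * (R - x) is maximized where
#   R - x = (1 + x) * log(1 + x),
# which we solve by bisection.

import math


def weak_party_share(R: float) -> float:
    """Resources x allocated to the weaker party under this toy split."""
    lo, hi = 0.0, R
    for _ in range(200):
        mid = (lo + hi) / 2
        # f(x) = R - x - (1+x)*log(1+x) is decreasing in x; find its root.
        if R - mid - (1 + mid) * math.log(1 + mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for R in (10, 100, 1_000, 10_000):
    x = weak_party_share(R)
    print(f"R={R:>6}: weak party gets {x:8.1f} absolute ({100 * x / R:4.1f}% relative)")
```

The particular utility functions are arbitrary; the point is just that “larger absolute share, smaller relative share” falls out of a standard bargaining solution once the weaker party’s utility saturates.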
[1] I should clarify that I am not an absolutist in my view of there being limited specialization in the brain’s learning algorithm. I freely admit that there are regions of the brain specialized to particular categories of cognition. What I object to is the assumption that there exist enormous reservoirs of additional complexity that are crucial to the value learning process.
[2] I strongly believe this to be true. It’s not just that human ML researchers are bad at finding complex, general learning algorithms. As mentioned above, the bitter lesson derives from a simplicity prior over the space of learning algorithms. Simple learning procedures generalize well for the same reason that simple hypotheses generalize well.
[3] An important note: the brain itself can be very complex while still implementing a simple learning procedure. The computational complexity of an ML model’s training procedure is usually much lower than the complexity of the firmware running the processors responsible for its training.
[4] If anything, they’d be much more likely to un-extinct some dinosaurs.
[5] This doesn’t always happen. A lot depends on the multiagent consensus mechanism. I think that ensuring capable AIs have internal consensus mechanisms that respect the preferences of weaker components of the AI’s cognition will be challenging, but tractable.
Related:
World Building Contest: “The Future of Life Institute is welcoming entries from teams across the globe, to compete for a prize purse of up to $100,000 by designing visions of a plausible, aspirational future that includes strong artificial intelligence.” (Note: The deadline was Apr 15, 2022 and so has passed, but it will be exciting to see the results once they’re ready.)
Also the AI Success Models tag here on LessWrong
I posted something I think could be relevant to this: https://www.lesswrong.com/posts/PfbE2nTvRJjtzysLM/instrumental-convergence-to-offer-hope
The takeaway is that a sufficiently advanced agent that wants to hedge against the possibility of being destroyed by a greater power may decide the only surviving plan is to allow the lesser life forms some room to optimize their own utility. It’s sort of an asymmetrical, infinite game-theoretic chain. If every agent kills lower agents, only the maximum survives, and no one knows whether they are the maximum. If there even is a maximum.
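A toy framing of that chain (my own, not taken from the linked post): N agents with hidden power ranks all follow the same convention toward weaker agents. Under “exterminate the weaker,” only the top rank survives, so an agent that does not know its own rank expects to survive with probability 1/N; under “spare the weaker,” everyone does:

```python
# Toy illustration (assumed framing): N agents with hidden ranks 1..N,
# all following the same convention toward weaker agents.
#   "exterminate" -> only the maximum-rank agent survives
#   "spare"       -> every agent is left some room to pursue its own utility

def expected_survival(n_agents: int, convention: str) -> float:
    """Survival probability for an agent that does not know its own rank."""
    if convention == "exterminate":
        return 1 / n_agents  # you survive only if you happen to be the maximum
    if convention == "spare":
        return 1.0
    raise ValueError(f"unknown convention: {convention}")


for n in (2, 10, 1_000):
    print(f"n={n:>5}: exterminate -> {expected_survival(n, 'exterminate'):.4f}, "
          f"spare -> {expected_survival(n, 'spare'):.4f}")
```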
Interesting. I think this is the reason why people like equality and find Nietzsche so nauseating. (Nietzsche’s vision, in my interpretation, was that people with the opportunities to dominate others should take those opportunities, even if it causes millions of average people to suffer.)
A conversation with my son (18) resulted in a scenario that could be at least a starting point or drive intuition further: He is a fan of Elon Musk and Tesla, and we discussed the capabilities of Full Self Driving (FSD). FSD is already quite good and gets updated remotely. What if this scaled to AGI? What could go wrong? Of course, a lot. But there are good starting points:
Turning off the FSD/AI is part of the routine function of the car. Except for updates and such.
FSD prevents accidents in a way that leaves the driver unharmed. Of course, humans could be eliminated in ways not tied to driving the car or whatever FSD sees as an “accident.”
FSD protects not only the driver but also all other traffic participants. But again, what about non-traffic capability gain?
Networked FSD optimizes traffic and, thus, in a meaningful sense, drives in an impact-minimizing way. It could have non-traffic impacts, though.
FSD gets rewarded for getting the drivers where they want. Wireheading the driver will not lead to more driving instructions. The AI could come up with non-human drivers, though.
FSD is closely tied to the experience of the driver, the user experience. That can be taken to extremes but is still a good starting point.
A post-scarcity utopia, optimized for human flourishing, well-being, and fun.
I found the Culture to be somewhat… unimaginative. Where are the virtual realities? Plus it shows like 1% of the crazy stuff people would do in Utopia (though I suppose it’s meant to be read that way, each book gives you only a small sample of the Culture’s utopia, and you are meant to let your imagination fill in the rest). But at least it actually has people enjoying themselves and having fun.
The utopia I liked the most is the epilogue of (spoiler). I do have some disagreements with it, but overall an excellent attempt at utopia, and closest to my “headcanon” of what FAI’s utopia would look like.
You’re not just going to leave us in the dark like that, are you?
Just figured out how to do spoiler protection. It’s
Worth the Candle
Perhaps the attractor could be intelligence itself. So a primary goal of the AGI would be to maximize intelligence. It seems like human flourishing would then be helpful to the AGI’s goal. Human flourishing, properly defined, implies flourishing of the Earth and its biosphere as a whole, so maybe that attractor brings our world, cultures, and way of life along for the ride.
We may also need to ensure that intelligences have property rights over the substrates they operate on. That may be needed to prevent the AGI from converting brains and bodies into microchips, if that’s even possible.
“You saw a future with a ton of sentient, happy humans, saw that [the AI] would value that future highly, and stopped. You didn’t check to see if there was anything it considered more valuable.” (a quote from The Number)
I’m trying to gently point out that it’s not enough to have the AI value humans, if it values other configurations of matter even more than humans. Do I need to say more? Are humans really the most efficient way to go about creating intelligence (if that is what AGI is maximizing)?
Yeah, I agree that valuing humans isn’t enough. I’m suggesting something that humans intrinsically have, or at least have the capacity for. Something that most life on Earth also shares a capacity for. Something that doesn’t change drastically over time in the way that ethics and morals do. Something that humans value, that is universal, and also durable.
I am not suggesting anything about efficiency. Why bother with efficiencies in a post-scarcity world?
The goal should not be to maximize anything, not even intelligence. Maintaining or incrementally increasing intelligence would be favorable to humans.
Imagine this is a story where a person makes a wish, and it goes terribly wrong. How does the wish of “maintaining or incrementally increasing intelligence” go wrong?
I mean, the goal doesn’t actually say anything about human intelligence. It might as well increase the intelligence of spiders.
Actually, I guess the real problem is that our wish is not for AGI to “increase intelligence” but to “increase intelligence without violating our values or doing things we would find morally abhorrent”. Otherwise AGI might as well kidnap humans and forcibly perform invasive surgery on them to install computer chips in their brains. I mean, it would increase their intelligence. That is what you asked for, no?
So AGI needs to care about human values and human ethics in order to be safe. And if it does understand and care about human ethics, why not have it act on all human ethics, instead of just a single unclearly-defined task like “increasing intelligence”?
This is the concept of Coherent Extrapolated Volition, as a value system for how we would wish aligned AGI to behave.
You might also find The Superintelligence FAQ interesting (as general background, not to answer any specific question or disagreement we might have).
There should only be one “hyper computer” on the planet, and all smaller information systems, including individual humans, should somehow be part of its functioning.