Aligned AI Needs Slack
(Half-baked)
Much has been said about slack on this site, starting with Zvi’s seminal post. The point I couldn’t find easily (I probably missed it) is that an aligned AI would need a fair bit of it. Having a utility function means zero slack: there is one thing you optimize, to the exclusion of everything else. And all precisely defined goals are necessarily Goodharted (or, in DnD terms, munchkined). An AI armed with a utility function will tile the world (the whole world, or its own “mental” world, or both) with smiley paperclips. For an AI (or for a natural intelligence) to behave non-destructively, it needs room to satisfice, not optimize. Optimal utility corresponds to a single state of the world among infinitely many, while adding slack to the mix expands the space of acceptable world states enough to potentially include those that are human-aligned. If an AGI is indifferent between a great many world states, that set might well include some that would be acceptable to humanity, and the AGI would have no incentive to try to trick its creators. Not being an ML person, I have no idea how to formalize this, or whether it has been formalized already. But I figured it’s worth writing a short note about. That is all.
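A toy sketch of the "slack widens the acceptable set" idea, not a formalization. Everything here is made up for illustration: the ten integer states, the utility numbers, the `slack` parameter, and the `human_ok` set are all assumptions. With zero slack only the single optimum is acceptable; with some slack the acceptable set grows and may start to overlap with what humans would accept.

```python
def acceptable_states(utilities, slack):
    """Return the states whose utility is within `slack` of the best achievable."""
    best = max(utilities.values())
    return sorted(s for s, u in utilities.items() if u >= best - slack)

# Hypothetical world: states 0..9, with utility peaking at state 7.
utilities = {state: -(state - 7) ** 2 for state in range(10)}

# Assumption: the states humans would find acceptable.
human_ok = {4, 5}

maximizer_set = acceptable_states(utilities, slack=0)   # only the optimum
satisficer_set = acceptable_states(utilities, slack=4)  # a band around it

maximizer_overlap = human_ok & set(maximizer_set)
satisficer_overlap = human_ok & set(satisficer_set)
```

In this toy setup the maximizer's acceptable set misses `human_ok` entirely, while the satisficer's set includes one of its states. Nothing here says the agent would actually pick that state; it only shows that the option exists.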
Another point is that when you optimize relentlessly for one thing, you might have trouble exploring the space adequately (you get stuck at local maxima). That’s why RL agents/algorithms often take random actions while they are training (they call this “exploration”, as opposed to “exploitation”). Maybe random actions can be thought of as a form of slack? Micro-slacks?
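For readers who haven't seen it, the random-action idea is usually illustrated with epsilon-greedy action selection on a multi-armed bandit. The sketch below is a minimal standard-library version; the function names, the Gaussian reward model, and the parameter values are assumptions chosen for illustration, not anything specific from the RL literature being referenced.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon pick a random arm, otherwise the best-looking arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: the "micro-slack"
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit

def run_bandit(true_means, epsilon, steps, seed=0):
    """Run a bandit and return how many times each arm was pulled."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[arm], 1.0)  # assumed noisy reward
        counts[arm] += 1
        # Incremental running average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts

pulls = run_bandit(true_means=[0.0, 1.0], epsilon=0.1, steps=200)
```

Setting `epsilon=0` gives the relentless optimizer that only ever follows its current estimates; any positive `epsilon` reserves a fraction of actions that are not spent on the current best guess.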
Look at Kenneth Stanley’s arguments about why objective functions are bad (video talk on it here). Basically he’s saying we need a lot more random exploration. Humans are similar—we have an open-ended drive to explore in addition to drives to optimize a utility function. Of course maybe you can argue the open-ended drive to explore is ultimately in the service of utility optimization, but you can argue the same about slack, too.
Good point. I can see a couple main ways to give an AI slack in its pursuit of optimization:
Set the level of precision on the various features of the world that it optimizes for. Lower precision would then mean the AI puts forth less effort toward optimizing them and puts up less of a fight against other agents (i.e., humans) who try to steer it away from the optimum. The AI might also have a goal-generating module that tries to predict what goals would produce what level of utility. Having lower precision would mean that it would sample goal states from a broader region of latent space and therefore be willing to settle for reaching the sub-optimal goals that result.
Seek Pareto improvements. When the AI is trying to optimize simultaneously for an entire ensemble of human preferences (or heuristics of those preferences) with no strict ordering among them, it would only fight for goals that improve on multiple dimensions simultaneously. World states on the same Pareto frontier would be equivalent from its perspective, and it could then work with humans who want to improve things according to their hidden preferences by just constraining any changes to lie on that same manifold.
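The Pareto idea can be sketched in a few lines. This uses the textbook definition (no dimension gets worse, at least one gets strictly better), which is slightly weaker than "improves on multiple dimensions simultaneously"; the score tuples are hypothetical stand-ins for an ensemble of human preferences.

```python
def is_pareto_improvement(current, proposed):
    """True if `proposed` is no worse on every dimension and strictly better on one."""
    no_worse = all(p >= c for c, p in zip(current, proposed))
    strictly_better = any(p > c for c, p in zip(current, proposed))
    return no_worse and strictly_better

def pareto_frontier(states):
    """Keep the states that no other state Pareto-improves on."""
    return [
        s for s in states
        if not any(is_pareto_improvement(s, other) for other in states)
    ]

# Hypothetical world states scored on two preference dimensions.
states = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 0)]
frontier = pareto_frontier(states)
```

Here `(2, 2)` and `(3, 0)` both sit on the frontier: moving between them trades one preference against another, so an agent that only fights for Pareto improvements would have no reason to resist a human choosing either one.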
I disagree somewhat. It is in principle possible to have an AI with a utility function whose single optimum is actually really nice. Most random utility functions are bad, but there are a few good ones.
Suppose a maximally indifferent AI: U(world)=3, i.e. constant. Whatever happens, the AI gets utility 3. It doesn’t care in the slightest about anything. How the AI behaves depends entirely on the tiebreaker mechanism.
Just because the AI is steering towards a huge range of worlds, some of them good, doesn’t mean we get a good outcome. It has “no incentive” to trick its creators, but no incentive not to, either. You have specified an AI that imagines a trillion trillion paths through time. Some good. Many not. Then it uses some undefined piece of code to pick one. Until you specify how this works, we can’t tell if the outcome will be good.
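To make the tiebreaker point concrete, here is a minimal sketch. The world names and the two tiebreaker rules are invented for illustration; the only thing taken from the comment is the constant utility of 3.

```python
def choose(worlds, utility, tiebreaker):
    """Pick among the utility-maximizing worlds using an explicit tiebreaker."""
    best = max(utility(w) for w in worlds)
    tied = [w for w in worlds if utility(w) == best]
    return tiebreaker(tied)

def constant_utility(world):
    """The maximally indifferent utility function: every world scores 3."""
    return 3

# Hypothetical candidate worlds, in arbitrary order.
worlds = ["flourishing", "status quo", "paperclips"]

pick_first = choose(worlds, constant_utility, tiebreaker=lambda tied: tied[0])
pick_last = choose(worlds, constant_utility, tiebreaker=lambda tied: tied[-1])
```

The utility function is identical in both calls, yet one returns `"flourishing"` and the other `"paperclips"`. All of the behavior lives in the part that the "indifferent" specification left undefined.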
*Citation Needed
https://www.lesswrong.com/posts/tnWRXkcDi5Tw9rzXw/the-design-space-of-minds-in-general
There are some states of the world you would consider good, so the utility functions that aim for those states are good too. There are utility functions that think X is bad and Y is good to the exact same extent you think these things are bad or good.
Thissss.… seems like a really really important point, and I kind of love it. Thanks for posting. I’m now going to sit around and think about this for a bit.
What you are describing is a satisficer: an AI which only optimizes its utility to some extent, then doesn’t care anymore. You may find this video interesting. I haven’t watched it, but it is said to show why satisficers are still dangerous.
What you’re describing is the difference between maximizers and satisficers.
I don’t think you really understand what slack is. You’re articulating something closer to the idea that the AI needs to be low-impact, but that’s completely different from being unconstrained. People with lots of “slack” in the sociological sense of the term that Zvi describes can still be extremely ambitious and resourceful. They tend to have more room to pursue prosocial ends, but the problem with an AI is that its goals might not be prosocial.
I understand the low impact idea, and it’s a great heuristic, but that’s not quite what I am getting at. The impact may be high, but the space of acceptable outcomes should be broad enough that there is no temptation for the AGI to hide and deceive. A tool becoming an agent and destroying the world because it strives to perform the requested operation is more of a “keep it low-impact” domain, but to avoid tension with the optimization goal, the binding optimization constraints should not be tight, which is what slack is. I guess it hints at the issues raised in https://en.wikipedia.org/wiki/Human_Compatible, just not the approach advocated there, “the AI’s true objective remain uncertain, with the AI only approaching certainty about it as it gains more information about humans and the world”.