Laziness in AI
I did a quick Google search and couldn’t find much in the way of “lazy AI”, but it’s starting to seem more and more obvious as a strategy for alignment. Here’s the thought process:
-- Given almost any goal, an AI will want to obtain more power to be able to accomplish the goal, or to be more sure that it did accomplish the goal, or to prevent anyone from getting in its way of accomplishing the goal.
-- So even an AI with “human-level” intelligence will want to get more computing power, have safeguards against being shut down, and obtain resources like money and contacts.
-- Regular humans don’t usually act like this[1]. This has nothing to do with their intelligence (after all, we specified “human-level” intelligence), but with their willpower/personality/motivation. A normal human is more concerned with comfort, more afraid of hard work, and more content to do what everyone else is doing: not much. Together, these traits can be (somewhat harshly) described as laziness.
So maybe one further safeguard to put on an AI is to make it lazy? No lazy person has ever attempted to take over the world, after all. But lazy people are often nice, and willing to, say, answer questions.
What would this look like in practice? Give the AI harsh utility penalties for doing hard work, especially hard work right now. For example, at any given time, half of the AI’s remaining utility could depend on not doing work above a certain threshold in the next hour (so if the AI does well, 50% of its utility would be permanently locked in after its first hour, 75% after its second hour, and 87.5% after its third).
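A minimal sketch of the geometric schedule described above, assuming the simplest interpretation (each qualifying low-effort hour locks in half of whatever utility is still outstanding); the function name is illustrative, not from any real system:

```python
def lazy_utility(idle_hours: int) -> float:
    """Cumulative utility after `idle_hours` consecutive low-effort hours.

    Each hour spent under the work threshold permanently locks in half of
    the utility still outstanding: 50% after hour 1, 75% after hour 2,
    87.5% after hour 3, and so on (1 - 0.5**n in closed form).
    """
    return 1.0 - 0.5 ** idle_hours

print(lazy_utility(1))  # 0.5
print(lazy_utility(2))  # 0.75
print(lazy_utility(3))  # 0.875
```

The key property is that the marginal reward for staying lazy never runs out, but most of it accrues early, so the AI always has something immediate to lose by ramping up effort.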
Is this an insanely bad idea? Would this cripple the AI of its most valuable tool? Are there other human characteristics that would possibly make sense to implant in an AI, like boredom or self-consciousness? Has anyone else written about this?
[1] Most people don’t look at the world and say “what do I most want?” and “how can I best achieve it?” and then spend the next sixty years working at 100% capacity to gather resources, grow in power, and bend all of humanity to their iron will in hopes that this will give them a better chance at accomplishing their task.
The general area of minimizing impact is called impact measures.
It seems that Richard is pointing more toward minimizing how much effort an AI puts toward satisfying its preferences than toward how impactful its goals are allowed to be, although the two are tightly linked (more like minimizing its behavior’s impact on its own energy reserves than on other agents or the environment).
One approach to laziness might be to predict the amount of physical joules it would take to reach each candidate goal that it considers. Goals that would be more satisfying according to its value metric but that would require too much more energy to achieve could be passed over for less satisfying goals that require less energy. As an example, an AI that values seeing smiles on human faces might consider either speaking friendly words to everybody or wiring up everyone’s facial muscles into perpetual smiles. Since the latter would require much more energy to achieve, laziness may cause it to prefer the former.
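The smiles example above can be sketched as energy-penalized goal selection. This is a hypothetical toy, not anything from the post or an existing system; the goal names, values, and joule estimates are made up for illustration:

```python
def pick_goal(candidates, energy_weight=1.0):
    """Pick the candidate maximizing value minus an energy penalty.

    candidates: list of (name, value, estimated_joules) tuples.
    """
    return max(candidates, key=lambda c: c[1] - energy_weight * c[2])

goals = [
    ("speak friendly words to everybody", 5.0, 1.0),       # modest value, cheap
    ("wire everyone's face into smiles", 9.0, 1000.0),     # higher value, huge cost
]
best = pick_goal(goals)
print(best[0])  # the cheap, less satisfying goal wins once energy is penalized
```

The design choice here is that laziness enters as a penalty term rather than a hard cap, so a sufficiently valuable goal can still justify extra effort; a hard joule budget would be the stricter alternative.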
Another approach could be to minimize the amount of computation required to plan how to achieve its goals (which could, incidentally, also be measured in joules). It would thus prefer simple plans that it can figure out quickly over more complicated plans that might take hours of Monte-Carlo Tree Search to figure out. Simpler plans would be easier for humans to understand and react to, in theory.
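The computation-minimizing variant can be sketched as a planner with a hard budget on how many candidate plans it may evaluate before settling. Again a hypothetical illustration under made-up names, not a real planner:

```python
def lazy_plan(candidate_plans, score, budget=10):
    """Return the best plan among only the first `budget` candidates examined."""
    best, best_score = None, float("-inf")
    for i, plan in enumerate(candidate_plans):
        if i >= budget:
            break  # laziness: stop searching once the computation budget is spent
        s = score(plan)
        if s > best_score:
            best, best_score = plan, s
    return best

# Toy example: 1000 candidate plans, but only the first 10 are ever scored.
plans = ["p%d" % n for n in range(1000)]
print(lazy_plan(plans, score=lambda p: -len(p), budget=10))
```

Because candidates are typically generated simplest-first, capping the search this way biases the agent toward the short, legible plans the comment describes, at the cost of missing cleverer ones deeper in the search tree.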
Obviously, this doesn’t get anywhere close to solving alignment, and it likely won’t offer any guarantees, but I think it could still be a helpful tool in the alignment toolbox.
This is a good idea, and current methods that try to instill this kind of behavior are called [quantilization](https://www.lesswrong.com/tag/quantilization). The problems with these lazy AIs (or bounded-optimization-power AGIs) are threefold:
We’d still need to solve the inner alignment problem
You can’t get the AGI to do anything particularly complex or clever
There is still some probability the AGI kills you; it is just made very small. Also, in cases where the relevant killing-you action is a conjunction of many atomic actions (for instance, if it’s trying to build a bomb, the probability it outputs the next action in the bomb-construction process is 0.1%, the probability it outputs a not-terrible action is 99.9%, and there are 10 bomb-construction steps), then in the limit of the number of actions taken, the probability it completes the bomb goes to 1.
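A quantilizer, in its basic form, samples uniformly from the top q-fraction of actions under a base distribution rather than taking the single highest-utility action. A hedged sketch under toy assumptions (a uniform base distribution over a small discrete action space, with a made-up utility function):

```python
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Sample uniformly from the top q-fraction of actions ranked by utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # size of the top quantile
    return rng.choice(ranked[:k])

actions = list(range(100))  # toy action space; utility is the action's index
chosen = quantilize(actions, utility=lambda a: a, q=0.1)
print(chosen)  # always one of the 10 highest-utility actions (90..99)
```

This illustrates the third problem above: a dangerous action with small probability mass inside the top quantile still gets sampled occasionally, and repeated sampling over many timesteps lets those rare outputs accumulate.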
This is very close to an idea I had a couple of weeks ago: giving the AI a strong time preference to prevent long-term plotting and to make sure that a treacherous turn happens (too) early.
So maybe this kind of approach should be investigated more. (Or is it already being investigated and I am unaware?)
Isn’t that what Eliezer referred to as opti-meh-zation?