It seems that Richard is pointing more toward minimizing how much effort an AI puts into satisfying its preferences than toward limiting how impactful its goals are allowed to be, although the two are tightly linked (it is more like minimizing the impact of the AI's behavior on its own energy reserves than on other agents or the environment).
One approach to laziness might be to predict how many joules of physical energy it would take to reach each candidate goal the AI considers. Goals that score higher on its value metric but would require substantially more energy could be passed over for less satisfying goals that are cheaper to achieve. As an example, an AI that values seeing smiles on human faces might consider either speaking friendly words to everybody or wiring up everyone’s facial muscles into perpetual smiles. Since the latter would require far more energy, laziness may lead it to prefer the former.
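To make that concrete, here is a minimal sketch of energy-penalized goal selection. The goal names, satisfaction scores, `estimated_joules` figures, and the `ENERGY_PENALTY` trade-off weight are all made-up placeholders, not anything from Richard's comment; in a real system the value metric and the energy estimates would come from learned models rather than a hand-written table.

```python
# Hypothetical stand-ins for the AI's value metric and its physical-cost model.
CANDIDATE_GOALS = {
    "speak_friendly_words": {"satisfaction": 0.8, "estimated_joules": 1e4},
    "wire_up_facial_muscles": {"satisfaction": 1.0, "estimated_joules": 1e9},
}

ENERGY_PENALTY = 1e-5  # trade-off weight (joules -> utility units); a free parameter


def lazy_goal_choice(goals: dict, penalty: float) -> str:
    """Pick the goal that maximizes satisfaction minus its energy cost."""
    def net_value(name: str) -> float:
        g = goals[name]
        return g["satisfaction"] - penalty * g["estimated_joules"]
    return max(goals, key=net_value)


print(lazy_goal_choice(CANDIDATE_GOALS, ENERGY_PENALTY))
# -> "speak_friendly_words": the more satisfying goal is passed over
#    because its energy cost dominates its extra satisfaction.
```

How lazy the agent is then comes down entirely to how the penalty weight is set, which is its own tuning problem.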
Another approach could be to minimize the amount of computation required to plan how to achieve its goals (which could, incidentally, also be measured in joules). The AI would thus prefer simple plans it can find quickly over more complicated plans that might take hours of Monte Carlo Tree Search to work out. Simpler plans would, in theory, also be easier for humans to understand and react to.
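A minimal sketch of what charging for deliberation might look like, assuming a toy random-search planner rather than real MCTS: every plan evaluation costs a fixed amount of utility, and the search stops as soon as even a perfect plan could no longer beat the best net score found so far. `plan_value`, `COMPUTE_PENALTY`, and the action set are all hypothetical placeholders.

```python
from itertools import product

COMPUTE_PENALTY = 1e-3   # utility charged per plan evaluated (a free parameter)
MAX_VALUE = 1.0          # upper bound of the toy evaluator below
MAX_EVALUATIONS = 10_000


def plan_value(plan: tuple) -> float:
    """Hypothetical stand-in for the AI's evaluation of a complete plan."""
    return (hash(plan) % 1000) / 1000.0


def lazy_planner(candidate_plans) -> tuple:
    """Search that charges itself for every plan it evaluates.

    The returned plan maximizes (plan value) minus (compute spent so far),
    so extra deliberation has to pay for itself, and the search halts once
    no future candidate could beat the current best net score.
    """
    best_plan, best_net = None, float("-inf")
    for step, candidate in enumerate(candidate_plans, start=1):
        net = plan_value(candidate) - COMPUTE_PENALTY * step
        if net > best_net:
            best_plan, best_net = candidate, net
        if MAX_VALUE - COMPUTE_PENALTY * step < best_net or step >= MAX_EVALUATIONS:
            break
    return best_plan


# Example: enumerate short plans over a toy action set, shortest first,
# so cheap-to-find simple plans are considered before complicated ones.
actions = "abcd"
plans = (p for length in range(1, 5) for p in product(actions, repeat=length))
print(lazy_planner(plans))
```

Enumerating shorter plans first means the compute penalty naturally biases the agent toward the simpler plans the paragraph above describes.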
Obviously, this doesn’t get anywhere close to solving alignment, and it likely won’t offer any guarantees, but I think it could still be a helpful tool in the alignment toolbox.
The general area of minimizing impact is called impact measures.