I had a thought while reading the section on Goodharting. You could fix a lot of the potential issues with an agentic AI by training it to want its impact on the world to stay within small bounds. Give it a strong and ongoing bias toward ‘leave the world the way I found it,’ one that can only be overridden by a very clear and large benefit to its other goals per small amount of change. This should not just be part of the decision-making process, but part of the goal state. It wouldn’t solve every possible issue, but it would solve a lot of them. In other words, make it unambitious and conservative; then, if it has a good model of the world, its interventions will be limited and precise.
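To make that tradeoff concrete, here is a minimal sketch of what “only overridden by a large benefit per small amount of change” could look like as an objective. The `benefit` and `impact` functions and the weight `lam` are placeholders, since specifying them is exactly the hard part:

```python
def act_or_abstain(candidate_actions, benefit, impact, lam=10.0, noop=None):
    """Pick the action whose benefit most exceeds a heavily weighted impact
    penalty; abstain (return the no-op) unless some action clears that bar.

    `benefit` and `impact` are placeholder scoring functions: contribution to
    the task goal, and deviation from 'the world as I found it'.  A large
    `lam` encodes the standing bias toward leaving things unchanged.
    """
    best = max(candidate_actions, key=lambda a: benefit(a) - lam * impact(a))
    # Only act at all if the benefit clearly outweighs the scaled change.
    return best if benefit(best) - lam * impact(best) > 0 else noop
```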
I think this is part of the idea behind Eliezer-corrigibility, and I agree that if executed correctly, this would be helpful.
The difficulties I see with this approach are:
1) how do you precisely specify what you mean by “impact on the world to be within small bounds”? This seems to require a measure of impact, which would be amazing to have (and is related to value extrapolation). One candidate shape for such a measure is sketched just after this list.
2) how do you instill the inner value of “being low impact” in the agent, and make sure that it generalizes in the intended way off the training distribution?
(these roughly correspond to the outer and inner alignment problems, respectively)
3) being low impact runs strongly counter to the “core of consequentialism”: convergent instrumental goals push pretty much any agent toward seeking power.
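For concreteness on (1), one family of proposals in the literature (attainable-utility-preservation-style penalties) measures impact as how much an action changes the agent’s ability to achieve a set of auxiliary goals, compared to doing nothing. A rough sketch, with every function assumed rather than given:

```python
def aup_style_impact(state, action, noop, aux_value_fns):
    """Impact proxy: average absolute change in attainable value, across a
    set of auxiliary value functions, between taking `action` and doing
    nothing.  The `aux_value_fns` map (state, action) to an estimated value;
    they are assumptions here, not something we currently know how to build.
    """
    diffs = [abs(v(state, action) - v(state, noop)) for v in aux_value_fns]
    return sum(diffs) / len(diffs)
```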
As with alternatives to utility functions, the practical solution seems to be to avoid explicit optimization (which is known to do the wrong things) and instead shape model-generated behaviors in other ways, without those behaviors getting explicitly reshaped into optimization. If there is no good theory of optimization (one that doesn’t predictably do the wrong things), optimization needs to be kept out of the architecture, so that it’s up to the system to come up with whatever optimization it decides on later, when it grows up. What an architecture needs to ensure is clarity of aligned cognition sufficient to eventually make decisions like that, not optimization (of the world) directly.
The easy first step is a simple bias toward inaction, which you can provide by imposing a large penalty on every output of any kind. For instance, a language model with this bias would write out something extremely likely, and then stop quickly thereafter. This is only a partial measure, of course, but it is a significant first step.
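A minimal sketch of how that could enter decoding, assuming we already have per-token log-probabilities for a candidate continuation (the penalty value is made up):

```python
def inaction_biased_score(token_logprobs, per_token_penalty=2.0):
    """Score a candidate continuation as its total log-probability minus a
    flat cost for every token emitted.  With a large penalty, the best-scoring
    continuations are short and extremely likely, and stopping early wins.
    """
    return sum(token_logprobs) - per_token_penalty * len(token_logprobs)
```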
The second through n-th steps are harder, and I really don’t know how to do them: how do you figure out what values to train it with so that it reduces its impact? The immediate things I can think of might also train deceit, so it would take some thought.
Also, throughout training, ask a panel (many separate panels) of judges to determine whether each action it proposes for use in hypothetical situations or games was the minimal action it could have taken for that level of positive impact. Obviously, if the impact is negative, it wasn’t the minimal action. Perhaps also train a network explicitly on the decisions of similar panels about such actions humans have taken, and use those same criteria.
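One way the panel’s judgments could be turned into a training signal (a sketch; the voting scheme and thresholds are my own assumptions):

```python
def panel_minimality_label(judge_votes, judged_impact):
    """Convert a panel's votes on one proposed action into a +1/-1 label.

    `judge_votes` is a list of booleans: did each judge think this was the
    minimal action for the positive impact achieved?  Negative judged impact
    is automatically non-minimal.  The label can be used directly as a reward,
    or as a target for a separate network trained to imitate such panels.
    """
    if judged_impact <= 0:
        return -1
    return 1 if sum(judge_votes) > len(judge_votes) / 2 else -1
```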
Somewhere in there (the best place is unknown), penalize heavy use of computation in coming up with plans (though perhaps not in evaluating them).
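Sketching that as a planner objective (the interfaces here are placeholders, not any real planner’s API):

```python
def pick_plan(generate_plans, evaluate, beta=0.1):
    """Choose a plan while charging for the compute spent *generating* it.

    `generate_plans` yields (plan, compute_used) pairs; `evaluate` scores a
    plan on the task.  Evaluation compute is deliberately left unpenalized,
    matching the parenthetical above.
    """
    scored = [(plan, evaluate(plan) - beta * compute_used)
              for plan, compute_used in generate_plans()]
    return max(scored, key=lambda pair: pair[1])[0]
```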
As a final step (and perhaps at other stages too), penalize any actions it takes that humans don’t like. This can be done in a variety of ways. For instance, have 3 random humans vote on each action it takes, and for each person who dislikes the action, give it a penalty.
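A toy version of that voting penalty, assuming a pool of reviewers we can query (everything here is a stand-in for actual human feedback infrastructure):

```python
import random

def human_vote_penalty(action, reviewer_pool, n_voters=3, per_dislike=1.0):
    """Sample n_voters random reviewers; each reviewer is a callable that
    returns True if they dislike the action.  The agent is charged a fixed
    penalty for every dislike, as described above.
    """
    voters = random.sample(reviewer_pool, n_voters)
    dislikes = sum(1 for judge in voters if judge(action))
    return -per_dislike * dislikes
```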