We tried to be careful not to claim that merely making the decision algorithm aspiration-based is already sufficient to solve the AI safety problem, but maybe we need to add an even more explicit disclaimer in that direction. We explore this approach as a potentially necessary ingredient for safety, not as a complete plan for safety.
In particular, I fully agree that conflicting goals are also a severe problem for safety that needs to be addressed (though I don’t believe there is a unique problem for safety that deserves being called “the” problem). In my thinking, the goals of an AGI system are always the direct or indirect consequences of the task it is given by some human who is authorized to give it a task. If that is the case, the problem of conflicting goals is ultimately an issue of conflicting goals between humans. In your paperclip example, the system should reject the task of producing a trillion paperclips because that likely interferes with the foreseeable goals of other humans. I firmly believe we need to find a design feature that makes sure the system rejects tasks that conflict with other human goals in this way. For the most powerful systems, we might have to do something like what davidad suggests in his Open Agency Architecture, where plans devised by the AGI need to be approved by some form of human jury. I believe such a jury would reject almost all maximization-type goals and accept almost exclusively aspiration-type goals, and this is the reason why I want to find out how such a goal could then be fulfilled in a rather safe way.
Re quantilization/satisficing: I think that apart from the potentially conflicting goals issue, there are at least two more issues with plain satisficing/quantilization (understood as picking a policy uniformly at random from those that promise at least X return in expectation, or from those in the top X percent of the feasibility interval): (1) It might be computationally intractable in complex environments that require many steps, unless one finds a way to do it sequentially (i.e., from time step to time step). (2) The unsafe ways to fulfill the goal might not be scarce enough to have sufficiently small probability when choosing policies uniformly at random. The latter is the reason why I currently believe that the freedom to solve a given aspiration-type goal in all kinds of different ways should be used to select a policy that does so in a rather safe way, as judged on the basis of some generic safety criteria. This is why we also investigate in this project how generic safety criteria (such as those discussed for impact regularization in the maximization framework) should be integrated (see post #3 in the sequence).
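To make issue (2) concrete, here is a minimal sketch of the two selection rules as I understand them, with made-up policy names and expected returns (everything here is illustrative, not from the sequence itself):

```python
import random

def satisfice(policies, returns, aspiration):
    """Plain satisficing: choose uniformly at random among all policies
    whose expected return is at least the aspiration X."""
    acceptable = [p for p, r in zip(policies, returns) if r >= aspiration]
    return random.choice(acceptable)

def quantilize(policies, returns, fraction):
    """Feasibility-interval variant: accept policies whose expected return
    lies in the top `fraction` of the interval [min return, max return],
    then choose uniformly at random among them."""
    lo, hi = min(returns), max(returns)
    threshold = hi - fraction * (hi - lo)
    acceptable = [p for p, r in zip(policies, returns) if r >= threshold]
    return random.choice(acceptable)

# Toy example: five candidate policies with known expected returns.
policies = ["p1", "p2", "p3", "p4", "p5"]
returns = [1.0, 3.0, 5.0, 7.0, 9.0]

print(satisfice(policies, returns, aspiration=5.0))   # one of p3, p4, p5
print(quantilize(policies, returns, fraction=0.25))   # one of p4, p5
```

The point of issue (2) is visible here: if, say, two of the three acceptable policies happened to be unsafe, uniform random choice would still pick an unsafe one with probability 2/3. Scarcity of unsafe policies among the acceptable ones is doing all the safety work, and nothing in the rule itself guarantees it.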
Thank you for the warm encouragement.