If I understand it correctly, and please correct me if I am mistaken, an approval-directed agent is an artificial intelligence that perfectly or near-perfectly simulates a person, and then implements a decision only if that (simulation of a) person would like the decision. Importantly, it does not compute the outcomes of such decisions and then determine which outcome maximises the person’s happiness; instead it uses the person’s heuristics (via the simulation) to determine whether or not the person would implement the decision, given more time to think about it. So the decision-making algorithm of the AI consists entirely of implementing the decisions that a faster human would.
Could you explain the difference between this approval-directed AI Arthur and an upload of the human Hugh? Or is there no difference? Under which conditions would they act differently, i.e. implement different strategies?
An approval-directed agent doesn’t simulate a person any more than a goal-directed agent simulates the universe. It tries to predict what actions the person would approve of, just as a goal-directed agent tries to predict what actions lead to good consequences. In the limit, the approval-directed agent approaches something more like an emulation, analogous to the way a goal-directed agent, in the limit, approaches a simulation of the universe.
So there are two big differences:
You can implement it now; it’s just an objective for your system, which it can satisfy to varying degrees of excellence—in the same way that you can build a system to rationally pursue a goal, with varying degrees of excellence.
The overseer can use the agent’s help when deciding what actions it approves of. This results in a form of implicit bootstrapping, since the agent is maximizing the approval of the (overseer + agent) system. In the limit of infinite computing power, the result would be an emulation with unlimited time (or, more precisely, the ability to instantiate copies of itself and immediately see their outputs, such that the copies can themselves delegate further). The hope is that a realistic system will converge to this ideal as well as it can given its limited capabilities, in the same way that a goal-directed system would move toward perfectly rational behavior.
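To make that objective concrete, here is a minimal sketch (in Python) of approval-directed action selection. The predict_approval function and the overseer object are hypothetical stand-ins; the point is only that the agent scores candidate actions by predicted approval rather than by predicted outcomes.

```python
# Minimal sketch of approval-directed action selection. `predict_approval`
# is a hypothetical model of the overseer's judgment; the agent never
# evaluates outcomes, only the predicted approval of each candidate action.

def choose_action(candidate_actions, predict_approval, overseer):
    """Return the candidate action with the highest predicted approval."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        # The overseer may itself consult the agent while judging,
        # which is where the implicit bootstrapping enters.
        score = predict_approval(overseer, action)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```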
Technology that can predict whether an action would be approved by a person or by an organization:
-Is practical to create, first applied to test cases, then to limited circumstances, then to more general cases.
-Can, for the test cases and the limited circumstances, be built with some existing machine learning technology, without deploying full-scale natural language processing.
-Treats approval/disapproval as a binary value, so appropriate machine learning approaches include logistic regression or decision-tree and random-forest methods. We fit a model on training data, and the model outputs P(approval | conditions); a minimal sketch follows this list. Such a model is not that different from one used to predict a purchase or a variety of other online behaviors.
-Would be useful to PEOPLE well before it became useful as a basis for selecting AI motivations.
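As a rough illustration of the kind of model meant here, the sketch below fits a logistic regression on invented approval data using scikit-learn; a real system would encode the conditions of each proposed action as the feature vector.

```python
# Sketch of the approval classifier described above, using scikit-learn.
# The feature values and labels are hypothetical: each row encodes the
# "conditions" of a proposed action, and the label records whether a
# human reviewer approved it (1) or disapproved (0).

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([
    [0.2, 1.0, 0.0, 3.5],
    [0.9, 0.0, 1.0, 1.2],
    [0.4, 1.0, 1.0, 2.8],
    [0.7, 0.0, 0.0, 0.9],
])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

# predict_proba returns [P(disapprove), P(approve)] per row,
# i.e. the P(approval | conditions) mentioned above.
new_action = np.array([[0.3, 1.0, 0.0, 2.0]])
p_approve = model.predict_proba(new_action)[0, 1]
print(f"P(approval | conditions) = {p_approve:.2f}")
```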
Predicting whether people would approve of a particular action is something that we could use machine learning for now.
These approaches advance the idea from a theoretical construct to an actual, implementable project.
Thanks to Paul for the seed insight.
In addition to determining whether an action would be approved using a priori reasoning, an approval-directed AI could also reference a large database of past actions which have either been approved or disapproved.
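One simple way to use such a database is a nearest-neighbour lookup: score a proposed action by the verdicts on the most similar past actions. A hedged sketch, assuming actions have already been encoded as feature vectors (the data here is invented):

```python
# Sketch of the database-lookup idea: estimate approval for a proposed
# action from the verdicts on its nearest neighbours in a (hypothetical)
# database of past, already-judged actions.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

past_actions = np.array([
    [0.1, 0.9, 0.3],
    [0.8, 0.2, 0.7],
    [0.4, 0.6, 0.5],
    [0.9, 0.1, 0.8],
])
verdicts = np.array([1, 0, 1, 0])  # 1 = approved, 0 = disapproved

lookup = KNeighborsClassifier(n_neighbors=3).fit(past_actions, verdicts)

proposed = np.array([[0.3, 0.7, 0.4]])
p_approve = lookup.predict_proba(proposed)[0, 1]
print(f"Estimated approval from similar past cases: {p_approve:.2f}")
```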
Alternatively, in advance of ever making any real-world decision, the approval-directed AI could generate example scenarios and propose actions, presenting them many thousands of times to people deemed effective moral reasoners. Their responses would greatly assist the system in constructing a model of whether an action is approvable, and by whom.
A lot of approval data could be created fairly readily, and the AI could then train on it.
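A hedged sketch of what that elicitation loop might look like; generate_scenario, ask_reviewer, and featurize are hypothetical stand-ins for the scenario generator, the reviewer interface, and the feature encoding:

```python
# Sketch of the elicitation loop: generate hypothetical scenarios, collect
# approve/disapprove verdicts from human reviewers, and train the approval
# model on the accumulated data. All three helper callables are
# hypothetical stand-ins, not part of any existing system.

import numpy as np
from sklearn.linear_model import LogisticRegression

def collect_approval_data(generate_scenario, ask_reviewer, featurize,
                          n_rounds=10_000):
    """Build a training set of (features, verdict) pairs from reviewers."""
    features, verdicts = [], []
    for _ in range(n_rounds):
        scenario, proposed_action = generate_scenario()
        verdict = ask_reviewer(scenario, proposed_action)  # 1 or 0
        features.append(featurize(scenario, proposed_action))
        verdicts.append(verdict)
    return np.array(features), np.array(verdicts)

def train_approval_model(features, verdicts):
    """Fit a simple approval classifier on the collected verdicts."""
    return LogisticRegression().fit(features, verdicts)
```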