My current thoughts on this:

It seems like Paul’s proposed solution here depends on the rest of Paul’s scheme working (you need the human’s opinions on what effects are important to be accurate). Of course, if Paul’s scheme works in general, then it can be used for avoiding undesirable side effects.
My current understanding of how a task-directed AGI could work is: it has some multi-level world model that is mappable to a human-understood ontology (e.g. an ontology in which there is spacetime and objects), and you give it a goal that is something like “cause this variable here to be this value at this time step”. In general you want causal consequences of changing the variable to happen, but few other effects.
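Concretely, the goal might be represented as something like the following toy sketch (the field names and the house_exists example are purely illustrative, not anything from the discussion above):

```python
from dataclasses import dataclass

# Toy representation of a task goal stated in the human-mapped ontology:
# "cause this variable to take this value at this time step".
# The field names and example values are illustrative assumptions.
@dataclass(frozen=True)
class Goal:
    variable: str   # a node in the human-understood ontology
    value: object   # the value we want that variable to take
    timestep: int   # the time step at which it should hold

goal = Goal(variable="house_exists", value=True, timestep=100)
```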
From this paper I wrote:

It may be possible to use the concept of a causal counterfactual (as formalized by Pearl [2000]) to separate some intended effects from some unintended ones. Roughly, “follow-on effects” could be defined as those that are causally downstream from the achievement of the goal of building the house (such as the effect of allowing the operator to live somewhere). Follow-on effects are likely to be intended and other effects are likely to be unintended, although the correspondence is not perfect. With some additional work, perhaps it will be possible to use the causal structure of the system’s world-model to select a policy that has the follow-on effects of the goal achievement but few other effects.
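Here is a minimal sketch of the follow-on-effects idea, assuming the relevant fragment of the world-model is available as a causal DAG (the node names and the use of networkx are illustrative assumptions, not part of the proposal above):

```python
import networkx as nx

# Toy causal world-model fragment: edges point from cause to effect.
# Node names are illustrative assumptions.
world_model = nx.DiGraph([
    ("build_house", "house_exists"),
    ("house_exists", "operator_has_shelter"),
    ("build_house", "lumber_consumed"),
    ("build_house", "noise_near_site"),
])

goal = "house_exists"
action = "build_house"

# "Follow-on effects": everything causally downstream of achieving the goal.
follow_on = nx.descendants(world_model, goal)

# Effects of the action that are not downstream of the goal are candidate
# unintended side effects, to be scrutinized separately.
all_effects = nx.descendants(world_model, action)
candidate_side_effects = all_effects - follow_on - {goal}

print(follow_on)               # {'operator_has_shelter'}
print(candidate_side_effects)  # {'lumber_consumed', 'noise_near_site'}
```

A policy-selection step could then prefer policies whose predicted effects are mostly contained in the follow-on set, which is one way to cash out “the follow-on effects of the goal achievement but few other effects.”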
For things like “make money” there are going to be effects other than you having more money, e.g. some product was sold and others have less money. The hope here is that, since you have ontology mapping, you can (a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).
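Here is a toy, self-contained sketch of checks (a) and (b); the effect names, whitelist, and forecasts below are all made up for illustration:

```python
# Effects the system predicts (in the human-mapped ontology) from a candidate
# "make money" policy. All names and numbers are illustrative.
enumerated_effects = {"my_money": +100, "buyer_money": -100, "product_sold": 1}

# (a) Conservative scoring: only accept effects we can positively vouch for,
# rather than trading bad effects off against upside (so this need not be a
# utility function).
acceptable = {"my_money", "buyer_money", "product_sold"}
effects_ok = all(name in acceptable for name in enumerated_effects)

# (b) Check that the full forecast is explained by the enumerated effects:
# compare the forecast under the policy against a counterfactual forecast in
# which only the enumerated effects are imposed on the world-model.
forecast_under_policy = {"my_money": +100, "buyer_money": -100,
                         "product_sold": 1, "bank_security_weakened": 1}
forecast_given_effects_only = {"my_money": +100, "buyer_money": -100,
                               "product_sold": 1}

unexplained = {k: v for k, v in forecast_under_policy.items()
               if forecast_given_effects_only.get(k) != v}

print(effects_ok)   # True: every enumerated effect is on the whitelist
print(unexplained)  # {'bank_security_weakened': 1}: reject the policy anyway
```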
I think “win this war” is going to be a pretty difficult goal to formalize (as a bunch of what is implied by “winning a war” is psychological/sociological); probably it is better to think about achieving specific military objectives.
I realize I’m shoving most of the problem into the ontology mapping / transparency problem; I think this is correct, and that this problem should be prioritized, with the understanding that avoiding unintended side effects will be one use of the ontology mapping system.
EDIT: also worth mentioning that things get weird when humans are involved. One effect of a robot building a house is that someone sees a robot building a house, but how does this effect get formalized? I am not sure whether the right approach will be to dodge the issue (by e.g. using only very simple models of humans) or to work out some ontology for theory of mind that could allow reasoning about these sorts of effects.
(a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).
Are you aware of any previous discussion of this, in any papers or posts? I’m skeptical that there’s a good way to implement this scoring function. For example we do want our AI to make money by inventing, manufacturing, and selling useful gadgets, and we don’t want our AI to make money by hacking into a bank, selling a biological weapon design to a terrorist, running a Ponzi scheme, or selling gadgets that may become fire hazards. I don’t see how to accomplish this without the scoring function being a utility function. Can you perhaps explain more about how “conservatism” might work here?
It should definitely take desiderata into account; I just mean it doesn’t have to be VNM. One reason why it might not be VNM is if it’s trying to produce a non-dangerous distribution over possible outcomes rather than an outcome that is not dangerous in expectation; see Quantilizers for an example of this.
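For concreteness, here is a toy sketch of the quantilizer idea: instead of taking the utility-maximizing action, sample from a trusted base distribution restricted to its top q of probability mass as ranked by utility. The action names, base probabilities, and utility numbers below are made up, and this simplifies the actual construction:

```python
import random

def quantilize(actions, base_probs, utility, q=0.1, rng=random):
    # Rank actions from highest to lowest utility.
    ranked = sorted(zip(actions, base_probs),
                    key=lambda pair: utility(pair[0]), reverse=True)
    # Keep the best actions until they account for q of the base
    # distribution's probability mass.
    kept, mass = [], 0.0
    for action, p in ranked:
        kept.append((action, p))
        mass += p
        if mass >= q:
            break
    # Sample from the base distribution restricted to that top slice,
    # rather than deterministically taking the top action.
    total = sum(p for _, p in kept)
    return rng.choices([a for a, _ in kept],
                       weights=[p / total for _, p in kept])[0]

# Example: a "default" distribution over money-making actions, where weird
# actions are rare even if a naive utility estimate rates them highly.
actions    = ["sell_gadget", "cut_costs", "run_ponzi_scheme"]
base_probs = [0.60, 0.39, 0.01]
utility    = {"sell_gadget": 1.0, "cut_costs": 0.5, "run_ponzi_scheme": 10.0}.get

print(quantilize(actions, base_probs, utility, q=0.5))
# Almost always "sell_gadget": the Ponzi action scores highest on the naive
# utility but carries almost no base mass, so it is sampled only rarely.
```

The point is that the output is a reasonably safe distribution over actions, not an expectation-maximizing choice.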
In general, things like “don’t have side effects” are motivated by robustness desiderata: we don’t trust the AI to make certain decisions, so we would rather it be conservative. We might not want the AI to cause X, but also not want the AI to cause not-X. Things like this are likely to be non-VNM.
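As a toy illustration of the “cause X / cause not-X” symmetry (everything below is made up): the check compares a variable under the policy against its value under a no-op counterfactual, and objects to the AI moving it in either direction rather than preferring X or not-X as an outcome.

```python
# Toy illustration: object to the AI changing X in either direction,
# relative to a no-op counterfactual. All names are illustrative.

def conservative_veto(x_under_policy, x_under_noop):
    # We have no preferred value of X; we only object to the policy being
    # what changed it, whichever direction it changed.
    return x_under_policy != x_under_noop  # True means "veto this policy"

print(conservative_veto(x_under_policy=True, x_under_noop=False))  # True: it would cause X
print(conservative_veto(x_under_policy=False, x_under_noop=True))  # True: it would cause not-X
print(conservative_veto(x_under_policy=True, x_under_noop=True))   # False: X happens anyway
```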