…therefore requires learning a detailed model of the user’s values in order to backchain through only good side effects
That would just be the normal utility function. The motivation here isn’t finding solutions to low-impact problems, it’s minimizing impact while solving problems. One way to do that is by, well, measuring impact.
If the AI has a detailed model of the user’s values and can therefore safely accomplish goals that intrinsically have lots of side effects, it can also apply that to safely accomplish goals that don’t intrinsically have lots of side effects, without needing a separate “avoiding side effects” solution.
This seems to assume a totally aligned agent and then to ask, “do we need anything else?” Well, no, we don’t need anything beyond an agent which works to further human values just how we want.
But we may not know that the AI is fully aligned, so we might want to install off-buttons and impact measures for extra safety. Furthermore, having large side effects correlates strongly with optimizing toward extreme regions of the solution space; balancing maximizing the original utility against minimizing a satisfactory, conservative impact measure (which whitelisting is not yet) bounds our risk in the case where the agent is not totally aligned.
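As a rough illustration of that balance (a minimal sketch, not any particular proposal), here is what trading off a hypothetical `utility` function for the original task against a hypothetical `impact` measure might look like; both functions and the penalty weight are placeholders:

```python
# Minimal sketch: trade off task utility against an impact penalty.
# `utility` and `impact` are hypothetical placeholders for whatever
# task objective and (satisfactory, conservative) impact measure the
# designer actually has; `weight` sets how conservative the agent is.

def penalized_score(plan, utility, impact, weight=10.0):
    """Original utility minus a scaled impact penalty."""
    return utility(plan) - weight * impact(plan)

def choose_plan(plans, utility, impact, weight=10.0):
    """Pick the candidate plan that best balances the two terms."""
    return max(plans, key=lambda p: penalized_score(p, utility, impact, weight))
```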
Ok, if I understand your view correctly, the long-term problem is better described as “minimizing impact” rather than “avoiding side effects” and it’s meant to be a second line of defense or a backup safety mechanism rather than a primary one.
Since “Concrete Problems in AI Safety” takes the short/medium term view and introduces “avoiding side effects” as a primary safety mechanism, and some people might not extrapolate correctly from that to the long run, do you know a good introduction to the “avoiding side effects”/”minimizing impact” problem that lays out both the short-term and long-term views?
ETA: Found this and this; however, both of them also seem to view “low impact” as a primary safety mechanism, in other words, as a way to get safe and useful work out of advanced AIs before we know how to give them the “right” utility function or otherwise make them fully value aligned.
Whoops, illusion of transparency! The Arbital page is the best I’ve found (for the long-term view); the rest I reasoned on my own and sharpened in some conversations with MIRI staff.
What do you think about Paul Christiano’s argument in the comment to that Arbital page?
Do you think avoiding side effects / low impact could work if an AGI was given a task like “make money” or “win this war” that unavoidably has lots of side effects? If so, can you explain why or give a rough idea of how that might work?
(Feel free not to answer if you don’t have well formed thoughts on these questions. I’m curious what people working on this topic think about these questions, and don’t mean to put you in particular on the spot.)
My current thoughts on this:
It seems like Paul’s proposed solution here depends on the rest of Paul’s scheme working (you need the human’s opinions on what effects are important to be accurate). Of course, if Paul’s scheme works in general, then it can be used for avoiding undesirable side effects.
My current understanding of how a task-directed AGI could work is: it has some multi-level world model that is mappable to a human-understood ontology (e.g. an ontology in which there is spacetime and objects), and you give it a goal that is something like “cause this variable here to be this value at this time step”. In general you want causal consequences of changing the variable to happen, but few other effects.
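A minimal sketch of such a goal specification, assuming a toy ontology where world states are dictionaries of named variables; all names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical sketch (names made up) of a task of the form
# "cause this variable here to be this value at this time step",
# checked against a predicted trajectory from the agent's world
# model, represented as a list of state dicts in a
# human-understood ontology.

@dataclass
class TaskGoal:
    variable: str   # variable name in the shared ontology
    target: Any     # value it should take
    time_step: int  # when it should hold

def goal_achieved(goal: TaskGoal, trajectory: List[Dict[str, Any]]) -> bool:
    """True if the predicted trajectory sets the variable as requested."""
    return trajectory[goal.time_step].get(goal.variable) == goal.target
```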
From this paper I wrote:
It may be possible to use the concept of a causal counterfactual (as formalized by Pearl [2000]) to separate some intended effects from some unintended ones. Roughly, “follow-on effects” could be defined as those that are causally downstream from the achievement of the goal of building the house (such as the effect of allowing the operator to live somewhere). Follow-on effects are likely to be intended and other effects are likely to be unintended, although the correspondence is not perfect. With some additional work, perhaps it will be possible to use the causal structure of the system’s world-model to select a policy that has the follow-on effects of the goal achievement but few other effects.
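For illustration only (this toy snippet is mine, not the paper’s), here is one way the “causally downstream” distinction could be sketched, assuming the world model exposes a causal graph; node names are invented for the house example, and networkx is used only for the reachability computation:

```python
import networkx as nx

# Toy causal graph for the house-building example; edges point from
# causes to effects. Node names are made up for illustration.
causal_graph = nx.DiGraph([
    ("build_house", "operator_has_home"),     # follow-on effect of the goal
    ("build_house", "neighbors_see_house"),   # also downstream of the goal
    ("gather_materials", "build_house"),
    ("gather_materials", "forest_depleted"),  # side effect not downstream of the goal
])

def follow_on_effects(graph: nx.DiGraph, goal_node: str) -> set:
    """Effects causally downstream from achieving the goal."""
    return nx.descendants(graph, goal_node)

def other_effects(graph: nx.DiGraph, goal_node: str) -> set:
    """Everything else the plan touches; more likely to be unintended."""
    downstream = follow_on_effects(graph, goal_node) | {goal_node}
    return set(graph.nodes) - downstream
```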
For things like “make money” there are going to be effects other than you having more money, e.g. some product was sold and others have less money. The hope here is that, since you have ontology mapping, you can (a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).
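A hedged sketch of checks (a) and (b); every function passed in below is a hypothetical stand-in for the ontology-mapped world model and for whatever conservative scoring rule is actually used, and the structure only mirrors the two checks rather than implementing them:

```python
from typing import Callable, Set

def approve_plan(
    plan,
    enumerate_effects: Callable[[object], Set[str]],
    score_effect: Callable[[str], float],
    predict_future: Callable[[object], Set[str]],
    predict_future_given_effects: Callable[[Set[str]], Set[str]],
    min_score: float = 0.0,
) -> bool:
    effects = enumerate_effects(plan)
    # (a) every enumerated effect must look acceptable to a conservative scorer
    if any(score_effect(e) < min_score for e in effects):
        return False
    # (b) the plan's predicted future consequences should match the
    # counterfactual in which only those enumerated effects occur,
    # i.e. there are no unexplained extra consequences
    return predict_future(plan) == predict_future_given_effects(effects)
```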
I think “win this war” is going to be a pretty difficult goal to formalize (as a bunch of what is implied by “winning a war” is psychological/sociological); probably it is better to think about achieving specific military objectives.
I realize I’m shoving most of the problem into the ontology mapping / transparency problem; I think this is correct, and that this problem should be prioritized, with the understanding that avoiding unintended side effects will be one use of the ontology mapping system.
EDIT: also worth mentioning that things get weird when humans are involved. One effect of a robot building a house is that someone sees a robot building a house, but how does this effect get formalized? I am not sure whether the right approach will be to dodge the issue (by e.g. using only very simple models of humans) or to work out some ontology for theory of mind that could allow reasoning about these sorts of effects.
(a) enumerate these effects and see if they seem good according to some scoring function (which need not be a utility function; conservatism may be appropriate here), and (b) check that there aren’t additional future consequences not explained by these effects (e.g. that are different from when you take a counterfactual on these effects).
Are you aware of any previous discussion of this, in any papers or posts? I’m skeptical that there’s a good way to implement this scoring function. For example we do want our AI to make money by inventing, manufacturing, and selling useful gadgets, and we don’t want our AI to make money by hacking into a bank, selling a biological weapon design to a terrorist, running a Ponzi scheme, or selling gadgets that may become fire hazards. I don’t see how to accomplish this without the scoring function being a utility function. Can you perhaps explain more about how “conservatism” might work here?
It should definitely take desiderata into account; I just mean it doesn’t have to be VNM. One reason why it might not be VNM is if it’s trying to produce a non-dangerous distribution over possible outcomes rather than an outcome that is not dangerous in expectation; see Quantilizers for an example of this.
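As a toy illustration of the quantilizer idea (a rough approximation, not the formal definition): sample candidate actions from a trusted base distribution, keep roughly the top q fraction by utility, and pick one of those at random instead of arg-maxing.

```python
import random

def quantilize(base_sample, utility, q=0.1, n=1000):
    """Toy q-quantilizer: draw n actions from the base distribution,
    keep the top q fraction by utility, and return one uniformly at
    random. This targets an acceptable *distribution* over outcomes
    rather than a single maximally optimized outcome."""
    actions = [base_sample() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    top = actions[: max(1, int(q * n))]
    return random.choice(top)

# Hypothetical usage: base_sample() draws from a distribution over
# "normal" actions, and utility scores how well each does the task.
```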
In general, things like “don’t have side effects” are motivated by robustness desiderata, where we don’t trust the AI to make certain decisions and so would rather it be conservative. We might not want the AI to cause X but also not want the AI to cause not-X. Things like this are likely to be non-VNM.
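One hedged way to picture “don’t cause X, but also don’t cause not-X”: score plans by how far they move watched features from a do-nothing counterfactual baseline, in either direction. All names below are hypothetical; the point is only that the score depends on a counterfactual baseline rather than on the realized outcome alone, which is why it doesn’t fit neatly into a fixed VNM utility over outcomes.

```python
# Sketch: penalize pushing any watched feature away from the
# "do nothing" counterfactual baseline in either direction.
# `baseline_features`, `predict_features`, and `weights` are
# hypothetical placeholders.

def symmetric_penalty(plan, baseline_features, predict_features, weights):
    """Penalty for moving any watched feature away from the baseline."""
    predicted = predict_features(plan)
    return sum(
        weights[f] * abs(predicted[f] - baseline_features[f])
        for f in baseline_features
    )
```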