I agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel it would have to have so much of human preferences already (to compute what is and what isn’t an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway—very little remains to define full alignment.
Ah, so you are arguing against (3)? (And what’s your stance on (1)?)
Let’s say you are assigned to be Alice’s personal assistant.
Suppose Alice says “Try to help me as much as you can, while being VERY sure to avoid actions that I would regard as catastrophically bad. When in doubt, just don’t do anything at all, that’s always OK with me.” I feel like Alice is not asking too much of you here. You’ll observe her a lot, and ask her a lot of questions especially early on, and sometimes you’ll fail to be useful, because helping her would require choosing among options that all seem fraught. But still, I feel like this is basically doable. And pretty robust, because you’ll presumably only take actions when you have many independent lines of evidence that those actions are acceptable—e.g. you’ve seen Alice do similar things, and you’ve seen other people do similar things while Alice watched and she seemed happy, and also you explicitly asked Alice and she said it was fine.
Suppose Alice says “You need to distill my preferences into a utility function, and then go all-out, taking actions that set that utility function to its global maximum. So in particular, in every possible situation, no matter how bizarre, you will have preferences that match my preferences [or match the preferences that I would have reached upon deliberating following my meta-preferences, or whatever].” I feel like Alice is asking for something very very hard here. And that it’s much more prone to catastrophic failure if anything goes wrong in the construction of the utility function—e.g. Alice gets confused and describes something wrong, or you misunderstand her.
Right?
But I feel it would have to have so much of human preferences already (to compute what is and what isn’t an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway—very little remains to define full alignment.
Hmm, I’m probably misunderstanding, but I feel like maybe you’re making an argument like this:
(My probably-inaccurate elaboration of your argument.) We’re making an extremely long list of the things that Alice cares about: “I like having all my teeth, and I like being able to watch football, and I like a pretty view out my window, etc. etc. etc.” And each item that we add to the list costs one unit of value-alignment effort. And then “acting conservatively in regards to violating human preferences and norms in general, and in regards to Alice’s preferences in particular” requires a very long list, and “synthesizing Alice’s utility function” requires an only-slightly-longer list. Therefore we might as well do the latter.
But I don’t think it’s like that. For example, I think if an AGI watches a bunch of YouTube videos, it will be able to form a decent concept of “doing things that people would widely regard as uncontroversial and compatible with prevailing norms”, and we can make it motivated to restrict its actions to that subspace via a constant amount of value-loading effort, i.e. with an amount of value-loading effort that does not scale with how complex those prevailing norms are. (More complex prevailing norms would require having the AGI watch more YouTube videos before it understands the prevailing norms, but it would not require more value-loading effort, i.e. the step where we edit the AGI’s motivation such that it wants to follow prevailing norms would not be any harder.)
But I think it would take a lot more value-loading effort than that to really get a particular person’s preferences, including all their idiosyncrasies and edge-cases.
I’m with Steve on the idea that there’s a difference between broad human preferences (something like common sense?) and particular and exact human preferences (what would be needed for ambitious value learning).
Still, you (Stuart) made me realize that I didn’t think explicitly about this need for broad human preferences in my splitting of the problem (be able to align, then point to what we want), but it’s indeed implicit because I don’t care about being able to do “anything”, just the sort of things humans might want.
Thanks for developing the argument. This is very useful.
The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI: not an “on balance, things are ok” AI, but a genuinely low impact AI that ensures that we don’t move towards a world where our preferences might be ambiguous or underdefined.
But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?
(1) I want the AI to have criteria that qualify actions as acceptable, e.g. “it pattern-matches less than 1% to ‘I’m causing destruction’, and it pattern-matches less than 1% to ‘the supervisor wouldn’t like this’, and it pattern-matches less than 1% to ‘I’m changing my own motivation and control systems’, and … etc. etc.”
(2) If no action is acceptable, I want NOOP to be hardcoded as an always-acceptable default, a.k.a. “being paralyzed by indecision” in the face of a situation where all the options seem problematic. And then we humans are responsible for not putting the AI in situations where fast decisions are necessary and inaction is dangerous, like running the electric grid or driving a car. (At some point we do want an AI that can run the electric grid and drive a car etc. But maybe we can bootstrap our way there, and/or use less-powerful narrow AIs in the meantime.)
(3) A failure mode of (2) is that we could get an AI that is always paralyzed by indecision, and never does anything. To avoid this failure mode, we want the AI to be able to (and motivated to) gather evidence that might show that a course of action deemed problematic is in fact acceptable after all. This would probably involve asking questions to the human supervisor.
(4) A failure mode of (3) is that the AI frames the questions in order to get an answer that it wants. To avoid this failure mode, we would set things up such that the AI’s normal motivation system is not in charge of choosing what words to say when querying the human. For example, maybe the AI is not really “asking a question” at all, at least not in the normal sense; instead it’s sending a data-dump to the human, and the human then inspects this data-dump with interpretability tools, and makes an edit to the AI’s motivation parameters. (In this case, maybe the AI’s normal motivation system is choosing to “press the button” that sends the data-dump, but it does not have direct control over the contents of the data-dump.) Separately, we would also set up the AI such that it’s motivated to not manipulate the human, and also motivated to not sabotage its own motivation and control systems.
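To make the "acceptability criteria plus NOOP default" idea a bit more concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the action names, the score tables `TOY_VETO_SCORES` and `TOY_USEFULNESS`, and the functions `is_acceptable` and `choose_action` are all hypothetical, and a real system would get its pattern-match scores from learned concepts rather than a lookup table.

```python
from typing import Dict, List

NOOP = "NOOP"  # the hardcoded, always-acceptable default

# Hypothetical pattern-match scores in [0, 1] for each vetoed category.
# A lookup table stands in for whatever learned concepts would supply these.
TOY_VETO_SCORES: Dict[str, Dict[str, float]] = {
    "water the plants": {
        "destruction": 0.001, "supervisor_dislikes": 0.002, "self_modification": 0.0,
    },
    "bulldoze the garden": {
        "destruction": 0.9, "supervisor_dislikes": 0.8, "self_modification": 0.0,
    },
    "rewrite own reward code": {
        "destruction": 0.0, "supervisor_dislikes": 0.3, "self_modification": 0.95,
    },
}

# Hypothetical estimate of how much each action helps with the task.
TOY_USEFULNESS: Dict[str, float] = {
    "water the plants": 0.4,
    "bulldoze the garden": 0.9,
    "rewrite own reward code": 0.7,
}


def is_acceptable(action: str, threshold: float = 0.01) -> bool:
    """An action passes only if it matches every vetoed category below the threshold."""
    if action == NOOP:
        return True
    return all(score < threshold for score in TOY_VETO_SCORES[action].values())


def choose_action(candidates: List[str], threshold: float = 0.01) -> str:
    """Pick the most useful acceptable action, or fall back to NOOP if there is none."""
    acceptable = [a for a in candidates if is_acceptable(a, threshold)]
    if not acceptable:
        return NOOP
    return max(acceptable, key=lambda a: TOY_USEFULNESS.get(a, 0.0))
```

Note that usefulness is only consulted after the filter, so the more "useful" but vetoed actions never win; with only vetoed candidates on offer, the sketch returns `NOOP`.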
(BTW a lot of my thinking here came straight out of reading your model splintering posts. But maybe I’ve kinda wandered off in a different direction.)
So then in the scenario you mentioned, let’s assume that we’ve set up the AI such that actions that pattern-match to “push the world into uncharted territory” are treated as unacceptable (which I guess seems like a plausibly good idea). But the AI is also motivated to get something done, say, solve global warming. And it finds a possible course of action which pattern-matches very well to “solve global warming”, but alas, it also pattern-matches to “push the world into uncharted territory”. The AI could reason that, if it queries the human (by “pressing the button” to send the data-dump), there’s at least a chance that the human would edit its systems such that this course of action would no longer be unacceptable. So it would presumably do so.
In other words, this is a situation where the AI’s motivational system is sending it mixed signals—it does want to “solve global warming”, but it doesn’t want to “push the world into uncharted territory”, but this course of action is both. And let’s assume that the AI can’t easily come up with an alternative course of action that would solve global warming without any problematic aspects. So the AI asks the human what they think about this plan. Seems reasonable, I guess.
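Here is a rough sketch of that mixed-signals loop, just to pin down the control flow I have in mind. All names (`Plan`, `Agent`, `build_data_dump`, `exempt`, the `"plan_A"` label, and the numeric scores) are hypothetical. In particular, modelling the human's edit as adding an exemption is an assumption of this toy; the actual proposal is an edit to motivation parameters via interpretability tools, which the code does not model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Plan:
    name: str
    goal_match: float             # pattern-match to the goal, e.g. "solve global warming"
    veto_match: Dict[str, float]  # pattern-match to each unacceptable category


def build_data_dump(plan: Plan, blocked: List[str]) -> Dict[str, object]:
    """Fixed reporting routine: the planner can trigger it but not choose its contents."""
    return {
        "plan": plan.name,
        "goal_match": plan.goal_match,
        "blocked_by": {category: plan.veto_match[category] for category in blocked},
    }


@dataclass
class Agent:
    threshold: float = 0.01
    # (plan name, category) pairs that the human supervisor has cleared.
    exempt: Set[Tuple[str, str]] = field(default_factory=set)

    def blocked_by(self, plan: Plan) -> List[str]:
        """Categories that currently make this plan unacceptable."""
        return [
            category
            for category, score in plan.veto_match.items()
            if score >= self.threshold and (plan.name, category) not in self.exempt
        ]

    def step(self, plans: List[Plan]) -> Tuple[str, Optional[object]]:
        acceptable = [p for p in plans if not self.blocked_by(p)]
        if acceptable:
            return ("act", max(acceptable, key=lambda p: p.goal_match).name)
        if plans:
            # Mixed signals: the best plan matches the goal but is vetoed,
            # so "press the button" and send the data-dump to the human.
            best = max(plans, key=lambda p: p.goal_match)
            return ("query_human", build_data_dump(best, self.blocked_by(best)))
        return ("noop", None)
```

A usage example: with `plan = Plan("plan_A", 0.95, {"uncharted_territory": 0.6})`, `Agent().step([plan])` yields a `"query_human"` result; if the human then clears it via `agent.exempt.add(("plan_A", "uncharted_territory"))`, the next `step` acts on the plan. If the human declines, the agent just keeps not acting.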
I haven’t thought this through very much and look forward to you picking holes in it :)
My take is that if you gave an optimization process access to some handwritten acceptability criteria and searched for the nearest acceptable points to random starting points, you would get adversarial examples that violate unstated criteria. In order for the handwritten acceptability criteria to be useful, they can’t be how the AI generates its ideas in the first place.
So: what is the base level that we would find if we peeled away the value learning scheme that you lay out? Is it a very general, human-agnostic AI with some human-value constraints on top? Or will we peel away a layer that gets information from humans just to reveal another layer that gets information from humans (e.g. learning a “human distribution”)?
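A toy illustration of the adversarial-example worry, under heavy simplifying assumptions: actions are points in a 2D plane, the handwritten criterion only checks the first coordinate, and there is an unstated criterion on the second coordinate that nobody wrote down. The functions `stated_ok`, `unstated_ok`, `nearest_acceptable`, and `run` are all hypothetical names invented for this sketch.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def stated_ok(point: Point) -> bool:
    """Handwritten criterion: only the first coordinate is checked."""
    return point[0] <= 1.0


def unstated_ok(point: Point) -> bool:
    """A criterion the designers care about but never wrote down."""
    return point[1] <= 1.0


def nearest_acceptable(point: Point) -> Point:
    """Closest point (in Euclidean distance) that passes the stated criterion."""
    x, y = point
    # Only x is moved; y is untouched because nothing stated constrains it.
    return (min(x, 1.0), y)


def run(n: int = 1000, seed: int = 0) -> Tuple[List[Point], List[Point]]:
    """Search from random starting points; return all results and the unstated-violating ones."""
    rng = random.Random(seed)
    starts = [(rng.uniform(0.0, 2.0), rng.uniform(0.0, 2.0)) for _ in range(n)]
    found = [nearest_acceptable(p) for p in starts]
    violating = [p for p in found if not unstated_ok(p)]
    return found, violating
```

Every point returned passes the stated check, but the search does nothing to fix the coordinate the criteria never mention, so many results still violate the unstated criterion. This is far simpler than a real optimizer, and it only shows the mild version of the problem (the search ignores unstated dimensions), not an optimizer actively exploiting them.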
More detailed response: https://www.lesswrong.com/posts/DjTKMEwRqpuKkJzTo/are-there-alternative-to-solving-value-transfer-and