Another con of the motivation-competence decomposition: unlike definition-optimization, it doesn’t actually seem to be a clean decomposition of the larger task, such that we can solve each subtask independently and then combine the solutions.
For example, one way we could solve the motivation problem is by building a perfect human imitation (of someone who really wants to help H do what H wants), but then we seem to be stuck on the “competence” front, and there’s no clear way to plug this solution to “motivation” into a generic solution to “competence” to get a more competent intent-aligned agent. Instead it seems like we have to solve the competence problem that is particular to the specific solution to motivation, or solve motivation and competence together as one large problem.
In contrast, the problem of specifying an aligned utility function and the problem of building a safe EU maximizer seem to be naturally independent problems, such that once we have a specification of an aligned utility function (or a method of specifying aligned utility functions), we can just plug that into more and more powerful and robust EU maximizers.
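As a rough sketch of the kind of modularity I have in mind here (purely illustrative; the names and interfaces below are made up, not a real proposal):

```python
from typing import Callable, Iterable

State = str  # stand-in type for the outcomes a utility function scores

# "Definition" subproblem: produce an aligned utility function.
def aligned_utility(state: State) -> float:
    """Placeholder for a solution to the definition problem."""
    return float(len(state))  # dummy scoring, purely for illustration

# "Optimization" subproblem: a generic maximizer that accepts *any*
# utility function, so improving the optimizer never requires revisiting
# the definition.
def eu_maximizer(utility: Callable[[State], float],
                 candidates: Iterable[State]) -> State:
    """Pick the candidate with the highest utility (a trivial 'optimizer')."""
    return max(candidates, key=utility)

best = eu_maximizer(aligned_utility, ["option a", "longer option b"])
```

The point is just the interface: any stronger eu_maximizer composes with the same aligned_utility, and it’s that kind of independence that seems missing for motivation-competence.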
Furthermore I think this lack of clean decomposition shows up at the conceptual level too, not just the pragmatic level. For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn’t very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both? It seems arguable or hard to say. In contrast, in a system that is built using the definition-optimization decomposition, it seems like it would be easy to trace any safety failures to either the “definition” solution or the “optimization” solution.
I overall agree that this is a con. Certainly there are AI systems that are weak enough that you can’t talk coherently about their “motivation”. Probably all deep-learning-based systems fall into this category.
I also agree that (at least for now, and probably in the future as well) you can’t formally specify the “type signature” of motivation such that you could separately solve the competence problem without knowing the details of the solution to the motivation problem.
My hope here would be to solve the motivation problem and leave the competence problem for later, since in my view that solves most of the problem (I’m aware that you disagree with this).
I don’t agree that it’s not clean at the conceptual level. It’s perhaps less clean than the definition-optimization decomposition, but not much less.
For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn’t very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both?
This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.
This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.
It also seems like a failure of motivation though, because as soon as the Oracle started to do malign optimization, the system as a whole was no longer trying to do what H wants.
Or is the idea that as long as the top-level or initial optimizer is trying (or tried) to do what H wants, then all subsequent failures of motivation don’t count, so we’re excluding problems like inner alignment from motivation / intent alignment?
I’m unsure what your answer would be, and what Paul’s answer would be, and whether they would be the same, which at least suggests that the concepts haven’t been cleanly decomposed yet.
ETA: Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)? It seems really counterintuitive if the answer is “no”.
Oh, I see, you’re talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (but if you insisted on it, I’d say it fails motivation, mostly because the system doesn’t really have a single “motivation”).
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., combination of human imitation and Oracle)?
I would say the human imitation was intent aligned, and this helped improve the competence of the human imitation. I mostly wouldn’t apply this framework to the system (and I also wouldn’t apply definition-optimization to the system).
That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it
This was an unexpected answer. Isn’t HCH also such a multiagent system? (It seems very similar to what I described: a human with access to a superhuman Oracle, although HCH wasn’t what I initially had in mind.) IDA should converge to HCH in the limit of infinite compute and training data, so this would seem to imply that the motivation-competence framework doesn’t apply to IDA either. I’m pretty sure Paul would give a different answer, if we ask him about “intent alignment”.
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I’m pretty sure Paul would give a different answer, if we ask him about “intent alignment”.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
Even a very theoretically simple system like AIXI doesn’t seem to be “trying” to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to “know” that its actions won’t lead to reward.
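For reference, AIXI’s planning rule (roughly, in Hutter’s one-line formulation) makes the “definition” side completely explicit, namely expected total reward under the universal mixture:

\[
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_k + \cdots + r_m \big] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
\]

where \(U\) is a universal machine, \(q\) ranges over environment programs, \(\ell(q)\) is the length of \(q\), and \(m\) is the horizon.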
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
(For example, I think we can likely get OK definitions of what we value, along the lines of A Formalization of Indirect Normativity, but I’ve mostly stopped working along these lines because it no longer seems directly useful.)
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I agree.
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else? (I feel like we’ve had a similar discussion before and either it didn’t get resolved or I didn’t understand your position. I didn’t see a direct attempt to answer this in the comment I’m replying to, and it’s fine if you don’t want to go down this road again but I want to convey my continued confusion.)
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
I don’t understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments compared to say Rohin’s. Not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but just offering feedback in case you didn’t realize.)
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
This makes sense.
Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else?
The oracle is not aligned when asked questions that cause it to do malign optimization.
The human+oracle system is not aligned in situations where the human would pose such questions.
For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do.
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use benign when talking about possibly-incoherent systems, or things that don’t even resemble optimizers.
The definition in this post is a bit sloppy here, but I’m usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and wanted to admit vagueness in “what H wants it to do” (such that there can be several different preferences that are “what H wants”), we could say something like:
A is aligned with H if everything it is trying to do is “what H wants.”
That’s not great either though (and I think the original post is more at an appropriate level of attempted-precision).
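One possible formal reading of that tentative definition, just to make the quantifier structure explicit (Aligned, Goals, and Wants are informal placeholder symbols introduced here, not anything from the post):

\[
\mathrm{Aligned}(A, H) \iff \forall g \in \mathrm{Goals}(A) :\ g \in \mathrm{Wants}(H)
\]

and, since alignment in this sense can vary with the situation \(s\) the system finds itself in (as in the oracle example), a situation-relative variant:

\[
\mathrm{Aligned}(A, H, s) \iff \forall g \in \mathrm{Goals}(A, s) :\ g \in \mathrm{Wants}(H)
\]

Here \(\mathrm{Goals}\) stands for the (possibly many) things the system is trying to do, and \(\mathrm{Wants}(H)\) for the set of preferences that count as “what H wants.”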
(In the following I will also use “aligned” to mean “intent aligned”.)
The human+oracle system is not aligned in situations where the human would pose such questions.
Ok, sounds like “intent aligned at some points in time and not at others” was the closest guess. To confirm, would you endorse “the system was aligned when the human imitation was still trying to figure out what questions to ask the oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the oracle started working on the unsafe question”?
Given that intent alignment in this sense seems to be a property of a system+situation instead of the system itself, how would you define when the “intent alignment problem” has been solved for an AI, or when would you call an AI (such as IDA) itself “intent aligned”? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use “intent alignment” you always have some specific situation or set of situations in mind?
Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:
Isn’t HCH also such a multiagent system?
Yes, I shouldn’t have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is “trying to do”, i.e. I wouldn’t say it has a single “motivation”. This allows you to say “the system is not intent-aligned”, even though you can’t say “the system is trying to do X”.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.
Also, I want to note strong agreement with this:
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent.
HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) “What is a good approximation of the user’s utility function?” followed by “What action would maximize EU according to this utility function?”
ETA: If this isn’t clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.
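As a concrete toy rendering of that failure mode (everything here is hypothetical; ask_subtree and top_node are made-up names standing in for delegation to child nodes in the HCH tree):

```python
def ask_subtree(question: str) -> str:
    """Placeholder for handing `question` to a lower HCH node."""
    return f"<answer to: {question}>"

def top_node(user_request: str) -> str:
    # Step 1: reduce "help the user" to "infer a utility function"
    # (already a corrigibility mistake, per the example above).
    utility_spec = ask_subtree(
        "What is a good approximation of the user's utility function, "
        f"given the request {user_request!r}?"
    )
    # Step 2: hand that fixed proxy to an open-ended maximization query.
    # The tree as a whole is now pursuing the proxy, and lower nodes
    # working on subgoals of it can end up at cross purposes with each
    # other and with what the user actually wants.
    return ask_subtree(
        "What action would maximize expected utility according to "
        f"{utility_spec}?"
    )

print(top_node("Please help me plan my week."))
```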