That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it
This was an unexpected answer. Isn’t HCH also such a multiagent system? (It seems very similar to what I described: a human with access to a superhuman Oracle, although HCH wasn’t what I initially had in mind.) IDA should converge to HCH in the limit of infinite compute and training data, so this would seem to imply that the motivation-competence framework doesn’t apply to IDA either. I’m pretty sure Paul would give a different answer, if we ask him about “intent alignment”.
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I’m pretty sure Paul would give a different answer, if we ask him about “intent alignment”.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
Even a very theoretically simple system like AIXI doesn’t seem to be “trying” to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to “know” that its actions won’t lead to reward.
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
(For example, I think we can likely get OK definitions of what we value, along the lines of A Formalization of Indirect Normativity, but I’ve mostly stopped working along these lines because it no longer seems directly useful.)
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I agree.
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else? (I feel like we’ve had a similar discussion before and either it didn’t get resolved or I didn’t understand your position. I didn’t see a direct attempt to answer this in the comment I’m replying to, and it’s fine if you don’t want to go down this road again but I want to convey my continued confusion.)
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
I don’t understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments compared to, say, Rohin’s. I’m not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but just offering feedback in case you didn’t realize.)
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
This makes sense.
Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else?
The oracle is not aligned when asked questions that cause it to do malign optimization.
The human+oracle system is not aligned in situations where the human would pose such questions.
For a coherent system (e.g. a multiagent system which has converged to a Pareto efficient compromise), it makes sense to talk about the one thing that it is trying to do.
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use “benign” when talking about possibly-incoherent systems, or things that don’t even resemble optimizers.
The definition in this post is a bit sloppy here, but I’m usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and to admit vagueness in “what H wants it to do” (such that there can be several different preferences that are “what H wants”), we could say something like:
A is aligned with H if everything it is trying to do is “what H wants.”
That’s not great either though (and I think the original post is at a more appropriate level of attempted precision).
(In the following I will also use “aligned” to mean “intent aligned”.)
The human+oracle system is not aligned in situations where the human would pose such questions.
Ok, sounds like “intent aligned at some points in time and not at others” was the closest guess. To confirm, would you endorse “the system was aligned when the human imitation was still trying to figure out what questions to ask the oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the oracle started working on the unsafe question”?
Given that intent alignment in this sense seems to be a property of a system+situation instead of the system itself, how would you define when the “intent alignment problem” has been solved for an AI, or when would you call an AI (such as IDA) itself “intent aligned”? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use “intent alignment” you always have some specific situation or set of situations in mind?
Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:
Isn’t HCH also such a multiagent system?
Yes, I shouldn’t have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is “trying to do”, i.e. I wouldn’t say it has a single “motivation”. This allows you to say “the system is not intent-aligned”, even though you can’t say “the system is trying to do X”.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map onto the statement:
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.
Also, I want to note strong agreement with this:
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent.
HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) “What is a good approximation of the user’s utility function?” followed by “What action would maximize EU according to this utility function?”
ETA: If this isn’t clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.