This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.
It also seems like a failure of motivation though, because as soon as the Oracle started to do malign optimization, the system as a whole is no longer trying to do what H wants.
Or is the idea that as long as the top-level or initial optimizer is trying (or tried) to do what H wants, then all subsequent failures of motivation don’t count, so we’re excluding problems like inner alignment from motivation / intent alignment?
I’m unsure what your answer would be, and what Paul’s answer would be, and whether they would be the same, which at least suggests that the concepts haven’t been cleanly decomposed yet.
ETA: Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimization. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., the combination of human imitation and Oracle)? It seems really counterintuitive if the answer is “no”.
Oh, I see, you’re talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (but if you insisted on it, I’d say it fails motivation, mostly because the system doesn’t really have a single “motivation”).
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimization. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., the combination of human imitation and Oracle)?
I would say the human imitation was intent aligned, and this helped improve the competence of the human imitation. I mostly wouldn’t apply this framework to the system (and I also wouldn’t apply definition-optimization to the system).
That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it
This was an unexpected answer. Isn’t HCH also such a multiagent system? (It seems very similar to what I described: a human with access to a superhuman Oracle, although HCH wasn’t what I initially had in mind.) IDA should converge to HCH in the limit of infinite compute and training data, so this would seem to imply that the motivation-competence framework doesn’t apply to IDA either. I’m pretty sure Paul would give a different answer if we asked him about “intent alignment”.
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I’m pretty sure Paul would give a different answer if we asked him about “intent alignment”.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
Even a very theoretically simple system like AIXI doesn’t seem to be “trying” to do just one thing, in the sense that it can e.g. exert considerable optimization power at things other than reward, even in cases where the system seems to “know” that its actions won’t lead to reward.
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
(For example, I think we can likely get OK definitions of what we value, along the lines of A Formalization of Indirect Normativity, but I’ve mostly stopped working along these lines because it no longer seems directly useful.)
It seems more obvious that multiagent systems just fall outside of the definition-optimization framework, which seems to be a point in its favor as far as conceptual clarity is concerned.
I agree.
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Yes, I’d say that to the extent that “trying to do X” is a useful concept, it applies to systems with lots of agents just as well as it applies to one agent.
So how do you see it applying in my example? Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else? (I feel like we’ve had a similar discussion before and either it didn’t get resolved or I didn’t understand your position. I didn’t see a direct attempt to answer this in the comment I’m replying to, and it’s fine if you don’t want to go down this road again, but I want to convey my continued confusion.)
You could say that AIXI is “optimizing” the right thing and just messing up when it suffers inner alignment failures, but I’m not convinced that this division is actually doing much useful work. I think it’s meaningful to say “defining what we want is useful,” but beyond that it doesn’t seem like a workable way to actually analyze the hard parts of alignment or divide up the problem.
I don’t understand how this is connected to what I was saying. (In general I often find it significantly harder to understand your comments compared to say Rohin’s. Not necessarily saying you should do something differently, as you might already be making a difficult tradeoff between how much time to spend here and elsewhere, but just offering feedback in case you didn’t realize.)
This makes sense.
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Would you say that the system in my example is both trying to do what H wants it to do, and also trying to do something that H doesn’t want? Is it intent aligned period, or intent aligned at some points in time and not at others, or simultaneously intent aligned and not aligned, or something else?
The oracle is not aligned when asked questions that cause it to do malign optimization.
The human+oracle system is not aligned in situations where the human would pose such questions.
For a coherent system (e.g. a multiagent system which has converged to a Pareto-efficient compromise), it makes sense to talk about the one thing that it is trying to do.
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things. I try to use “benign” when talking about possibly-incoherent systems, or things that don’t even resemble optimizers.
The definition in this post is a bit sloppy here, but I’m usually imagining that we are building roughly-coherent AI systems (and that if they are incoherent, some parts are malign). If you wanted to be a bit more careful with the definition, and wanted to admit vagueness in “what H wants it to do” (such that there can be several different preferences that are “what H wants”), we could say something like:
A is aligned with H if everything it is trying to do is “what H wants.”
That’s not great either though (and I think the original post is more at an appropriate level of attempted-precision).
(In the following I will also use “aligned” to mean “intent aligned”.)
The human+oracle system is not aligned in situations where the human would pose such questions.
Ok, sounds like “intent aligned at some points in time and not at others” was the closest guess. To confirm, would you endorse “the system was aligned when the human imitation was still trying to figure out what questions to ask the oracle (since the system was still only trying to do what H wants), and then due to its own incompetence became not aligned when the oracle started working on the unsafe question”?
Given that intent alignment in this sense seems to be a property of a system+situation rather than of the system itself, how would you define when the “intent alignment problem” has been solved for an AI, or when would you call an AI (such as IDA) itself “intent aligned”? (When we can reasonably expect to keep it out of situations where its alignment fails, for some reasonable amount of time, perhaps?) Or is it the case that whenever you use “intent alignment” you always have some specific situation or set of situations in mind?
Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:
Isn’t HCH also such a multiagent system?
Yes, I shouldn’t have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is “trying to do”, i.e. I wouldn’t say it has a single “motivation”. This allows you to say “the system is not intent-aligned”, even though you can’t say “the system is trying to do X”.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.
Also, I want to note strong agreement with this:
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent.
HCH can be incoherent. I think one example that came up in an earlier discussion was the top node in HCH trying to help the user by asking (due to incompetence / insufficient understanding of corrigibility) “What is a good approximation of the user’s utility function?” followed by “What action would maximize EU according to this utility function?”
ETA: If this isn’t clearly incoherent, imagine that due to further incompetence, lower nodes work on subgoals in ways that conflict with each other.