Note that I don’t think of this as the Tool AI vs. Agent AI argument: it seems possible to build agent AI systems that are not goal-directed. For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.
I’m not very convinced by this example, or alternatively I’m not getting the distinction you’re drawing between “agent” and “goal-directed”. Suppose the agent you’re trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don’t see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed… I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you’re trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)
Your post reminded me of Paul Christiano’s approval-directed agents which was also about trying to find an alternative to goal-directed agents. Looking at it again, it actually sounds a lot like applying imitation learning to humans (except imitating a speeded-up human):
Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.
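The quoted rule has a very simple shape, which a few lines of code can make concrete. This is only an illustrative sketch: `estimate_rating` stands in for a learned model of the rating Hugh would give after deliberating at length, and all names and numbers are made up.

```python
def approval_directed_step(actions, estimate_rating):
    """Pick the action with the highest estimated overseer rating.

    estimate_rating(action) is a stand-in for a learned model of the
    rating Hugh would give the action if he considered it at length.
    """
    return max(actions, key=estimate_rating)

# Toy usage: a hand-written dict stands in for the learned estimator.
ratings = {"help": 0.9, "wait": 0.5, "deceive": -1.0}
best = approval_directed_step(list(ratings), ratings.get)
# best == "help"
```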
Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?
On the other hand, if the imitatee is not goal-directed… I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you’re trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)
I definitely endorse this point, think that it’s an important aspect, and that it alone justifies the claim that I was making about non-goal-directed Agent AI being possible.
That said, I do have an intuition that agents whose goal-directedness comes from other agents shouldn’t be considered goal-directed, at least if it happens in a particular way. Let’s say that I’m pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at “whatever goal Rohin has”, and this feels distinctly less goal-directed to me. (In particular, my AI agent would not have all of the convergent instrumental subgoals in this setting, so it is really different in kind from an AI agent that was simply pursuing X to the best of its ability.)
“Goal-directed” may not be the right word to capture the property I’m thinking about. It might be something like “thing that pursues the standard convergent instrumental subgoals”, or “thing that pursues a goal that is not defined in terms of someone else’s goal”.
Your post reminded me of Paul Christiano’s approval-directed agents which was also about trying to find an alternative to goal-directed agents.
Yeah, that idea was a big influence on the views that caused me to write this post.
Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?
It’s not exactly the same, but it is very similar. You could think of approval-direction as imitation of a particular weird kind of human, who deliberates for a while before choosing any action.
They feel different enough to me that there probably are safety-relevant differences, but I don’t know of any off the top of my head. Initially I was going to say that myopia was a safety-relevant difference, but thinking about it more I don’t think that’s an actual difference. Approval-directed agents are more explicitly myopic, but I think imitation learning could be myopic in the same way.
Btw, this post also views Paul’s agenda through the lens of constructing imitations of humans.
For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.
Let’s say that I’m pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at “whatever goal Rohin has”
What causes the agent to switch from X to Y?
Are you thinking of the “agent” as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?
I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you’d likely get an agent that continues to pursue goal X even after you’ve switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)
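The A/B distinction above can be made concrete with a toy imitation learner (this is purely illustrative; the "learner" here just predicts the most frequently demonstrated action, standing in for a trained network):

```python
from collections import Counter

class Imitator:
    """Toy imitation learner: predicts the most frequently
    demonstrated action. A stand-in for a trained network."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, action):          # training / online update
        self.counts[action] += 1

    def act(self):
        return self.counts.most_common(1)[0][0]

# A-type: a snapshot frozen after demonstrations of pursuing X.
a_type = Imitator()
for _ in range(3):
    a_type.observe("pursue_X")
frozen = a_type.act()                   # "pursue_X", forever

# B-type: the same learner, but it keeps observing the human,
# so when the human switches to Y it eventually follows.
b_type = Imitator()
for _ in range(3):
    b_type.observe("pursue_X")
for _ in range(5):
    b_type.observe("pursue_Y")
adapted = b_type.act()                  # now "pursue_Y"
```

The A-type snapshot keeps pursuing X after the human has moved on; the B-type system tracks "whatever goal the human has".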
Are you thinking of the “agent” as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?
I was imagining something more like B for the imitation learning case.
I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you’d likely get an agent that continues to pursue goal X even after you’ve switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)
That analysis seems right to me.
With respect to whether it is what I want, I wouldn’t say that I want any of these things in particular, I’m more pointing at the existence of systems that aren’t goal-directed, yet behave like an agent.
With respect to whether it is what I want, I wouldn’t say that I want any of these things in particular, I’m more pointing at the existence of systems that aren’t goal-directed, yet behave like an agent.
Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn’t when the human doesn’t)?
In which case, would it be fair to summarize (part of) your argument as:
1) Many of the potential problems with building safe superintelligent systems come from them being too goal-directed.
2) An agent that is only as goal-directed as a human is much less susceptible to many of these failure modes.
3) It is likely possible to build superintelligent systems that are only as goal-directed as humans.
I don’t think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you’ll be uncertain about what the human is going to do. If you’re uncertain in this way, and you are getting your goals from a human, then you don’t do all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)
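The Off-Switch Game point fits in a few lines: a robot that is uncertain about the human's utility for an action, and is maximizing that (unknown) utility, weakly prefers deferring to the human over acting unilaterally. A toy expected-value version, with an illustrative belief distribution:

```python
# Toy version of the Off-Switch Game argument. The robot is uncertain
# about the human's utility u for an action; a rational human permits
# the action only when u >= 0 and otherwise switches the robot off
# (payoff 0). The numbers are illustrative.

possible_utilities = [-1.0, 0.2, 0.8]        # robot's belief over u
probs = [0.3, 0.4, 0.3]

# Act now: the robot gets u regardless of its sign.
ev_act = sum(p * u for p, u in zip(probs, possible_utilities))

# Defer: bad outcomes (u < 0) are replaced by the off switch (0).
ev_defer = sum(p * max(u, 0.0) for p, u in zip(probs, possible_utilities))

assert ev_defer >= ev_act   # deferring never does worse
```

Deferring dominates exactly because the robot gets its goals from the human, so it has no incentive to preserve itself against the human's wishes.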
It may be that “goal-directed” is the wrong word for the property I’m talking about, but I’m predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
To clarify, you do do the human’s instrumental sub-goals though, just not extra ones for yourself, right?

If you’ve seen the human acquire resources, then you’ll acquire resources in the same way.
If there’s now some new resource that you’ve never seen before, you may acquire it if you’re sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human, or asking the human. If you imagine the exact way that we do imitation learning today, it would extrapolate somehow in a way that isn’t actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
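The "act when confident, otherwise check with the human" behavior described here (similar in spirit to DAgger-style querying) can be sketched as follows. Everything here is hypothetical: the confidence model, the threshold, and the situations are stand-ins.

```python
# Sketch of uncertainty-gated imitation: the imitator acts only when
# its model of "what the human would do" is confident; otherwise it
# defers to the human for more evidence.

def imitate_or_ask(situation, predict, ask_human, threshold=0.9):
    """predict(situation) -> (action, confidence in [0, 1]);
    ask_human(situation) -> the human's actual choice."""
    action, confidence = predict(situation)
    if confidence >= threshold:
        return action                    # confident: imitate directly
    return ask_human(situation)          # unsure: defer to the human

# Toy usage: confident on a familiar resource, unsure on a novel one.
def predict(s):
    return ("acquire", 0.95) if s == "familiar_resource" else ("acquire", 0.4)

def ask_human(s):
    return "leave_alone"                 # the human's actual preference

assert imitate_or_ask("familiar_resource", predict, ask_human) == "acquire"
assert imitate_or_ask("novel_resource", predict, ask_human) == "leave_alone"
```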
Btw, this post also views Paul’s agenda through the lens of constructing imitations of humans.
Right, so I think I wasn’t really making a new observation, but just clearing up a confusion on my own part, where for a long time I didn’t understand how the idea of approval-directed agency fits into IDA because people switched from talking about approval-directed agency to imitation learning (or were talking about them interchangeably) and I didn’t catch the connection. So at this point I understand Paul’s trajectory of views as follows:
goal-directed agent ⇒ approval-directed agent ⇒ use IDA to scale up approval-directed agent ⇒ approval-directed agency as a form of imitation learning / generalize to other forms of imitation learning ⇒ generalize IDA to safely scale up other (including more goal-directed / consequentialist) forms of ML (see An Unaligned Benchmark which I think represents his current views)
(Someone please chime in if this still seems wrong or confused.)
They feel different enough to me that there probably are safety-relevant differences
It looks like imitation learning isn’t one thing but a fairly broad category in ML which even includes IRL. But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you’re optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process. I wonder, when you wrote the OP, which form of imitation learning you had in mind?
ETA: From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?
But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you’re optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process.
But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?
Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?
From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?
I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure. I’ll add a pointer to this discussion to the post.
But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?
The human’s behavior could be safer because a human mind doesn’t optimize so much as to move outside of the range of inputs where approval is safe, or it has a “proposal generator” that only generates possible actions that with high probability stay within that range.
Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?
Same here: if you just predict what action a human would take, you’re less likely to optimize so hard that you end up outside the range where the estimation process is safe.
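The worry in this exchange (hard optimization against an estimator chases the estimator's errors, while staying near human-like proposals does not) can be shown with a toy model. All numbers are illustrative: the estimate is accurate on "ordinary" actions 0–9 and noisy on "weird" actions 10–99.

```python
import random

random.seed(0)

# True approval is only well-modeled on "ordinary" actions 0..9;
# on weird actions 10..99 the learned estimate has large errors.
def estimated_approval(action):
    if action < 10:
        return action / 10.0                 # accurate region
    return random.uniform(-1.0, 2.0)         # noisy, exploitable region

# Approval-directed: argmax over ALL actions tends to land on an
# overestimated weird action (Goodharting the estimator).
approval_choice = max(range(100), key=estimated_approval)

# Imitation-flavored: only consider actions a human would propose,
# which keeps the search inside the estimator's trustworthy region.
human_proposals = range(10)
imitation_choice = max(human_proposals, key=estimated_approval)

assert imitation_choice == 9                 # best ordinary action
# approval_choice will usually be >= 10, i.e. outside the region
# where the estimate is trustworthy.
```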
I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure.
Ok, I’d be interested to hear more if you clarify your thoughts.
Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?
I found an old comment from Paul that answers this:

I think that the only reason to be interested in approval-directed agents rather than straightforward imitation learners is that it may be harder to effectively imitate behavior than to solve the same task in a very different way.
Your post reminded me of Paul Christiano’s approval-directed agents which was also about trying to find an alternative to goal-directed agents. Looking at it again, it actually sounds a lot like applying imitation learning to humans (except imitating a speeded-up human):
It seems like approval direction allows for creative actions that the human operator approves of but would not have thought of doing themselves. Not sure if imitation learning does this.
That’s a good question. It looks like imitation learning actually covers a number of ML techniques (see this) none of which exactly matches approval-directed agents. But the category seems broad enough that I think approval-directed agents can be considered to be a form of imitation learning. In particular, IRL is considered a form of imitation learning and IRL would also be able to perform actions that the human would not have thought of doing themselves.
^ Yes to all of this.

A little bit of nuance: IRL is considered to be a form of imitation learning because in many cases the inferred reward in IRL is only meant to reproduce the human’s performance and isn’t expected to generalize outside of the training distribution.
There are versions of IRL which are meant to go beyond imitation. For example, adversarial IRL was trying to infer a reward that would generalize to new environments, in which case it would be doing something more than imitation.
Suppose the agent you’re trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don’t see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed… I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you’re trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)
I’m not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:
Imitation learning applied to humans produces goal-directed behavior only insofar as humans are goal-directed.
Imitation learning applied to humans produces agents no more capable than humans. (I think IDA goes beyond this by adding amplification steps, which are separate. And IRL goes beyond this by trying to correct “errors” that the humans make.)
Regarding the second point, there’s a safety-relevant sense in which a human-imitating agent is less goal-directed than the human: if you scale the human’s capabilities, the human becomes better at achieving its personal objectives, whereas if you scale the imitator’s capabilities, it is only supposed to become better at imitating the unscaled human.