On the other hand, if the imitatee is not goal-directed… I guess the agent could imitate humans and end up not entirely goal-directed, to the extent that humans are not entirely goal-directed. (Is this the point you’re trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)
I definitely endorse this point; I think it’s an important aspect, and that it alone justifies the claim I was making about non-goal-directed Agent AI being possible.
That said, I do have an intuition that agents whose goal-directedness comes from other agents shouldn’t be considered goal-directed, at least if it happens in a particular way. Let’s say that I’m pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at “whatever goal Rohin has”, and this feels distinctly less goal-directed to me. (In particular, my AI agent would not have all of the convergent instrumental subgoals in this setting, so it is really different in kind from an AI agent that was simply pursuing X to the best of its ability.)
“Goal-directed” may not be the right word to capture the property I’m thinking about. It might be something like “thing that pursues the standard convergent instrumental subgoals”, or “thing that pursues a goal that is not defined in terms of someone else’s goal”.
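Here’s a toy sketch of the distinction I’m gesturing at (the classes and names below are made up purely for illustration, not any real system):

```python
# A FixedGoalAgent keeps optimizing X no matter what its principal does; a
# DeferentialAgent re-derives its goal from the principal at every step, so
# its "goal" is really "whatever goal the principal currently has".

class Principal:
    def __init__(self, goal):
        self.goal = goal              # e.g. "X", later switched to "Y"

class FixedGoalAgent:
    def __init__(self, goal):
        self.goal = goal              # baked in once, at deployment

    def act(self, principal):
        return f"best action for {self.goal}"

class DeferentialAgent:
    def act(self, principal):
        return f"best action for {principal.goal}"

rohin = Principal("X")
fixed, deferential = FixedGoalAgent("X"), DeferentialAgent()
rohin.goal = "Y"                      # the principal switches goals
print(fixed.act(rohin))               # still pursues X
print(deferential.act(rohin))         # now pursues Y
```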
Your post reminded me of Paul Christiano’s approval-directed agents, which were also an attempt to find an alternative to goal-directed agents.
Yeah, that idea was a big influence on the views that caused me to write this post.
Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (sped-up) humans and approval-directed agents?
It’s not exactly the same, but it is very similar. You could think of approval-direction as imitation of a particular weird kind of human, who deliberates for a while before choosing any action.
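A minimal sketch of the two action-selection rules, assuming we already have some learned models (both predictor functions below are made-up stand-ins):

```python
import random

def predicted_human_action(state, actions):
    # stand-in for an imitation policy: "which action would the human take?"
    return random.choice(actions)

def predicted_approval(state, action):
    # stand-in for a learned model of how much the human would approve of `action`
    return random.random()

def imitation_act(state, actions):
    return predicted_human_action(state, actions)

def approval_directed_act(state, actions):
    # score every candidate and take the argmax over estimated approval
    return max(actions, key=lambda a: predicted_approval(state, a))

actions = ["a1", "a2", "a3"]
print(imitation_act("s0", actions), approval_directed_act("s0", actions))
```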
They feel different enough to me that there probably are safety-relevant differences, but I don’t know of any off the top of my head. Initially I was going to say that myopia was a safety-relevant difference, but thinking about it more I don’t think that’s an actual difference. Approval-directed agents are more explicitly myopic, but I think imitation learning could be myopic in the same way.
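To spell out the kind of myopia I mean (toy objectives, nothing standard): a myopic learner’s training signal for each step depends only on that step, whereas an RL-style learner is trained against a discounted sum over the whole future.

```python
GAMMA = 0.99

def myopic_targets(per_step_scores):
    # each step's training target depends only on that step (approval of this
    # action, or "did I match the human's action here?")
    return list(per_step_scores)

def nonmyopic_targets(per_step_rewards):
    # each step's training target is the discounted return of everything after it
    returns, g = [], 0.0
    for r in reversed(per_step_rewards):
        g = r + GAMMA * g
        returns.append(g)
    return list(reversed(returns))

print(myopic_targets([1, 0, 1]))      # [1, 0, 1]
print(nonmyopic_targets([1, 0, 1]))   # [1.9801, 0.99, 1.0]
```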
Btw, this post also views Paul’s agenda through the lens of constructing imitations of humans.
For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.
Let’s say that I’m pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at “whatever goal Rohin has”
What causes the agent to switch from X to Y?
Are you thinking of the “agent” as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?
I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you’d likely get an agent that continues to pursue goal X even after you’ve switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)
Are you thinking of the “agent” as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?
I was imagining something more like B for the imitation learning case.
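A minimal sketch of the A/B distinction (toy helpers, not a real training setup): in A the policy is frozen after training on goal-X demonstrations, while in B the “agent” includes the training loop itself and so tracks whatever the human is currently doing.

```python
def human_demo(goal):
    # stand-in for watching the human act while they pursue `goal`
    return [("s0", f"action-for-{goal}")]

def train_policy(demos):
    # stand-in for fitting a model to (state, action) pairs
    return dict(demos)

# Type A: train once, then freeze.
policy_A = train_policy(human_demo("X"))

# Type B: keep collecting fresh demonstrations and retraining online.
policy_B = None
for current_goal in ["X", "Y"]:       # the human later switches from X to Y
    policy_B = train_policy(human_demo(current_goal))

print(policy_A["s0"])                 # still the goal-X action
print(policy_B["s0"])                 # now the goal-Y action
```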
I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you’d likely get an agent that continues to pursue goal X even after you’ve switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)
That analysis seems right to me.
With respect to whether it is what I want, I wouldn’t say that I want any of these things in particular; I’m more pointing at the existence of systems that aren’t goal-directed, yet behave like agents.
With respect to whether it is what I want, I wouldn’t say that I want any of these things in particular; I’m more pointing at the existence of systems that aren’t goal-directed, yet behave like agents.
Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn’t when the human doesn’t)?
In which case, would it be fair to summarize (part of) your argument as:
1) Many of the potential problems with building safe superintelligent systems come from them being too goal-directed.
2) An agent that is only as goal-directed as a human is much less susceptible to many of these failure modes.
3) It is likely possible to build superintelligent systems that are only as goal-directed as humans.
I don’t think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you’ll be uncertain about what the human is going to do. If you’re uncertain in this way, and you are getting your goals from a human, then you don’t do all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)
It may be that “goal-directed” is the wrong word for the property I’m talking about, but I’m predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
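As a toy version of the off-switch argument (the numbers and the belief distribution here are made up; see the paper for the real analysis): suppose the agent is uncertain about the utility U of acting, while the human knows U and will hit the off switch exactly when U < 0.

```python
import random

random.seed(1)
beliefs = [random.gauss(0.0, 1.0) for _ in range(100_000)]   # agent's belief over U

ev_act   = sum(beliefs) / len(beliefs)                       # act now, bypass the human
ev_off   = 0.0                                               # switch itself off
ev_defer = sum(max(u, 0.0) for u in beliefs) / len(beliefs)  # propose the action, let the human veto

print(f"act: {ev_act:.3f}  off: {ev_off:.3f}  defer: {ev_defer:.3f}")
# Deferring weakly dominates both alternatives, so the uncertain agent has no
# incentive to disable its off switch; with a confident point estimate of U,
# that advantage disappears and the survival incentive comes back.
```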
To clarify, you would still pursue the human’s instrumental subgoals, just not extra ones of your own, right?
If you’ve seen the human acquire resources, then you’ll acquire resources in the same way.
If there’s now some new resource that you’ve never seen before, you may acquire it if you’re sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This assumes we have some way of doing imitation learning that lets the resulting system have uncertainty it can resolve by watching the human or asking the human. If you imagine imitation learning exactly as we do it today, the system would extrapolate in some way that isn’t actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it; who knows.
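Here’s a sketch of the kind of uncertainty-aware imitation I have in mind (hypothetical helper functions; today’s behavioral cloning doesn’t give you this for free):

```python
CONFIDENCE_THRESHOLD = 0.9

def predicted_action_distribution(state):
    # stand-in for a learned model of "what would the human do here?"
    return {"acquire": 0.55, "leave-alone": 0.40, "something-else": 0.05}

def act_or_ask(state, ask_human):
    dist = predicted_action_distribution(state)
    best_action, confidence = max(dist.items(), key=lambda kv: kv[1])
    if confidence >= CONFIDENCE_THRESHOLD:
        return best_action            # confident enough to imitate directly
    return ask_human(state)           # otherwise watch or ask the human first

print(act_or_ask("new-resource", ask_human=lambda s: "whatever the human demonstrates"))
```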
Btw, this post also views Paul’s agenda through the lens of constructing imitations of humans.
Right, so I don’t think I was really making a new observation, just clearing up a confusion on my own part: for a long time I didn’t understand how the idea of approval-directed agency fits into IDA, because people switched from talking about approval-directed agency to imitation learning (or talked about them interchangeably) and I didn’t catch the connection. So at this point I understand Paul’s trajectory of views as follows:
goal-directed agent ⇒ approval-directed agent ⇒ use IDA to scale up the approval-directed agent ⇒ approval-directed agency as a form of imitation learning / generalize to other forms of imitation learning ⇒ generalize IDA to safely scale up other (including more goal-directed / consequentialist) forms of ML (see An Unaligned Benchmark, which I think represents his current views)
(Someone please chime in if this still seems wrong or confused.)
They feel different enough to me that there probably are safety-relevant differences
It looks like imitation learning isn’t one thing but a fairly broad category in ML, which even includes IRL. But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier, because you’re optimizing over an estimate of human approval, which seems to be an adversarial process that could easily trigger safety problems both in the ground-truth human approval and in the estimation process. I wonder: when you wrote the OP, which form of imitation learning did you have in mind?
ETA: From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?
But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier, because you’re optimizing over an estimate of human approval, which seems to be an adversarial process that could easily trigger safety problems both in the ground-truth human approval and in the estimation process.
But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?
Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?
From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?
I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure. I’ll add a pointer to this discussion to the post.
But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?
The human’s behavior could be safer because a human mind doesn’t optimize hard enough to move outside the range of inputs where approval is safe, or because it has a “proposal generator” that only generates actions that stay within that range with high probability.
Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?
Same here: if you just predict what action a human would take, you’re less likely to optimize so hard that you end up outside the region where the estimation process is safe.
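A made-up model of this optimization-pressure point (the numbers mean nothing; they only illustrate the shape of the argument): suppose the approval estimate is accurate on human-typical actions but over-rates extreme actions it never saw during training. Sampling a human-like action stays in the safe region, while taking an argmax over many candidates actively seeks out the over-rated region.

```python
import random

def true_approval(x):
    return 1.0 if abs(x) <= 2 else -100.0          # extreme actions are actually terrible

def estimated_approval(x):
    return 1.0 if abs(x) <= 2 else 0.5 * abs(x)    # ...but the estimator extrapolates badly there

human_like_action = random.gauss(0, 1)             # roughly what a human proposal generator emits
candidates = [random.uniform(-10, 10) for _ in range(1000)]
argmax_action = max(candidates, key=estimated_approval)   # approval-directed choice

print("human-like action, true approval:", true_approval(human_like_action))  # almost always 1.0
print("argmax action, true approval:    ", true_approval(argmax_action))      # almost always -100.0
```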
I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure.
Ok, I’d be interested to hear more if you clarify your thoughts.