Are you thinking of the “agent” as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?
I was imagining something more like B for the imitation learning case.
I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you’d likely get an agent that continues to pursue goal X even after you’ve switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)
That analysis seems right to me.
With respect to whether it is what I want, I wouldn’t say that I want any of these things in particular; I’m more pointing at the existence of systems that aren’t goal-directed, yet behave like agents.
Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn’t when the human doesn’t)?
In which case, would it be fair to summarize (part of) your argument as:
1) Many of the potential problems with building safe superintelligent systems come from them being too goal-directed.
2) An agent that is only as goal-directed as a human is much less susceptible to many of these failure modes.
3) It is likely possible to build superintelligent systems that are only as goal-directed as humans.
?
I don’t think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you’ll be uncertain about what the human is going to do. If you’re uncertain in this way, and you are getting your goals from a human, then you don’t do all of the instrumental subgoals. (See “The Off-Switch Game” for a simple analysis showing that you can avoid the survival incentive.)
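As a minimal numerical sketch of that off-switch point (a simplified one-shot model with an assumed-rational human, not the paper’s full game): a robot that is uncertain about the human’s utility for its proposed action does at least as well by deferring to the human as by acting unilaterally, so it gains nothing from resisting shutdown.

```python
import numpy as np

# Minimal sketch of the off-switch intuition (simplified, not the paper's exact game):
# the robot holds a belief over the human's utility U(a) for its proposed action.
rng = np.random.default_rng(0)
utility_samples = rng.normal(loc=0.3, scale=1.0, size=100_000)  # belief over U(a)

# Option 1: act unilaterally (e.g. disable the off switch), getting E[U(a)].
act_unilaterally = utility_samples.mean()

# Option 2: defer, letting the (assumed rational) human switch it off whenever U(a) < 0.
defer_to_human = np.maximum(utility_samples, 0.0).mean()

print(f"E[utility | act unilaterally] = {act_unilaterally:.3f}")
print(f"E[utility | defer to human]   = {defer_to_human:.3f}")
# In this simple model, deferring weakly dominates acting unilaterally whenever the
# robot is uncertain, so it has no incentive to prevent the human from shutting it off.
```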
It may be that “goal-directed” is the wrong word for the property I’m talking about, but I’m predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
To clarify, you do do the human’s instrumental subgoals though, just not extra ones for yourself, right?
If you’ve seen the human acquire resources, then you’ll acquire resources in the same way.
If there’s now some new resource that you’ve never seen before, you may acquire it if you’re sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human, or by asking the human. If you imagine the exact way that we do imitation learning today, it would extrapolate in some way that isn’t actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
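A rough sketch of what such an uncertainty-resolving imitator could look like (a hypothetical interface, not an existing system; `demo_policy` and `ask_human` are stand-ins): the policy imitates the human when it is confident about what the human would do, and otherwise falls back to watching or asking the human rather than extrapolating on its own.

```python
import numpy as np

def imitator_step(state, demo_policy, ask_human, confidence_threshold=0.9):
    """One decision step for an uncertainty-aware imitator (hypothetical interface).

    demo_policy(state) -> array of action probabilities learned from demonstrations.
    ask_human(state)   -> the action the human actually takes (a query or observation).
    """
    action_probs = np.asarray(demo_policy(state))
    best_action = int(np.argmax(action_probs))
    if action_probs[best_action] >= confidence_threshold:
        # Confident about what the human would do: imitate it directly.
        return best_action
    # Not confident (e.g. a resource the human has never been seen handling):
    # resolve the uncertainty by watching or asking the human instead of extrapolating.
    return ask_human(state)
```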