I don’t think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you’ll be uncertain about what the human is going to do. If you’re uncertain in this way, and you are getting your goals from a human, then you don’t do all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)
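As a rough numerical illustration of the Off-Switch Game point: the binary ±1 utility, the specific probabilities, and the perfectly rational human below are simplifying assumptions added here for illustration, not part of the original analysis.

```python
# Minimal numerical sketch of the Off-Switch Game intuition (Hadfield-Menell et al., 2017),
# under simplifying assumptions added here: the proposed action has utility +1 or -1 for the
# human, the robot only knows P(U = +1) = p, and the human is perfectly rational
# (it lets the robot act iff U > 0).

def expected_values(p):
    """Expected utility of the robot's three options, given P(U = +1) = p."""
    act_directly = p * (+1) + (1 - p) * (-1)   # just take the action
    switch_off   = 0.0                          # shut down: utility 0 by definition
    defer        = p * (+1) + (1 - p) * 0.0     # let the human decide; a rational human
                                                # only allows the action when U = +1
    return act_directly, switch_off, defer

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    act, off, defer = expected_values(p)
    # Deferring is always at least as good as acting or shutting down, so an
    # uncertain robot has no incentive to disable its off switch.
    print(f"p={p:.1f}  act={act:+.2f}  switch_off={off:+.2f}  defer={defer:+.2f}")
```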
It may be that “goal-directed” is the wrong word for the property I’m talking about, but I’m predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
To clarify, you do still carry out the human’s instrumental subgoals, just not extra ones for yourself, right?
If you’ve seen the human acquire resources, then you’ll acquire resources in the same way.
If there’s now some new resource that you’ve never seen before, you may acquire it if you’re sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human or asking the human. If you imagine imitation learning exactly as we do it today, it would extrapolate in some way that isn’t necessarily what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
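Here is one hypothetical sketch of what “uncertainty it can resolve by watching or asking the human” could look like mechanically. The ensemble-of-policies approach, the agreement threshold, and every name in this snippet are illustrative assumptions, not something described above.

```python
# Hypothetical sketch: imitation learning with resolvable uncertainty. Keep an ensemble of
# imitation policies; when they disagree too much on a new situation, ask the human
# instead of extrapolating, then remember the answer.

import random
from collections import Counter

ACTIONS = ["acquire", "leave_alone"]          # illustrative action set

class EnsembleImitator:
    def __init__(self, n_models=5, threshold=0.8):
        # Each "model" is just a dict mapping observed situations to the human's action;
        # a real system would use learned function approximators instead.
        self.models = [dict() for _ in range(n_models)]
        self.threshold = threshold            # required agreement before acting alone

    def observe_human(self, situation, human_action):
        # Bootstrap-style update: each model sees the demonstration with some probability,
        # so the ensemble retains disagreement about rarely observed situations.
        for m in self.models:
            if random.random() < 0.8:
                m[situation] = human_action

    def act(self, situation, ask_human):
        votes = Counter(m.get(situation, random.choice(ACTIONS)) for m in self.models)
        action, count = votes.most_common(1)[0]
        if count / len(self.models) >= self.threshold:
            return action                     # confident: imitate without asking
        # Uncertain: resolve by querying the human, then remember the answer.
        human_action = ask_human(situation)
        self.observe_human(situation, human_action)
        return human_action

# Usage: a familiar resource is acquired the way the human did; a novel one triggers a query.
imitator = EnsembleImitator()
imitator.observe_human("familiar_resource", "acquire")
print(imitator.act("novel_resource", ask_human=lambda s: "leave_alone"))
```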