With respect to whether it is what I want: I wouldn’t say that I want any of these things in particular; I’m more pointing at the existence of systems that aren’t goal-directed, yet behave like agents.
Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn’t when the human doesn’t)?
In which case, would it be fair to summarize (part of) your argument as:
1) Many of the potential problems with building safe superintelligent systems come from them being too goal-directed.
2) An agent that is only as goal-directed as a human is much less susceptible to many of these failure modes.
3) It is likely possible to build superintelligent systems that are only as goal-directed as humans.
?
I don’t think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you’ll be uncertain about what the human is going to do. If you’re uncertain in this way, and you are getting your goals from a human, then you don’t pursue all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)
It may be that “goal-directed” is the wrong word for the property I’m talking about, but I’m predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.
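Here’s a toy numeric version of the Off-Switch Game point, just to make it concrete (the numbers and the three-option framing are an illustrative simplification, not the paper’s full model):

```python
# Toy off-switch game (illustrative numbers only).
# The robot is unsure whether the human wants action `a`: from the robot's
# point of view, the action's utility U is +1 with probability p and -1
# with probability 1 - p.

p = 0.6  # robot's credence that the human actually wants action a

# Option 1: act unilaterally (equivalent to disabling the off switch).
ev_act = p * 1 + (1 - p) * (-1)   # 0.6 - 0.4 = 0.2

# Option 2: switch itself off.
ev_off = 0.0

# Option 3: defer -- propose `a` and leave the off switch to the human.
# A rational human permits `a` only when U = +1 and switches the robot
# off when U = -1.
ev_defer = p * 1 + (1 - p) * 0    # 0.6

assert ev_defer >= max(ev_act, ev_off)
print(ev_act, ev_off, ev_defer)   # deferring wins: roughly 0.2, 0.0, 0.6
```

As long as the robot is uncertain about U and expects the human to decide sensibly, deferring weakly dominates, so it has no incentive to prevent the human from switching it off; the survival incentive only reappears once the robot is (over)confident that it already knows what the human wants.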
To clarify: you do still pursue the human’s instrumental subgoals, just not extra ones for yourself, right?
If you’ve seen the human acquire resources, then you’ll acquire resources in the same way.
If there’s some new resource that you’ve never seen before, you may acquire it if you’re sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This assumes we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human or asking the human. If you imagine exactly the way we do imitation learning today, the system would extrapolate in some way that isn’t actually what the human would do: maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.
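To make “uncertainty that it can resolve by watching or asking the human” a bit more concrete, here’s a minimal sketch; the ensemble-vote confidence measure, the threshold, and the names are illustrative assumptions, not a claim about how imitation learning is actually done today:

```python
# Sketch: imitate when confident about what the human would do,
# otherwise fall back to watching/asking the human.
from collections import Counter

def imitate_step(state, ensemble, human, dataset, agreement_threshold=0.9):
    """ensemble: list of policies, each mapping state -> predicted human action
    human:    callable giving the human's actual action for `state`
    dataset:  list of (state, action) pairs used to retrain the ensemble"""
    predictions = [policy(state) for policy in ensemble]
    action, votes = Counter(predictions).most_common(1)[0]
    confidence = votes / len(ensemble)

    if confidence >= agreement_threshold:
        # Familiar case (e.g. a resource the human has been seen acquiring):
        # confidently do what the human would do.
        return action

    # Novel case (e.g. a resource never seen before): don't extrapolate;
    # gather more evidence by asking/watching the human instead.
    human_action = human(state)
    dataset.append((state, human_action))  # the uncertainty gets resolved here
    return human_action
```

The point of the sketch is just that the confident branch reproduces the human’s behavior for familiar resources, while the novel-resource case falls through to “ask or watch” rather than extrapolating to acquire, ignore, or burn the resource on its own.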