Thanks! I think that our argument doesn’t depend on all possible goals being describable this way. It depends on useful tasks (the ones AI designers are trying to achieve) being driven in large part by pursuing outcomes. As a counterexample, behavior that is defined entirely by local constraints (e.g. a calculator, or the “hand on wall” maze algorithm) isn’t the kind of algorithm that is a source of AI risk (and it also isn’t as useful in some ways).
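To make the distinction concrete, here is a minimal sketch (my own illustration, not from the original argument; the grid-maze setup and function names are assumed). The wall-follower’s behavior is fully specified by local rules about adjacent cells and never represents a destination, while the outcome-driven search is specified by a goal state and will route around whatever obstacles you add, so long as any route exists.

```python
# Contrast: locally-constrained behavior vs. outcome-driven behavior,
# on a simple grid maze given as a set of wall cells. Hypothetical setup.

from collections import deque

def wall_follower_step(pos, heading, walls):
    """Locally-constrained behavior: keep the right hand on the wall.
    Each step depends only on the immediately adjacent cells; there is
    no representation of an exit or destination anywhere."""
    # headings: 0=N, 1=E, 2=S, 3=W; prefer right turn, then straight, left, back
    deltas = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    for turn in (1, 0, 3, 2):
        h = (heading + turn) % 4
        nxt = (pos[0] + deltas[h][0], pos[1] + deltas[h][1])
        if nxt not in walls:
            return nxt, h
    return pos, heading  # boxed in on all sides

def outcome_driven_path(start, goal, walls, size):
    """Outcome-driven behavior: breadth-first search toward a specified
    goal state. Whatever obstacles are placed, it re-derives a route to
    the goal if one exists -- the 'gets to its destination' property."""
    frontier, parent = deque([start]), {start: None}
    while frontier:
        cur = frontier.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for d in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            nxt = (cur[0] + d[0], cur[1] + d[1])
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in walls and nxt not in parent):
                parent[nxt] = cur
                frontier.append(nxt)
    return None  # no route to the goal exists
```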
Your example of a pointer to a goal is a good edge case for our way of defining/categorizing goals. Our definitions don’t capture it properly, but we can extend them to include it: e.g. if the goal that eventually ends up being pursued is an outcome, we could define the observing agent as knowing that outcome in advance. Alternatively, we could wait until the agent has uncovered its consequentialist goal but hasn’t yet completed it. In both cases we can treat the agent as consequentialist. Either way it still has the property that leads to danger: the capacity to overcome large classes of obstacles and still get to its destination.
I’m not sure what you mean by “goal objects robust to capabilities not present early in training”. If you mean “goal objects that specify shutdownable behavior while also specifying useful outcomes, and are robust to capability increases”, then I agree that such objects exist in principle. But I could argue that they aren’t very natural, if this is a crux and I’m understanding you correctly?
This is indeed a crux; maybe it’s still worth talking about.