I’ll be very boring and predictable and make the usual model splintering/value extrapolation point here :-)
Namely that I don’t think we can talk sensibly about an AI having “beneficial goal-directedness” without situational awareness. For instance, it’s of little use to have an AI with the goal of “ensuring human flourishing” if it doesn’t understand the meaning of flourishing or human. And, without situational awareness, it can’t understand either; at best we could have some proxy or pointer towards these key concepts.
The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
So I’d be focusing on “do the goals stay safe as the AI gains situational awareness?”, rather than “are the goals safe before the AI gains situational awareness?”
Agreed.

> Namely that I don’t think we can talk sensibly about an AI having “beneficial goal-directedness” without situational awareness. For instance, it’s of little use to have an AI with the goal of “ensuring human flourishing” if it doesn’t understand the meaning of flourishing or human. And, without situational awareness, it can’t understand either; at best we could have some proxy or pointer towards these key concepts.
Another way of saying this is that inner alignment is more important than outer alignment.
> The key challenge seems to be to get the AI to generalise properly; even initially poor goals can work if generalised well. For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
I’ve also called this “generalise properly” part methodological alignment in this comment. And I conjectured that outer alignment follows automatically from methodological alignment and inner alignment, so we shouldn’t even need to care about it separately. That also seems to be what you are saying here.
> Another way of saying this is that inner alignment is more important than outer alignment.
Interesting. My intuition is that inner alignment has nothing to do with this problem. It seems that different people view the inner vs. outer alignment distinction in different ways.
> For instance, a money-maximising trade-bot AI could be perfectly safe if it notices that money, in its initial setting, is just a proxy for humans being able to satisfy their preferences.
There is a critical step missing here: the point at which the trade-bot makes a “choice” between maximising money and satisfying human preferences. At that point, I see two possibilities:

1. Modelling the trade-bot as an agent does not break down: the trade-bot has an objective which it tries to optimize, plausibly maximising money (since that is what it was trained for) and probably not satisfying human preferences (unless it had some reason to have that as an objective). A comforting possibility is that it is corrigibly aligned, i.e. that it optimizes for a pointer to its best understanding of its developers. Do you think this is likely? If so, why?

2. An agentic description of the trade-bot is inadequate: the trade-bot is an adaptation-executer, it follows shards of value, or something along those lines. In that case, what kind of computation is it performing that steers it towards satisfying human preferences?
> So I’d be focusing on “do the goals stay safe as the AI gains situational awareness?”, rather than “are the goals safe before the AI gains situational awareness?”
This is a false dichotomy. If we could assume that, once the AI gains situational awareness, it will optimize for its developers’ goals, alignment would already be solved. And making the goals safe before situational awareness is not that hard: at that point, the AI is not yet capable enough to pose an X-risk. (A discussion of X-risk brought about by situationally unaware AIs could be interesting, such as a Christiano-style failure story, but Soares’s model is not about that case, since it assumes an autonomous ASI.)