Research on power-seeking tendencies is more useful than nothing, but consider the plausibility of the following retrospective: “AI alignment might not have been solved if not for TurnTrout’s deconfusion of power-seeking tendencies.” Doesn’t sound like something that would actually happen in reality, does it?
EDIT: Note this kind of visualization is not always valid—it’s easy to diminish a research approach by reframing it—but in this case I think it’s fine and makes my point.
I think it’s plausible that the alignment community could figure out how to build systems without power-seeking incentives, or with power-seeking tendencies limited to some safe set of options, by building on your formalization; so the retrospective doesn’t seem far-fetched to me.
In addition, this work is useful for convincing ML people that alignment is hard, which helps to lay the groundwork for coordinating the AI community not to build AGI. I’ve often pointed researchers at DM (especially RL people) to your power-seeking paper when trying to explain convergent instrumental goals (a formal NeurIPS paper makes a much better reference for that audience than Basic AI Drives).