Sorry, what? I thought the fear was that we don’t know how to make helpful AI at all. (And that people who think they’re being helped by seductively helpful-sounding LLM assistants are being misled by surface appearances; the shoggoth underneath has its own desires that we won’t like when it’s powerful enough to pursue them autonomously.) In contrast, this almost makes it sound like you think it is plausible to align AI to its user’s intent, but that this would be bad if the users aren’t one of “us”—you know, the good alignment researchers who want to use AI to take over the universe, totally unlike those evil capabilities researchers who want to use AI to produce economically valuable goods and services.
My steelman of this (though to be clear I think your comment makes good points):
There is a large difference between a system being helpful and a system being aligned. Ultimately, AI existential risk is a coordination problem: I expect catastrophic consequences because a bunch of people want to build AGI without making it safe. Therefore, making technologies that, in a naive and short-term sense, just help AGI developers build whatever they want to build will have bad consequences. If I trusted everyone to use their intelligence only for good things, we wouldn’t have anthropogenic existential risk on our hands.
Some of those technologies might also end up useful for getting the AI to be more properly aligned, or for work that reduces the risk of AI catastrophe in some other way, though my current sense is that that kind of work is pretty different and doesn’t benefit remotely as much from generically locally-helpful AI.
In general, I feel pretty sad about conflating “alignment” with “short-term intent alignment”. The two problems are related but have crucial differences, I don’t think the latter generalizes that well to the former (for all the usual sycophancy/treacherous-turn reasons), and indeed progress on the latter IMO mostly makes the world marginally worse, because the thing it is most likely to be used for is developing existentially dangerous AI systems faster.
Edit: Another really important dimension to model here is not just the effect of this kind of research on what individual researchers will do, but its effect on what the market wants to invest in. My standard story of doom is centrally rooted in there being very strong short-term individual economic incentives to build more capable AGI, enabling people to make billions to trillions of dollars, while the downside risk is a distributed negative externality that is not at all priced into the costs of AI development. Developing applications of AI that make a lot of money without accounting for the negative extinction externalities can therefore be really quite bad for the world.