I’m not sure I understand your weighting argument. Some capabilities are “convergently instrumental” because they are useful for achieving a lot of purposes. I agree that AI construction techniques will target obtaining such capabilities, precisely because they are useful.
But if you gain a certain convergently instrumental capability, it then automatically allows you to do a lot of random stuff. That’s what the words mean. And most of that random stuff will not be safe.
I don’t get what the difference is between “the AI will get convergently instrumental capabilities, and we’ll point those at AI alignment” and “the AI will get very powerful and we’ll just ask it to be aligned”, other than a bit of technical jargon.
As soon as the AI gets sufficiently powerful [convergently instrumental capabilities], it is already dangerous. You need to point it precisely at a safe target in outcomes-space or you’re in trouble. Just vaguely pointing it “towards AI alignment” is almost certainly not enough; specifying that outcome safely is the problem we started with.
(And you still have the problem that while it’s working on that someone else can point it at something much worse.)
I don’t see much of a disagreement here? I’m just saying that the way in which random things are accelerated is largely via convergent stuff; and therefore there’s maybe some way that one can “repurpose” all that convergent stuff towards some aligned goal. I agree that this idea is dubious / doesn’t obviously work. As a contrast, one could imagine instead a world in which new capabilities are sort of very idiosyncratic to the particular goal they serve, and when you get an agent with some goals, all its cognitive machinery is idiosyncratic and hard to parse out, and it would be totally infeasible to extract the useful cognitive machinery and repurpose it.