> The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones,
> does not have such a large effect on the scientific problem.
Another major difference is that we’re forced to solve the problem using only analogies (and reasoning), as opposed to also getting to study the actual objects in question. And there’s a big boundary between AIs that would lose vs. win a fight with humanity, which creates big disanalogies, both between the AIs themselves and in how alignment strategies apply to them, before and after that boundary. (Presumably there’s major disagreement about how important these disanalogies are / how difficult they are to circumvent with other analogies.)
> AI is accelerating the timetable for both alignment and capabilities
AI accelerates the timetable for things we know how to point AI at (which shades into convergently instrumental things that we point it at just by training an AI to do anything). We know how to point AI at things that can be triangulated with clear metrics, like “how well does the sub-AI you programmed perform at such-and-such tasks”. We know much less about how to point AI at alignment, or at more general things like “do good human-legible science / philosophy” that potentially tackle the core hard parts of alignment. So I don’t buy that these rates are very meaningfully coupled. Clearly there’s some coupling: you could, say, make it 10x easier for people working on alignment to find interesting papers, but that seems like a fairly modest and bounded help; it’s not going to make alignment research 100x faster / better, because the important stuff is still routed through human researchers. How do you view AI as accelerating the timetable for alignment?
> AI accelerates the timetable for things we know how to point AI at
It also accelerates the timetable for random things that we don’t expect and don’t even try to point the AI at but that just happen to be easier for incrementally-better AI to do.
Since the space of stuff that helps alignment seems much smaller than the space of dangerous things, you’d expect that most things the AI randomly accelerates, without us pointing it at them, will be dangerous.
Not exactly, because it’s not exactly “random things”; it’s heavily weighted toward convergently instrumental things. If you could repurpose ~all the convergently instrumental stuff that a randomly targeted AI can do towards AI alignment, as I think Christiano is trying to do, then you’d have a pretty strong coupling. Whether you can do that, though, is an open question; whether that would be sufficient is another open question.
I’m not sure I understand your weighting argument. Some capabilities are “convergently instrumental” because they are useful for achieving a lot of purposes. I agree that AI construction techniques will target obtaining such capabilities, precisely because they are useful.
But if you gain a certain convergently instrumental capability, it then automatically allows you to do a lot of random stuff. That’s what the words mean. And most of that random stuff will not be safe.
I don’t get what the difference is between “the AI will get convergently instrumental capabilities, and we’ll point those at AI alignment” and “the AI will get very powerful and we’ll just ask it to be aligned”, other than a bit of technical jargon.
As soon as the AI gets sufficiently powerful [i.e., convergently instrumental] capabilities, it is already dangerous. You need to point it precisely at a safe target in outcomes-space or you’re in trouble. Just vaguely pointing it “towards AI alignment” is almost certainly not enough; specifying that outcome safely is the problem we started with.
(And you still have the problem that while it’s working on that someone else can point it at something much worse.)
I don’t see much of a disagreement here? I’m just saying that the way in which random things get accelerated is largely via convergent stuff, and therefore there’s maybe some way to “repurpose” all that convergent stuff towards some aligned goal. I agree that this idea is dubious / doesn’t obviously work. As a contrast, one could instead imagine a world in which new capabilities are very idiosyncratic to the particular goal they serve, so that when you get an agent with some goals, all its cognitive machinery is idiosyncratic and hard to parse out, and it would be totally infeasible to extract the useful cognitive machinery and repurpose it.