it sounded like the claim “value/risk of AI is mainly in the long tail” was something you found plausible/likely, but you also thought we could eliminate most of the risk by fixing problems as they come up.
So I don’t think that we can eliminate most of the risk from AI systems making dumb mistakes; I do in fact see that as quite likely. And plausibly such mistakes are even bad enough to cost lives.
What I think we can eliminate is the risk of an AI very competently and intelligently optimizing against us, causing an x-risk; that part doesn’t seem nearly as analogous to “long tail” problems.
I could break this down into a few subclaims:
1. It is very hard to cause existential catastrophes via “mistakes” or “random exploration”, such that we can ignore this aspect of risk. Therefore, we only have to consider cases where an AI system is “trying” to cause an existential catastrophe.
2. To cause an existential catastrophe, an AI system will have to be very good at generalization (at the very least, there will not have been an existential catastrophe in the past that it can learn from).
3. An AI system that is good at generalization would be good at the long tail (or at the very least, it would learn as it experienced the long tail).
A counterargument would be that your AI system could be great at generalizing at capabilities / impacting the world, but not great at generalizing alignment / motivation / translation of human objectives into AI objectives.
I think this is plausible, but I find the “value of long tail” argument much less compelling when talking about alignment / motivation, conditioned on having good generalization in capabilities. I wouldn’t agree with the “value of long tail” argument as applied to humans: for many tasks, it seems like you can explain to a human what the task is, and they are quickly able to do it without too many mistakes, or at least they know when they can’t do the task without too high a risk of error; it seems like this comes from our general reasoning + knowledge of the world, both of which the AI system presumably also has.
A counterargument would be that your AI system could be great at generalizing at capabilities / impacting the world, but not great at generalizing alignment / motivation / translation of human objectives into AI objectives.
I think this is roughly the right counterargument, modulo the distinction between “the AI has a good model of what humans want” and “the AI is programmed to actually do what humans want”. (I don’t think that distinction is key to this discussion, but might be for some people who come along and read this.)
I do think there’s one really strong argument that generalizing alignment / motivation / translation of human objectives is harder than generalizing capabilities: what happens in the limit of infinite data and compute? In that limit, an AI can get best-possible predictive power by Bayesian reasoning on the entire microscopic state of the universe. That’s what best-possible generalizing capabilities look like. The argument in Alignment as Translation was that alignment / motivation / translation of human objectives is still hard, even in that limit, and the way-in-which-it-is-hard involves a long tail of mistranslated corner cases. In other words: generalizable predictive power is very clearly not a sufficient condition for generalizable alignment.
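As a rough formalization of what best-possible predictive power looks like in that limit (my notation, not something spelled out in the post): the Bayes-optimal predictor just marginalizes over complete world-states,

\[
P(x_{\text{future}} \mid x_{\text{obs}}) \;=\; \sum_{s} P(x_{\text{future}} \mid s)\, P(s \mid x_{\text{obs}}),
\]

where s ranges over microscopic states of the universe consistent with everything observed so far. Nothing in that expression mentions what humans want; pointing such a predictor at human values is exactly the translation step, and that step is where the long tail of mistranslated corner cases lives.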
I’d say there’s a strong chance that generalizable predictive power will be enough for generalizable alignment in practice, with realistic data/compute, but we don’t even have a decent model to predict when it will fail—other than that it will fail, once data and compute pass some unknown threshold. Such a model would presumably involve an epistemic analogue of instrumental convergence: it would tell us when two systems with different architectures are likely to converge on similar abstractions in order to model the same world.
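One very rough way such a model might be operationalized (my speculation, with representation similarity standing in for "converging on similar abstractions"): compare the internal representations that two differently-architected systems learn from the same world, e.g. with linear CKA.

```python
# Hypothetical sketch, not from the discussion: check whether two models with
# different architectures end up with similar internal abstractions, using
# linear CKA (centered kernel alignment) as a stand-in similarity measure.
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape (n_examples, n_features);
    values near 1.0 mean the two representations are linearly equivalent."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return numerator / denominator

rng = np.random.default_rng(0)
# Placeholders: in a real experiment these would be activations of two
# different architectures on the same batch of inputs.
acts_a = rng.normal(size=(512, 256))
acts_b = rng.normal(size=(512, 128))
print(f"CKA between the two representations: {linear_cka(acts_a, acts_b):.3f}")
```

This doesn't by itself predict when convergence breaks down, but it's the sort of measurement such a model would have to make predictions about.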
Basically agree with all of this.

I do think there’s one really strong argument that generalizing alignment / motivation / translation of human objectives is harder than generalizing capabilities: what happens in the limit of infinite data and compute?
Strongly agree. I have two arguments for work on AI safety that I really do buy and find motivating; this is one of them. (The other one is the one presented in Human Compatible.)
But with both of these arguments, I see them as establishing that we can’t be confident given our current knowledge that alignment happens by default; therefore given the high stakes we should work on it. This is different from making a prediction that things will probably go badly.
(I don’t think this is actually disagreeing with you anywhere.)
other than that it will fail, once data and compute pass some unknown threshold.
I want to flag a note of confusion here—it feels like it should be possible for a mostly-aligned system to become more aligned, such that it never fails at any threshold (along the lines of there being a broad basin of corrigibility). But I haven’t really made this perspective play nicely with the perspective of alignment as translation.
This is different from making a prediction that things will probably go badly.
Thinking about it, I really should have been more explicit about this before: I do think there’s a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).
Related: one place where I think I diverge from many/most people in the area is that I’m playing to win, not just to avoid losing. I see alignment not just as important for avoiding doom, but as plausibly the hardest part of unlocking most of the economic value of AGI.
My goal for AGI is to create tons of value and to (very very reliably) avoid catastrophic loss. I see alignment-in-the-sense-of-translation as the main bottleneck to achieving both of those simultaneously; I expect that both the value and the risk are dominated by exponentially large numbers of corner-cases.
I want to flag a note of confusion here—it feels like it should be possible for a mostly-aligned system to become more aligned, such that it never fails at any threshold (along the lines of there being a broad basin of corrigibility). But I haven’t really made this perspective play nicely with the perspective of alignment as translation.
This was exactly why I mentioned the distinction between “the AI has a good model of what humans want” and “the AI is programmed to actually do what humans want”. I haven’t been able to articulate it very well, but here are a few things which feel like they’re pointing to the same idea:
- If our AI is learning what humans value by predicting some data, then it won’t matter how clever the AI is if the data-collection process is not robustly pointed at human values (a toy sketch below illustrates this).
- More generally, if the source-of-truth for human values does not correctly and robustly point to human values, no amount of clever AI architecture can overcome that problem (though note that the source-of-truth may include e.g. information about human values built into a prior).
- Abram’s stuff on stable pointers to values
- In translation terms, at some point we have to translate some directive for the AI, something of the form “do X”. X may include some mechanism for self-correction, but if that initial mechanism for self-correction is ever insufficient, there will not be any way to fix it later (other than starting over with a whole new AI).
Continuing with the translation analogy: suppose we could translate the directive “don’t take these instructions literally, use them as evidence to figure out what I want and then do that”—and of course other instructions would include further information about how to figure out what you want. That’s the sort of thing which would potentially give a broad(er) basin of alignment if we’re looking at the problem through a translation lens.
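The toy sketch referenced above (my own construction, not anything from the post): a learner fit to data from a biased collection process recovers the proxy that generated the labels, and adding data or capability doesn't bring it any closer to the values the data was supposed to point at.

```python
# Toy illustration (hypothetical): if the data-collection process ignores part
# of what humans care about, the best-possible fit to that data inherits the
# same blind spot, however much data an arbitrarily capable learner gets.
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # What humans actually care about (hidden from the learner).
    return x[:, 0] + x[:, 1]

def collected_label(x):
    # Data-collection process: the labelling pipeline only sees the first
    # feature, so the second one never makes it into the training signal.
    return x[:, 0]

for n in (100, 10_000, 1_000_000):  # "more data"
    X = rng.normal(size=(n, 2))
    y = collected_label(X)
    # Exact least-squares fit, standing in for an arbitrarily clever learner.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    X_test = rng.normal(size=(10_000, 2))
    gap = np.mean((X_test @ w - true_value(X_test)) ** 2)
    print(f"n={n:>9,}  mean squared gap vs. actual human values: {gap:.3f}")
# The gap stays around 1.0 at every scale: the learner faithfully recovers the
# proxy, because the pointer to "human values" was broken upstream of it.
```

Nothing about the learner changes across the three rows; only a better pointer (a better data-collection process, or a prior that encodes the missing information) would close the gap.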
I do think there’s a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).
Ah, got it. In that case I think we broadly agree.
one place where I think I diverge from many/most people in the area is that I’m playing to win, not just to avoid losing.
Yeah, this is a difference. I don’t think it’s particularly decision-relevant for me personally given the problems we actually face, but certainly it makes a difference in other hypotheticals (e.g. in the translation post I suggested testing + reversibility as a solution; that’s much more about not losing than it is about winning).
Continuing with the translation analogy: suppose we could translate the directive “don’t take these instructions literally, use them as evidence to figure out what I want and then do that”—and of course other instructions would include further information about how to figure out what you want. That’s the sort of thing which would potentially give a broad(er) basin of alignment if we’re looking at the problem through a translation lens.
Yeah, I think that’s right. There’s also the directive “assist me” / “help me get what I want”. It feels like these should be easier to translate (though I can’t say what makes them different from all the other cases where I expect translation to be hard).