And for the car, real failures in 99%-safe-per-trip cars don’t actually look like six subsystems failing independently. They look like “we classified this woman holding a bicycle as a woman riding a bicycle.” The number of possible failures is large, but that doesn’t make them likely, just numerous.
So, I sort of agree in general that it’s better to solve long tail problems with solutions that try to generalize or exploit dimensionality, but I don’t agree so much that I think this means you can’t make a superhuman driver just by fixing problems “one at a time” in the way we’ve currently lumped together failure modes into discrete problems.
And for the car, real failures in 99%-safe-per-trip cars don’t actually look like six subsystems failing independently. They look like “we classified this woman holding a bicycle as a woman riding a bicycle.”
Yeah, at this point in time the engineering of our self-driving cars is such complete shit that a single point of failure in the software is sufficient to cause a problem. I would say that self-driving car engineers who run into problems like this haven’t even really started working on the long tail yet. Humans at least have backup heuristics like “don’t hit things” which do not depend on highly-reliable object classification.
I don’t agree so much that I think this means you can’t make a superhuman driver just by fixing problems “one at a time” in the way we’ve currently lumped together failure modes into discrete problems.
How about this framing: when people build highly-reliable complex software, how often do they do it by starting with a buggy piece of software and then fixing problems as they come up until problems stop coming up? My guess would be “basically never”; at a bare minimum, things like 100% test coverage are going to be involved (not just tackling problems as they come up), and often more complicated things like formal specifications and proofs of correctness, stress testing, white-box tests specifically designed to break the system, etc.
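(To make the “white-box tests specifically designed to break the system” point concrete, here is a minimal sketch in the property-based style. The hypothesis library and the toy merge function are my illustrative choices, not anything named in this thread; the point is just that reliable-software shops generate adversarial inputs up front and check an invariant, rather than waiting for bugs to surface.)

```python
# Minimal property-based test sketch (illustrative; assumes the `hypothesis` library).
from hypothesis import given, strategies as st

def merge_sorted(a, b):
    """Toy component under test: merge two already-sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_merge_matches_spec(xs, ys):
    # Invariant: merging two sorted lists yields the sorted multiset union.
    assert merge_sorted(sorted(xs), sorted(ys)) == sorted(xs + ys)
```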
How about this framing: when people build highly-reliable complex software, how often do they do it by starting with a buggy piece of software and then fixing problems as they come up until problems stop coming up?
Funnily enough, I would use this same example to illustrate the opposite point: in fact we build extremely complex software by writing buggy software and then fixing problems as they come up (this describes ~all of the big tech giants, I believe), which suggests that at least within the domains those companies work in, you didn’t have to solve literally all of the problems to capture a lot of economic value.

Though maybe this reduces to our previous disagreement about whether anything that does not have guarantees can contribute to AI safety.
in fact we build extremely complex software by writing buggy software and then fixing problems as they come up (this describes ~all of the big tech giants, I believe)
Totally agree, and anyone who’s worked for one of those tech giants will tell you that their software is absolutely packed with bugs. Those are indeed domains where solving the long tail of problems is not necessary for unlocking tons of value. That software does not need to be highly reliable, and indeed it is not highly reliable. If even just one in a hundred bugs at Google or Facebook or Microsoft killed somebody, those companies would have been sued out of business for gross negligence years ago.
I don’t think this reduces to a disagreement about the necessity of guarantees; I think it reduces to a disagreement about whether the value of AGI (and the risk of AGI) resides primarily in the long tail. I wrote the OP in large part because, in our most recent discussion, it sounded like the claim “value/risk of AI is mainly in the long tail” was something you found plausible/likely, but you also thought we could eliminate most of the risk by fixing problems as they come up. The point of the OP is that these are mutually exclusive: if the value/risk is in the tail, then fixing problems as they come up cannot handle it.
(Side note: I don’t think I communicated very well in that thing about whether “guarantees are necessary for anything to be useful for AI safety”; that’s not really how I view it. Value/risk in the long tail is one of the main generators of that view. It’s not that a guarantee is necessary, more that the sort of thing which actually handles the long tail is necessary—guarantees are one way of handling a long tail, but they’re certainly not the only way. I do still stand by the analogy I used there: if something wouldn’t be considered good enough for bridge safety engineering on a completely novel bridge design, then it shouldn’t be considered good enough for AI safety engineering.)
My claim there was that in a world where alignment is about translation you could just do testing / reversibility etc. This does somewhat persuade me that that claim was wrong.
Nonetheless, I don’t think the power law dynamic really matches my model of the situation. I was more imagining a model with some sort of threshold effect:
1. Economic value is often tied to high levels of reliability, perhaps because:
1a. The risks of failure are unacceptable (e.g. self-driving cars)
1b. Small failure rates still lead to many failures at scale (e.g. imagine if cars broke down once every 10K miles—this is a low failure rate, but many people would have to deal with this multiple times a year)
1c. Other people would like to build on top of your product, and can’t deal with an abstraction that perpetually leaks, because that vastly increases the complexity of using the product. (True of nearly all software, even end user software—if Google Docs had a 0.1% chance of failing to save my work, I would not use it.)
2. All of these lead to ~threshold effects: once the failure rate drops below some threshold t, it becomes economically valuable and people start producing it; this leads to more investment that reduces the failure rate further, making it even more valuable. Notably, these are not power laws. (In practice, they aren’t sharp thresholds—maybe at a failure rate of 0.1%, you get 1% of the potential market, at 0.01%, you get 50%, and at 0.001% you get 99%; see the toy sketch after this list.)
3. So when I agree with “the value is in the long tail”, I mostly mean “the threshold t is very very low; the amount of effort it takes to get there is typically higher than people expect”. But the threshold t still varies across domains, and it’s still possible for testing-style approaches to reach the threshold; it depends on the particular domain at hand.
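(A toy sketch of that soft-threshold picture, referenced in point 2 above. The logistic shape and the parameter values are my own illustration, chosen only to reproduce the example numbers given there.)

```python
# Toy soft-threshold model: market share captured as a function of failure rate.
# Shape and parameters are illustrative, matching the example numbers above.
def market_share(failure_rate, threshold=1e-4, sharpness=2.0):
    return 1.0 / (1.0 + (failure_rate / threshold) ** sharpness)

for f in (1e-3, 1e-4, 1e-5):  # failure rates of 0.1%, 0.01%, 0.001%
    print(f"failure rate {f:.3%} -> share ~ {market_share(f):.0%}")
# failure rate 0.100% -> share ~ 1%
# failure rate 0.010% -> share ~ 50%
# failure rate 0.001% -> share ~ 99%
```

A power law, by contrast, would keep growing smoothly as the failure rate falls, with no region where the value suddenly switches on.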
I think this argument applies both to self-driving cars and traditional software (a la big tech companies), which is why I still used big tech companies as an example where value is in the tail.

Agree, and you’ve articulated this much better than I had in my head. Thank you.
it sounded like the claim “value/risk of AI is mainly in the long tail” was something you found plausible/likely, but you also thought we could eliminate most of the risk by fixing problems as they come up.
So I don’t think that we can eliminate most of the risk from AI systems making dumb mistakes; I do in fact see that as quite likely. And plausibly such mistakes are even bad enough to cost lives.
What I think we can eliminate is the risk of an AI very competently and intelligently optimizing against us, causing an x-risk; that part doesn’t seem nearly as analogous to “long tail” problems.
I could break this down into a few subclaims:
1. It is very hard to cause existential catastrophes via “mistakes” or “random exploration”, such that we can ignore this aspect of risk. Therefore, we only have to consider cases where an AI system is “trying” to cause an existential catastrophe.
2. To cause an existential catastrophe, an AI system will have to be very good at generalization (at the very least, there will not have been an existential catastrophe in the past that it can learn from).
3. An AI system that is good at generalization would be good at the long tail (or at the very least, it would learn as it experienced the long tail).
A counterargument would be that your AI system could be great at generalizing at capabilities / impacting the world, but not great at generalizing alignment / motivation / translation of human objectives into AI objectives.
I think this is plausible, but I find the “value of long tail” argument much less compelling when talking about alignment / motivation, conditioned on having good generalization in capabilities. I wouldn’t agree with the “value of long tail” argument as applied to humans: for many tasks, it seems like you can explain to a human what the task is, and they are quickly able to do it without too many mistakes, or at least they know when they can’t do the task without too high a risk of error; it seems like this comes from our general reasoning + knowledge of the world, both of which the AI system presumably also has.
A counterargument would be that your AI system could be great at generalizing at capabilities / impacting the world, but not great at generalizing alignment / motivation / translation of human objectives into AI objectives.
I think this is roughly the right counterargument, modulo the distinction between “the AI has a good model of what humans want” and “the AI is programmed to actually do what humans want”. (I don’t think that distinction is key to this discussion, but might be for some people who come along and read this.)
I do think there’s one really strong argument that generalizing alignment / motivation / translation of human objectives is harder than generalizing capabilities: what happens in the limit of infinite data and compute? In that limit, an AI can get best-possible predictive power by Bayesian reasoning on the entire microscopic state of the universe. That’s what best-possible generalizing capabilities look like. The argument in Alignment as Translation was that alignment / motivation / translation of human objectives is still hard, even in that limit, and the way-in-which-it-is-hard involves a long tail of mistranslated corner cases. In other words: generalizable predictive power is very clearly not a sufficient condition for generalizable alignment.
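(One way to write out what “best-possible predictive power” means in that limit; this formalization is my addition, not something from the comment above.)

$$P(o_{t+1} \mid o_{1:t}) \;=\; \sum_{s} P(o_{t+1} \mid s)\, P(s \mid o_{1:t}), \qquad P(s \mid o_{1:t}) \;\propto\; P(o_{1:t} \mid s)\, P(s),$$

where $s$ ranges over complete microscopic world-states and $o_{1:t}$ are the observations so far. Nothing in this expression mentions human values; whatever points the system at what we want has to come from somewhere else, which is the translation problem.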
I’d say there’s a strong chance that generalizable predictive power will be enough for generalizable alignment in practice, with realistic data/compute, but we don’t even have a decent model to predict when it will fail—other than that it will fail, once data and compute pass some unknown threshold. Such a model would presumably involve an epistemic analogue of instrumental convergence: it would tell us when two systems with different architectures are likely to converge on similar abstractions in order to model the same world.

Basically agree with all of this.
I do think there’s one really strong argument that generalizing alignment / motivation / translation of human objectives is harder than generalizing capabilities: what happens in the limit of infinite data and compute?
Strongly agree. I have two arguments for work on AI safety that I really do buy and find motivating; this is one of them. (The other one is the one presented in Human Compatible.)
But with both of these arguments, I see them as establishing that we can’t be confident given our current knowledge that alignment happens by default; therefore given the high stakes we should work on it. This is different from making a prediction that things will probably go badly.
(I don’t think this is actually disagreeing with you anywhere.)
other than that it will fail, once data and compute pass some unknown threshold.
I want to flag a note of confusion here—it feels like it should be possible for a mostly-aligned system to become more aligned, such that it never fails at any threshold (along the lines of there being a broad basin of corrigibility). But I haven’t really made this perspective play nicely with the perspective of alignment as translation.
This is different from making a prediction that things will probably go badly.
Thinking about it, I really should have been more explicit about this before: I do think there’s a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).
Related: one place where I think I diverge from many/most people in the area is that I’m playing to win, not just to avoid losing. I see alignment not just as important for avoiding doom, but as plausibly the hardest part of unlocking most of the economic value of AGI.
My goal for AGI is to create tons of value and to (very very reliably) avoid catastrophic loss. I see alignment-in-the-sense-of-translation as the main bottleneck to achieving both of those simultaneously; I expect that both the value and the risk are dominated by exponentially large numbers of corner-cases.
I want to flag a note of confusion here—it feels like it should be possible for a mostly-aligned system to become more aligned, such that it never fails at any threshold (along the lines of there being a broad basin of corrigibility). But I haven’t really made this perspective play nicely with the perspective of alignment as translation.
This was exactly why I mentioned the distinction between “the AI has a good model of what humans want” and “the AI is programmed to actually do what humans want”. I haven’t been able to articulate it very well, but here are a few things which feel like they’re pointing to the same idea:
If our AI is learning what humans value by predicting some data, then it won’t matter how clever the AI is if the data-collection process is not robustly pointed at human values.
More generally, if the source-of-truth for human values does not correctly and robustly point to human values, no amount of clever AI architecture can overcome that problem (though note that the source-of-truth may include e.g. information about human values built into a prior)
Abram’s stuff on stable pointers to values
In translation terms, at some point we have to translate some directive for the AI, something of the form “do X”. X may include some mechanism for self-correction, but if that initial mechanism for self-correction is ever insufficient, there will not be any way to fix it later (other than starting over with a whole new AI).
Continuing with the translation analogy: suppose we could translate the directive “don’t take these instructions literally, use them as evidence to figure out what I want and then do that”—and of course other instructions would include further information about how to figure out what you want. That’s the sort of thing which would potentially give a broad(er) basin of alignment if we’re looking at the problem through a translation lens.
I do think there’s a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).
Ah, got it. In that case I think we broadly agree.
one place where I think I diverge from many/most people in the area is that I’m playing to win, not just to avoid losing.
Yeah, this is a difference. I don’t think it’s particularly decision-relevant for me personally given the problems we actually face, but certainly it makes a difference in other hypotheticals (e.g. in the translation post I suggested testing + reversibility as a solution; that’s much more about not losing than it is about winning).
Continuing with the translation analogy: suppose we could translate the directive “don’t take these instructions literally, use them as evidence to figure out what I want and then do that”—and of course other instructions would include further information about how to figure out what you want. That’s the sort of thing which would potentially give a broad(er) basin of alignment if we’re looking at the problem through a translation lens.
Yeah, I think that’s right. There’s also the directive “assist me” / “help me get what I want”. It feels like these should be easier to translate (though I can’t say what makes them different from all the other cases where I expect translation to be hard).
What’s the corresponding story here for trading bots? Are they designed in a sufficiently high-assurance way that new tail problems don’t come up, or do they not operate in the tails?
Great question. Let’s talk about Knight Capital.

Ten years ago, Knight Capital was the largest high-frequency trader in US equities. On August 1, 2012, somebody deployed a bug. Knight’s testing platform included a component which generated random orders and sent them to a simulated market; somebody accidentally hooked that up to the real market. It’s exactly the sort of error testing won’t catch, because it was a change outside of the things-which-are-tested; it was partly an error in deployment, and partly code which did not handle partial deployment. The problem was fixed about 45 minutes later. That was the end of Knight Capital.
So yes, trading bots definitely operate in the tails.
When the Knight bug happened, I was interning at the largest high-frequency trading company in US options. Even before that, the company was more religious about thorough testing than any other I’ve worked at. Everybody knew that one bug could end us; Knight was just a reminder (specifically, a reminder to handle partial deployment properly).
Ah, you beat me to it :)