This post is difficult for me to understand because of the lack of quantitative forecasts. I agree that “the technical problems are similar either way”, but iterating gives you the opportunity to solve some problems more easily, and the assumption that “the only problems that matter are the ones iteration can’t solve” seems unjustified. There are a lot of problems you’ll catch if you have 50 years to iterate compared to only 6 months, and both of those could count as “slow takeoff” depending on your definition of “slow”.
To make this more explicit, suppose you had to give some estimates on conditional probabilities of the form P(major alignment failure|takeoff duration=T) for a bunch of different values of T, say 6 months, 2 years, 10 years, et cetera. Are you saying that the probability is roughly constant and doesn’t depend on T much, or are you saying there’s some uniform (independent of T) lower bound p≫0 on these probabilities?
I am saying that the probability is roughly constant and doesn’t depend on T much. The vast majority of problems which will be noticed during iteration will be noticed in 6 months; another 49.5 years might catch a few more, but those few will be dwarfed by the problems which will not be noticed during iteration approximately-regardless of how much time is spent.
Somebody who did not notice the air conditioner problem in 6 months is unlikely to notice it in the next 49.5 years. The examples of charities with mediocre impact and low replication rates in medicine have both been ongoing for at least a century. Time is just not the main relevant variable here.
I understand. I guess my problem is that even in the air conditioner example I’d expect a substantially higher probability that the problem is noticed in 50 years compared to 6 months, if nothing else because you’d get visitors to your house, they’d happen to go to parts of it that are away from the air conditioner, notice it’s unnaturally hot, etc. Eventually someone might guess that the air conditioner is the problem.
If you start with a Jeffreys prior over the rate at which problems are noticed, then the fact that a problem hasn’t been noticed in 6 months only really tells you that it probably takes O(1 year) or more to notice the problem in the typical situation. Getting to the conclusion “if a problem isn’t noticed in 6 months, it won’t be noticed in 50 years” seems to require some assumption like “almost all problems are either very easy or very hard”, or “the prior on the problems being noticed in a given day is like Beta(1/n,1/n) for some big n”.
I don’t think you can justify this by pointing to problems that are very hard to notice, since those will exist no matter what the prior distribution is. The question is about how much of the probability mass they make up, and here you seem to have some inside view into the situation which you don’t communicate in the blog post. Could you elaborate on that?
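As a rough numerical illustration of how much rides on that choice of prior (this sketch is my own, not something from the thread): put a Beta prior on the per-day probability q that a given problem gets noticed, condition on six months of nobody noticing it, and ask how likely it is to be noticed in the remaining 49.5 years. Treating the Jeffreys case as Beta(1/2, 1/2) on q, and taking n = 1000 for the Beta(1/n, 1/n) case, are assumptions made here for concreteness.

```python
# Posterior-predictive calculation under a Beta(a, b) prior on the per-day notice
# probability q. After d_observed days with no notice, the posterior is
# Beta(a, b + d_observed), and the chance of surviving d_future more days unnoticed
# is B(a, b + d_observed + d_future) / B(a, b + d_observed).
import numpy as np
from scipy.special import betaln

def p_noticed_later(a, b, d_observed, d_future):
    """P(noticed within d_future more days | not noticed in the first d_observed days)."""
    log_survive = betaln(a, b + d_observed + d_future) - betaln(a, b + d_observed)
    return 1.0 - np.exp(log_survive)

d_obs = 180                # ~6 months with nobody noticing the problem
d_fut = int(49.5 * 365)    # the remaining 49.5 years of a 50-year window

print(p_noticed_later(0.5, 0.5, d_obs, d_fut))    # Jeffreys-style Beta(1/2, 1/2): ~0.90
print(p_noticed_later(1e-3, 1e-3, d_obs, d_fut))  # Beta(1/n, 1/n) with n = 1000: ~0.005
```

Under the diffuse prior the extra 49.5 years catch most of the remaining problems; under the strongly bimodal prior they catch almost none, which is the disagreement in a nutshell.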
I indeed expect that the vast majority of problems will either be noticed within minutes or not at all.
One model: people either have the right mental toolkit to ask the relevant questions, or they don’t. Either they know to look for balancing flows in the air conditioner case, or they don’t. Either they know to ask about replication rates in the medical example, or they don’t. Either they know to ask about impact measures for charities, or they don’t. Young people might pick up these skills over time, but most people stop adding to their mental toolkit at a meaningful rate once they’re out of school.
Another model: once we accept that there are problems which will not be noticed over any relevant timescale, the distribution is guaranteed to be bimodal: there are problems which will be noticed in some reasonable time, and problems which won’t. Then the only question is: what’s the relevant timescale after which most problems which will be noticed at all are noticed? Looking at the world, it sure seems like that timescale is “minutes”, not “decades”, but that’s not really the key step here. The key step is realizing that there are plenty of problems which will not be noticed in any relevant amount of time. At that point, we’re definitely in a world where “almost all problems are either very easy or very hard”; it’s just a question of exactly how much time corresponds to “very easy”.
One model: people either have the right mental toolkit to ask the relevant questions, or they don’t. Either they know to look for balancing flows in the air conditioner case, or they don’t. Either they know to ask about replication rates in the medical example, or they don’t. Either they know to ask about impact measures for charities, or they don’t. Young people might pick up these skills over time, but most people stop adding to their mental toolkit at a meaningful rate once they’re out of school.
Two points about this:
1. People can notice that there’s a problem and narrow it down to the air conditioner given enough time, even if they have no gears-level understanding of what’s happening. For example, the Romans knew nothing about the mechanics of how malaria spreads, but they figured out that it has something to do with “bad air”, hence the name “malaria”. It’s entirely possible that such an understanding will not be here in 6 months but will be here in 50 years, and I suspect that’s more or less what happened in the case of malaria.
2. Thinking in terms of “most people” is reasonable in the case of the air conditioner, but it seems like a bad idea when it comes to AI alignment, since the people working on the problem will be quite far away from the center of the distribution when it comes to many different traits.
Another model: once we accept that there are problems which will not be noticed over any relevant timescale, the distribution is guaranteed to be bimodal: there are problems which will be noticed in some reasonable time, and problems which won’t. Then the only question is: what’s the relevant timescale after which most problems which will be noticed at all are noticed? Looking at the world, it sure seems like that timescale is “minutes”, not “decades”, but that’s not really the key step here. The key step is realizing that there are plenty of problems which will not be noticed in any relevant amount of time. At that point, we’re definitely in a world where “almost all problems are either very easy or very hard”; it’s just a question of exactly how much time corresponds to “very easy”.
I don’t think I like this framing because I don’t think it gets us to the conclusion we want. The Jeffreys prior is also bimodal and it doesn’t have this big discrepancy at any timescale. If the Jeffreys prior is applicable to the situation, then if a problem hasn’t been solved in T years your mean forecast for how long it will take to solve is ≈2T years.
You’re assuming not only that the prior is bimodal, but that it’s “strongly” bimodal, whatever that means. In the beta distribution case it corresponds to taking n to be very large. Your first argument could do this, but I’m skeptical about it for the two reasons I’ve mentioned in response to it above.
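To put a number on “strongly bimodal” (again just my own illustration, with an arbitrary 0.01 cutoff): the share of Beta(1/n, 1/n) prior mass sitting within 0.01 of the endpoints grows quickly with n.

```python
# Fraction of Beta(1/n, 1/n) prior mass near the endpoints of [0, 1].
from scipy.stats import beta

for n in [2, 10, 100, 1000]:
    a = 1.0 / n
    mass_near_edges = beta.cdf(0.01, a, a) + (1.0 - beta.cdf(0.99, a, a))
    print(f"n = {n:>4}: P(q < 0.01 or q > 0.99) = {mass_near_edges:.3f}")
# n = 2 is the Jeffreys prior, which keeps most of its mass away from the extremes;
# by n = 1000 essentially all of the mass is within 0.01 of 0 or 1.
```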
You’re assuming not only that the prior is bimodal, but that it’s “strongly” bimodal, whatever that means.
Stronger than that, even. I’m saying that my distribution over rate-of-problem-solving has a delta spike at zero, mixed with some other distribution at nonzero rates.
Which is indeed how realistic priors should usually look! If I flip a coin 50 times and it comes up heads all 50 times, then I think it’s much more likely that this coin simply has heads on both sides (or some other reason to come up basically-always-heads) than that it has a 1/100 or smaller (but importantly nonzero) chance of coming up tails. The prior which corresponds to that kind of reasoning is a delta spike on 0% heads, a delta spike on 0% tails, and then some weight on a continuous distribution between those two.
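A minimal sketch of that kind of spike-plus-continuous prior for the coin (the mixture weights below are my own illustrative assumptions, not anything stated above):

```python
# Spike-and-slab posterior for the coin: point masses on "always heads" and
# "always tails", plus a uniform distribution over intermediate biases.
from scipy.integrate import quad

w_heads, w_tails, w_cont = 0.05, 0.05, 0.90   # assumed prior weights
n_heads = 50                                  # observed: 50 heads in 50 flips

lik_spike_heads = 1.0                                 # P(data | always heads)
lik_spike_tails = 0.0                                 # P(data | always tails)
lik_cont, _ = quad(lambda p: p**n_heads, 0.0, 1.0)    # = 1/51 under a uniform bias

evidence = w_heads * lik_spike_heads + w_tails * lik_spike_tails + w_cont * lik_cont
print(w_heads * lik_spike_heads / evidence)   # P(always heads | data) ~ 0.74 here
```

The data multiply the prior odds of “always heads” against a uniform bias by a factor of 51, so the delta spike wins unless its prior weight was tiny.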
Right, but then it seems like you get back to what I said in my original comment: this gets you to
lim_{T→∞} P(major alignment failure|takeoff duration=T) = p ≫ 0,
which I think is quite reasonable, but it doesn’t get you to “the probability is roughly constant as T varies”, because you’re only controlling the tail near zero and not near infinity. If you control both tails then you’re back to where we started, and the difference between a delta spike and a smoothed-out version of the delta isn’t that important in this context.
Let T_eq be the first time at which P(major alignment failure|takeoff duration=T) is within ϵ of p. As long as ϵ is small, the probability will be roughly constant with time after T_eq. Thus, the probability is roughly constant as T varies, once we get past some initial period.
(Side note: in order for this to be interesting, we want ϵ small relative to p.)
For instance, we might expect that approximately-anyone who’s going to notice a particular problem at all will notice it in the first week, so T_eq is on the order of a week, and the probability of noticing a problem is approximately constant with respect to time for times much longer than a week.
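A toy version of that picture (my own numbers: assume some fraction of problems is never noticed at all, and the rest are noticed at an exponential rate with a mean of a couple of days):

```python
# P(problem still unnoticed after T days of iteration) under a two-component model:
# a never-noticed fraction p_never plus an exponentially-noticed remainder.
import numpy as np

p_never = 0.3       # assumed fraction of problems no amount of iteration catches
tau_days = 2.0      # assumed mean time-to-notice for the problems that do get caught

def p_unnoticed(T_days):
    return p_never + (1.0 - p_never) * np.exp(-T_days / tau_days)

for T in [1, 7, 180, 365 * 2, 365 * 50]:
    print(f"T = {T:>6} days: P(unnoticed) = {p_unnoticed(T):.4f}")
# Within about a week the curve is already within ~0.02 of p_never, so on this model
# longer takeoffs barely change it; this is the "roughly constant past T_eq" picture.
```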
I agree with that, but I don’t see where the justification for T_eq ≈ 1 week comes from. You can’t get there just from “there are problems that won’t be noticed at any relevant timescale”, and I think the only argument you’ve given so far for why the “intermediate timescales” should be sparsely populated by problems is your first model, which I didn’t find persuasive for the reasons I gave.