I disagree with most of the empirical claims in this post, and dislike most of the words.
But I do like the framework of valuing actions based on how much they reduce the log odds of doom. Some reasons I like it:
I think it makes it easier to communicate between people with very different background views about alignment; it seems much easier to agree about whether something reduces doom by 1 bit, than to agree about whether it cuts doom by 50%.
It seems like the right way to prepare for positive surprises if you are pessimistic.
I think it’s correct that saying “well if there’s a chance it has to be X...” very often dooms you on the starting line, and I think this fuzzier way of preparing for positive surprises is often better. I agree that in many contexts you shouldn’t say “let’s condition on somehow getting enough dignity points, since that’s the only way it matters” and should instead just fight for what you perceive as a dignity point.
I tentatively think that it’s better for having intuition about effect sizes and comparing different interventions.
I’m a bit surprised by the claim that it’s much easier to agree about whether something reduces doom by 1 bit than about whether it cuts doom by 50%. (I was also a bit confused about what work “log odds” was doing compared to other probability framings in the first place, and so were some others I’ve chatted with.) Have you run into people or conversations where this was helpful?
(I guess the same question for Eliezer. I had a vague sense of “log odds put the question into units that are more motivating to reason about”, but was fuzzy on the details. I’m not fluent in log odds; I have some sense that if we were actually measuring precise probabilities it might make the math easier, but I haven’t seen anyone make arguments about success chances that were grounded in something rigorous enough for doing actual math to be useful.)
Likelihood ratios can be easier to evaluate than absolute probabilities insofar as you can focus on the meaning of a single piece of evidence, separately from the context of everything else you believe.
Suppose we’re trying to pick a multiple-dial combination lock (in the dark, where we can’t see how many dials there are). If we somehow learn that the first digit is 3, we can agree that that’s log₂(10) ≈ 3.32 bits of progress towards cracking the lock, even if you think there are three dials total (such that we only need about 6.64 more bits) and I think there are ten dials total (such that we need about 29.9 more bits).
Similarly, alignment pessimists and optimists might be able to agree that reinforcement learning from human feedback techniques are a good puzzle piece to have (we don’t have to write a reward function by hand! that’s progress!), even if pessimists aren’t particularly encouraged overall (because Goodhart’s curse, inner-misaligned mesa-optimizers, &c. are still going to kill you).
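To make the lock arithmetic above concrete, here’s a minimal numerical sketch (my own, not from the comment; the dial counts are just the ones in the example):

```python
import math

def bits_from_ruling_out(n_possibilities: int) -> float:
    """Information gained by narrowing n equally likely possibilities down to 1."""
    return math.log2(n_possibilities)

# Learning one digit of a 10-position dial is log2(10) ~= 3.32 bits of progress,
# no matter how many dials anyone thinks the lock has.
progress = bits_from_ruling_out(10)

# Remaining work under each worldview (each unknown dial has 10 positions):
optimist_remaining = 2 * bits_from_ruling_out(10)   # 3 dials total  -> ~6.64 bits to go
pessimist_remaining = 9 * bits_from_ruling_out(10)  # 10 dials total -> ~29.9 bits to go

print(f"progress from learning one digit: {progress:.2f} bits")
print(f"optimist's remaining work:        {optimist_remaining:.2f} bits")
print(f"pessimist's remaining work:       {pessimist_remaining:.2f} bits")
```

Both parties agree on the first number; they only disagree about the last two.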
Thanks, I found it useful to have this explanation alongside Paul’s. (I think each explanation would have helped a bit on its own, but having it explained in two different languages gave me a clearer sense of how the math worked and how to conceptualize it.)
More intuitive illustration with no logarithms: your plane crashed in the ocean. To survive, you must swim to shore. You know that the shore is west, but you don’t know how far.
The optimist thinks the shore is just over the horizon; we only need to swim a few miles and we’ll almost certainly make it. The pessimist thinks the shore is a thousand miles away and we will surely die. But the optimist and pessimist can both agree on how far we’ve swum up to this point, and that the most dignified course of action is “Swim west as far as you can.”
Suppose that Eliezer thinks there is a 99% risk of doom, and I think there is a 20% risk of doom.
Suppose that we solve some problem we both think of as incredibly important: say, we find a way to solve ontology identification and make sense of the alien knowledge a model has about the world and about how to think, and it actually looks pretty practical and promising, suggests an angle of attack on other big theoretical problems, and generally suggests all these difficulties may be more tractable than we thought.
If that’s an incredible smashing success maybe my risk estimate has gone down from 20% to 10%, cutting risk in half.
And if it’s an incredible smashing success maybe Eliezer thinks that risk has gone down from 99% to 98%, cutting risk by ~1%.
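To put numbers on that contrast, here’s a small sketch (my own arithmetic, not from the original comment) converting both updates into bits of reduction in the log odds of doom:

```python
import math

def doom_log_odds_bits(p_doom: float) -> float:
    """Log odds of doom, in bits: log2(p / (1 - p))."""
    return math.log2(p_doom / (1 - p_doom))

def bits_of_improvement(p_before: float, p_after: float) -> float:
    """Bits shaved off the log odds of doom by moving from p_before to p_after."""
    return doom_log_odds_bits(p_before) - doom_log_odds_bits(p_after)

print(bits_of_improvement(0.20, 0.10))  # optimist:  20% -> 10%, ~1.17 bits
print(bits_of_improvement(0.99, 0.98))  # pessimist: 99% -> 98%, ~1.01 bits
```

Measured in bits, the two updates are nearly the same size, even though one cuts risk in half and the other only removes about a percentage point.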
I think there are basically just two separate issues at stake:
How much does this help solve the problem? I think this is mostly captured by bits of log-odds reduction, and it’s not where the real disagreement is.
How much are we doomed anyway, such that it doesn’t matter?
Suppose that to solve alignment the quality of our alignment research effort has to be greater than some threshold. If the distribution of possible output quality is logistic, and research moves the mean of the distribution, then I think we gain a constant amount of log-odds per unit of research quality, regardless of where we think the threshold is.
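Here is a quick sketch of that claim under the stated logistic assumption (the scale parameter and the particular threshold values below are arbitrary choices for illustration):

```python
import math

def success_log_odds_bits(mean: float, threshold: float, scale: float = 1.0) -> float:
    """Log odds (in bits) that Logistic(mean, scale) research quality clears the threshold.
    For a logistic distribution, P(X > t) = sigmoid((mean - t) / scale),
    so the log odds of success are exactly (mean - t) / scale nats."""
    return (mean - threshold) / scale / math.log(2)

# Raising the mean by one unit of research quality adds the same number of bits,
# whether the threshold is low (optimist) or high (pessimist):
for threshold in (1.0, 5.0, 20.0):
    gain = success_log_odds_bits(1.0, threshold) - success_log_odds_bits(0.0, threshold)
    print(f"threshold {threshold:5.1f}: +1 quality gains {gain:.3f} bits")
# Every line prints ~1.443 bits: the gain does not depend on where the threshold sits.
```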