When Alice uses a model with more free parameters, you need to posit a bias before you can predict a systematic direction in which Alice will make mistakes. So this only bites you if you have a bias towards optimism.
That is, when I give Optimistic Alice fewer constraints, she can more easily imagine a solution, and when I give Pessimistic Bob fewer constraints, he can more easily imagine that no solution is possible? I think… this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math. Like, the way Bob fails to find a solution mostly looks like “not actually considering the space”, or “wasting consideration on easily-known-bad parts of the space”, and more constraints could help with both of those. But, as math, removing constraints can’t lower the volume of the implied space and so can’t make it less likely that a viable solution exists.
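The set-theoretic point at the end can be checked mechanically; here is a throwaway sketch (the search space and constraints are invented purely for illustration):

```python
# Removing a constraint can only grow (never shrink) the feasible set.
space = range(100)
constraints = [lambda x: x % 2 == 0, lambda x: x > 50, lambda x: x % 7 == 0]

with_all = [x for x in space if all(c(x) for c in constraints)]
without_last = [x for x in space if all(c(x) for c in constraints[:-1])]

assert set(with_all) <= set(without_last)  # feasible set only grows
print(len(with_all), len(without_last))    # 4 24
```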
I know Eliezer thinks I have such a bias. I disagree with him.
I think Eliezer thinks nearly all humans have such a bias by default, and so without clear evidence to the contrary it’s a reasonable suspicion for anyone.
[I think there’s a thing Eliezer does a lot, which I have mixed feelings about, which is matching people’s statements to patterns and then responding to the generator of the pattern in Eliezer’s head, which only sometimes corresponds to the generator in the other person’s head.]
I agree that this is true in some platonic sense.
Cool, makes sense. [I continue to think we disagree about how true this is in a practical sense, where I read you as thinking “yeah, this is a minor consideration, we have to think with the tools we have access to, which could be wrong in either direction and so are useful as a point estimate” and me as thinking “huh, this really seems like the tools we have access to are going to give us overly optimistic answers, and we should focus more on how to get tools that will give us more robust answers.”]
[I think there’s a thing Eliezer does a lot, which I have mixed feelings about, which is matching people’s statements to patterns and then responding to the generator of the pattern in Eliezer’s head, which only sometimes corresponds to the generator in the other person’s head.]
I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They’d go around telling me “Ray, you’re exhibiting that bias right now. Whatever rationalization you’re coming up with right now, it’s not the real reason you’re arguing X.” And I was like “c’mon man. I have a ton of introspective access to myself and I can tell that this ‘rationalization’ is actually a pretty good reason to believe X and I trust that my reasoning process is real.”
But… eventually I realized I just actually had two motivations going on. When I introspected, I was running a check for a positive result on “is Ray displaying rational thought?”. When they extrospected me (i.e., by reading my facial expressions), they were checking for a positive result on “does Ray seem biased in this particular way?”.
And both checks totally returned ‘true’, and that was an accurate assessment.
The particular moment where I noticed this metapattern, I’d say my cognition was, say, 65% “good argumentation”, 15% “one particular bias”, 20% “other random stuff.” On a different day, it might have been that I was 65% exhibiting the bias and 15% “good argumentation.”
None of this makes much of a claim about what’s likely to be going on in Rohin’s head or Eliezer’s head, or about whether Eliezer’s conversational pattern is useful, but I wanted to flag it as a way people could be talking past each other.
I think… this feels true as a matter of human psychology of problem-solving, or something, and not as a matter of math.
I think we’re imagining different toy mathematical models.
Your model, according to me:
There is a space of possible approaches, that we are searching over to find a solution. (E.g. the space of all possible programs.)
We put a layer of abstraction on top of this space, characterizing approaches by N different “features” (e.g. “is it goal-directed”, “is it an oracle”, “is it capable of destroying the world”)
Because we’re bounded agents, we then treat the features as independent, and search for some combination of features that would comprise a solution.
I agree that this procedure has a systematic error in claiming that there is a solution when none exists (and doesn’t have the opposite error), and that if this were an accurate model of how I was reasoning I should be way more worried about correcting for that problem.
My model:
There is a probability distribution over “ways the world could be”.
We put a layer of abstraction on top of this space, characterizing “ways the world could be” by N different “features” (e.g. “can you get human-level intelligence out of a pile of heuristics”, “what are the returns to specialization”, “how different will AI ontologies be from human ontologies”). We estimate the marginal probability of each of those features.
Because we’re bounded agents, when we need the joint probability of two or more features, we treat them as independent and just multiply.
Given a proposed solution, we estimate its probability of working by identifying which features need to be true of the world for the solution to work, and then estimate the probability of those features (using the method above).
I claim that this procedure doesn’t have a systematic error in the direction of optimism (at least until you add some additional details), and that this procedure more accurately reflects the sort of reasoning that I am doing.
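A minimal sketch of that four-step procedure (my own rendering; the feature names and marginal probabilities are made up for illustration, not actual estimates):

```python
# Step 2-3: marginal probability estimates for features of the world,
# with joints approximated by treating features as independent.
marginals = {
    "heuristics_suffice": 0.4,   # hypothetical marginal estimates
    "similar_ontologies": 0.5,
}

def joint(features):
    """Bounded-agent approximation: treat features as independent and multiply."""
    p = 1.0
    for f in features:
        p *= marginals[f]
    return p

# Step 4: a proposed solution that needs both features to hold.
print(joint(["heuristics_suffice", "similar_ontologies"]))  # 0.2
```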
Huh, why doesn’t that procedure have that systematic error?
Like, when I try to naively run your steps 1-4 on “probability of there existing a number that’s both even and odd”, I get that about 25% of numbers should be both even and odd, so it seems pretty likely that it’ll work out given that there are at least 4 numbers. But I can’t easily construct an argument at a similar level of sophistication that gives me an underestimate. [Like, “probability of there existing a number that’s both odd and prime” gives the wrong conclusion if you buy that the probability that a natural number is prime is 0, but this is because you evaluated your limits in the wrong order, not because of a problem with dropping all the covariance data from your joint distribution.]
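The even-and-odd overestimate can be made concrete with a toy computation (my construction, purely illustrative):

```python
# "x is even" and "x is odd" are perfectly anti-correlated, so the
# independence estimate overestimates the joint probability.
numbers = range(1000)

p_even = sum(1 for x in numbers if x % 2 == 0) / len(numbers)  # 0.5
p_odd = sum(1 for x in numbers if x % 2 == 1) / len(numbers)   # 0.5

naive_joint = p_even * p_odd  # 0.25 under the independence assumption
true_joint = sum(1 for x in numbers if x % 2 == 0 and x % 2 == 1) / len(numbers)

print(naive_joint, true_joint)  # 0.25 0.0
```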
My first guess is that you think I’m doing the “ways the world could be” thing wrong—like, I’m looking at predicates over numbers and trying to evaluate a predicate over all numbers, but instead I should just have a probability on “universe contains a number that is both even and odd” and its complement, as those are the two relevant ways the world can be.
My second guess is that you’ve got a different distribution over target predicates; like, we can just take the complement of my overestimate (“probability of there existing no numbers that are both even and odd”) and call it an underestimate. But I think I’m more interested in ‘overestimating existence’ than ‘underestimating non-existence’. [Is this an example of the ‘additional details’ you’re talking about?]
Also maybe you can just exhibit a simple example that has an underestimate, and then we need to think harder about how likely overestimates and underestimates are to see if there’s a net bias.
It’s the first guess.

I think if you have a particular number then I’m like “yup, it’s fair to notice that we overestimate the probability that x is even and odd by saying it’s 25%”, and then I’d say “notice that we underestimate the probability that x is even and divisible by 4 by saying it’s 12.5%”.
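A toy computation (my construction, purely illustrative) makes the underestimate concrete: for positively correlated features, the independence estimate comes out too low.

```python
# "x is even" and "x is divisible by 4" are positively correlated
# (the second implies the first), so multiplying marginals underestimates.
numbers = range(1000)

p_even = sum(1 for x in numbers if x % 2 == 0) / len(numbers)  # 0.5
p_div4 = sum(1 for x in numbers if x % 4 == 0) / len(numbers)  # 0.25

naive_joint = p_even * p_div4  # 0.125 under the independence assumption
true_joint = sum(1 for x in numbers if x % 2 == 0 and x % 4 == 0) / len(numbers)

print(naive_joint, true_joint)  # 0.125 0.25
```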
I agree that if you estimate a probability, and then “perform search” / “optimize” / “run n copies of the estimate” (so that you estimate the probability as 1 - (1 - P(event))^n), then you’re going to have systematic errors.
I don’t think I’m doing anything that’s analogous to that. I definitely don’t go around thinking “well, it seems 10% likely that such and such feature of the world holds, and so each alignment scheme I think of that depends on this feature has a 10% chance of working, therefore if I think of 10 alignment schemes I’ve solved the problem”. (I suspect this is not the sort of mistake you imagine me doing but I don’t think I know what you do imagine me doing.)
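The search-amplified mistake described above can be simulated directly; a hedged toy model (numbers chosen to mirror the 10%-feature, 10-schemes example, which is explicitly not claimed to be anyone's actual reasoning):

```python
import random

# n schemes each work iff ONE shared feature of the world holds
# (probability p). Treating the schemes as independent gives
# 1 - (1 - p)^n, a large overestimate of the chance that at least
# one works; in truth that chance is just p.
random.seed(0)
p, n, trials = 0.1, 10, 100_000

independent_estimate = 1 - (1 - p) ** n  # ~0.65

hits = 0
for _ in range(trials):
    feature_holds = random.random() < p  # one draw shared by all n schemes
    if feature_holds:
        hits += 1
true_prob = hits / trials  # ~0.1

print(round(independent_estimate, 3), round(true_prob, 3))
```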
I’d say “notice that we underestimate the probability that x is even and divisible by 4 by saying it’s 12.5%”.
Cool, I like this example.
I agree that if you estimate a probability, and then “perform search” / “optimize” / “run n copies of the estimate” (so that you estimate the probability as 1 - (1 - P(event))^n), then you’re going to have systematic errors. ... I suspect this is not the sort of mistake you imagine me doing but I don’t think I know what you do imagine me doing.
I think the thing I’m interested in is “what are our estimates of the output of search processes?”. The question we’re ultimately trying to answer with a model here is something like “are humans, when they consider a problem that could have attempted solutions of many different forms, overly optimistic about how solvable those problems are because they hypothesize a solution with inconsistent features?”
The example of “a number divisible by 2 and a number divisible by 4” is an example of where the consistency of your solution helps you—anything that satisfies the second condition is already satisfying the first condition. But importantly, the best you can do here is ignore superfluous conditions; they can’t increase the volume of the solution space. I think this is where the systematic bias is coming from (the joint probability of two conditions can’t be higher than the minimum of those two conditions’ probabilities, while it can be lower than that minimum, and so the product isn’t an unbiased estimator of the joint).
For example, consider this recent analysis of cultured meat, which seems to me to point out a fundamental inconsistency of this type in people’s plans for creating cultured meat. Basically, the bigger you make a bioreactor, the better it looks on criteria ABC, and the smaller you make a bioreactor, the better it looks on criteria DEF, and projections seem to suggest that massive progress will be made on all of those criteria simultaneously because progress can be made on them individually. But this necessitates making bioreactors that are simultaneously much bigger and much smaller!
[Sometimes this is possible, because actually one is based on volume and the other is based on surface area, and so when you make something like a zeolite you can combine massive surface area with tiny volume. But if you need massive volume and tiny surface area, that’s not possible. Anyway, in this case, my read is that both of these are based off of volume, and so there’s no clever technique like that available.]
Maybe you could step me thru how your procedure works for estimating the viability of cultured meat, or the possibility of constructing a room temperature <10 atm superconductor, or something?
It seems to me like there’s a version of your procedure which, like, considers all of the different possible factory designs, applies some functions to determine the high-level features of those designs (like profitability, amount of platinum they consume, etc.), and then when we want to know “is there a profitable cultured meat factory?” responds with “conditioning on profitability > 0, this is the set of possible designs.” And then when I ask “is there a profitable cultured meat factory using less than 1% of the platinum available on Earth?” says “sorry, that query is too difficult; I calculated the set of possible designs conditioned on profitability, calculated the set of possible designs conditioned on using less than 1% of the platinum available on Earth, and then <multiplied sets together> to give you this approximate answer.”
But of course that’s not what you’re doing, because the boundedness prevents you from considering all the different possible factory designs. So instead you have, like, clusters of factory designs in your map? But what are those objects, and how do they work, and why don’t they have the problem of not noticing inconsistencies because they didn’t fully populate the details? [Or if they did fully populate the details for some limited number of considered objects, how do you back out the implied probability distribution over the non-considered objects in a way that isn’t subject to this?]
Re: cultured meat example: If you give me examples in which you know the features are actually inconsistent, my method is going to look optimistic when it doesn’t know about that inconsistency. So yeah, assuming your description of the cultured meat example is correct, my toy model would reproduce that problem.
To give a different example, consider OpenAI Five. One would think that to beat Dota, you need to have an algorithm that allows you to do hierarchical planning, state estimation from partial observability, coordination with team members, understanding of causality, compression of the giant action space, etc. Everyone looked at this giant list of necessary features and thought “it’s highly improbable for an algorithm to demonstrate all of these features”. My understanding is that even OpenAI, the most optimistic of everyone, thought they would need to do some sort of hierarchical RL to get this to work. In the end, it turned out that vanilla PPO with reward shaping and domain randomization was enough. It turns out that all of these many different capabilities / features were very consistent with each other and easier to achieve simultaneously than we thought.
so the product isn’t an unbiased estimator of the joint
Tbc, I don’t want to claim “unbiased estimator” in the mathematical sense of the phrase. To even make such a claim you need to choose some underlying probability distribution which gives rise to our features, which we don’t have. I’m more saying that the direction of the bias depends on whether your features are positively vs. negatively correlated with each other and so a priori I don’t expect the bias to be in a predictable direction.
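For reference, the point about correlation direction can be made precise with the standard Fréchet bounds for two events $A$ and $B$:

```latex
\max\bigl(0,\; P(A) + P(B) - 1\bigr) \;\le\; P(A \wedge B) \;\le\; \min\bigl(P(A),\, P(B)\bigr)
```

The independence estimate $P(A)P(B)$ always lies inside these bounds: it is at most either marginal, and $P(A)P(B) \ge P(A) + P(B) - 1$ because $(1 - P(A))(1 - P(B)) \ge 0$. So the product can err in either direction, depending on the sign of the correlation between the conditions.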
But what are those objects, and how do they work, and why don’t they have the problem of not noticing inconsistencies because they didn’t fully populate the details?
They definitely have that problem. I’m not sure how you don’t have that problem; you’re always going to have some amount of abstraction and some amount of inconsistency; the future is hard to predict for bounded humans, and you can’t “fully populate the details” as an embedded agent.
If you’re asking how you notice any inconsistencies at all (rather than all of the inconsistencies), then my answer is that you do in fact try to populate details sometimes, and that can demonstrate inconsistencies (and consistencies).
I can sketch out a more concrete, imagined-in-hindsight-and-therefore-false story of what’s happening.
Most of the “objects” are questions about the future to which there are multiple possible answers, which you have a probability distribution over (you can think of this as a factor in a Finite Factored Set, with an associated probability distribution over the answers). For example, you could imagine a question for “number of AGI orgs with a shot at time X”, “fraction of people who agree alignment is a problem”, “amount of optimization pressure needed to avoid deception”, etc. If you provide answers to some subset of questions, that gives you an incomplete possible world (which you could imagine as an implicitly-represented set of possible worlds if you want). Given an incomplete possible world, to answer a new question quickly you reason abstractly from the answers you are conditioning on to get an answer to the new question.
When you have lots of time, you can improve your reasoning in many different ways:
You can find other factors that seem important, add them in, subdividing worlds out even further.
You can take two factors, and think about how compatible they are with each other, building intuitions about their joint (rather than just their marginal probabilities, which is what you have by default).
You can take some incomplete possible world, sketch out lots of additional concrete details, and see if you can spot inconsistencies.
You can refactor your “main factors” to be more independent of each other. For example, maybe you notice that all of your reasoning about things like “<metric> at time X” depends a lot on timelines, and so you instead replace them with factors like “<metric> at X years before crunch time”, where they are more independent of timelines.
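A rough sketch (my construction; the factor names and probabilities are invented, not actual estimates) of how such factored incomplete worlds might be represented:

```python
# Each factor has a marginal distribution over its possible answers.
# An incomplete possible world fixes answers to a subset of factors;
# joints default to products of marginals until refined.
factors = {
    "n_agi_orgs": {1: 0.3, 3: 0.5, 10: 0.2},          # made-up marginals
    "alignment_consensus": {"low": 0.6, "high": 0.4},
}

def p_incomplete_world(assignment):
    """P(world) under the default independence approximation."""
    p = 1.0
    for factor, answer in assignment.items():
        p *= factors[factor][answer]
    return p

# Fixing two factor answers gives an incomplete possible world:
print(p_incomplete_world({"n_agi_orgs": 3, "alignment_consensus": "low"}))  # 0.3
```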