Re: cultured meat example: If you give me examples in which you know the features are actually inconsistent, my method is going to look optimistic when it doesn’t know about that inconsistency. So yeah, assuming your description of the cultured meat example is correct, my toy model would reproduce that problem.
To give a different example, consider OpenAI Five. One would think that to beat Dota, you need to have an algorithm that allows you to do hierarchical planning, state estimation from partial observability, coordination with team members, understanding of causality, compression of the giant action space, etc. Everyone looked at this giant list of necessary features and thought “it’s highly improbable for an algorithm to demonstrate all of these features”. My understanding is that even OpenAI, the most optimistic of everyone, thought they would need to do some sort of hierarchical RL to get this to work. In the end, it turned out that vanilla PPO with reward shaping and domain randomization was enough. It turns out that all of these many different capabilities / features were very consistent with each other and easier to achieve simultaneously than we thought.
so the product isn’t an unbiased estimator of the joint
Tbc, I don’t want to claim “unbiased estimator” in the mathematical sense of the phrase. To even make such a claim you need to choose some underlying probability distribution which gives rise to our features, which we don’t have. I’m more saying that the direction of the bias depends on whether your features are positively vs. negatively correlated with each other and so a priori I don’t expect the bias to be in a predictable direction.
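To make the direction-of-bias point concrete, here is a toy sketch (my own illustration, not from the discussion) with two binary features, comparing the joint probability P(A and B) to the product of the marginals P(A) * P(B):

```python
def joint_vs_product(p11, p10, p01, p00):
    """Compare P(A and B) to P(A) * P(B) for two binary features.

    Arguments are the joint probabilities P(A=1,B=1), P(A=1,B=0),
    P(A=0,B=1), P(A=0,B=0); they must sum to 1.
    """
    assert abs(p11 + p10 + p01 + p00 - 1.0) < 1e-9
    p_a = p11 + p10  # marginal P(A)
    p_b = p11 + p01  # marginal P(B)
    return p11, p_a * p_b

# Positively correlated features: the product of marginals is pessimistic.
joint, prod = joint_vs_product(0.4, 0.1, 0.1, 0.4)
assert joint > prod  # 0.4 > 0.25

# Negatively correlated features: the product of marginals is optimistic
# (the cultured-meat-style failure mode).
joint, prod = joint_vs_product(0.1, 0.4, 0.4, 0.1)
assert joint < prod  # 0.1 < 0.25

# Independent features: the product recovers the joint exactly.
joint, prod = joint_vs_product(0.25, 0.25, 0.25, 0.25)
assert abs(joint - prod) < 1e-9
```

On this reading, the OpenAI Five case is the first branch: the required capabilities were positively correlated (one algorithm could deliver all of them), so multiplying per-feature probabilities was too pessimistic.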
But what are those objects, and how do they work, and why don’t they have the problem of not noticing inconsistencies because they didn’t fully populate the details?
They definitely have that problem. I’m not sure how you don’t have that problem; you’re always going to have some amount of abstraction and some amount of inconsistency; the future is hard to predict for bounded humans, and you can’t “fully populate the details” as an embedded agent.
If you’re asking how you notice any inconsistencies at all (rather than all of the inconsistencies), then my answer is that you do in fact try to populate details sometimes, and that can demonstrate inconsistencies (and consistencies).
I can sketch out a more concrete, imagined-in-hindsight-and-therefore-false story of what’s happening.
Most of the “objects” are questions about the future to which there are multiple possible answers, which you have a probability distribution over (you can think of this as a factor in a Finite Factored Set, with an associated probability distribution over the answers). For example, you could imagine a question for “number of AGI orgs with a shot at time X”, “fraction of people who agree alignment is a problem”, “amount of optimization pressure needed to avoid deception”, etc. If you provide answers to some subset of questions, that gives you an incomplete possible world (which you could imagine as an implicitly-represented set of possible worlds if you want). Given an incomplete possible world, to answer a new question quickly you reason abstractly from the answers you are conditioning on to get an answer to the new question.
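A minimal, hypothetical rendering of this picture (the factor names and answer sets below are illustrative stand-ins, not claims about the actual factors):

```python
from itertools import product

# Each "factor" is a question about the future with a set of possible answers.
factors = {
    "num_agi_orgs": ["few", "many"],
    "alignment_consensus": ["low", "high"],
    "deception_pressure": ["low", "high"],
}

def completions(partial):
    """Yield every complete world consistent with a partial assignment."""
    names = list(factors)
    for values in product(*(factors[n] for n in names)):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in partial.items()):
            yield world

# Answering a subset of questions gives an incomplete possible world,
# implicitly representing the set of complete worlds that agree with it.
worlds = list(completions({"num_agi_orgs": "few"}))
assert len(worlds) == 4  # the two remaining binary factors stay free
```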
When you have lots of time, you can improve your reasoning in many different ways:
You can find other factors that seem important, add them in, subdividing the worlds even further.
You can take two factors, and think about how compatible they are with each other, building intuitions about their joint (rather than just their marginal probabilities, which is what you have by default).
You can take some incomplete possible world, sketch out lots of additional concrete details, and see if you can spot inconsistencies.
You can refactor your “main factors” to be more independent of each other. For example, maybe you notice that all of your reasoning about things like “<metric> at time X” depends a lot on timelines, and so you instead replace them with factors like “<metric> at X years before crunch time”, where they are more independent of timelines.
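To illustrate the last refactoring with a made-up generative model (nothing here is from the discussion): if the metric is really driven by proximity to crunch time, then "metric at year X" is entangled with timelines, while "metric X years before crunch time" barely is:

```python
import random

def corr(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def metric(rate, years_until_crunch):
    # Made-up model: the metric climbs as crunch time approaches.
    return rate / (1 + years_until_crunch)

random.seed(0)
crunch, at_fixed_year, before_crunch = [], [], []
for _ in range(10_000):
    crunch_year = random.uniform(25, 40)  # uncertain timelines
    rate = random.uniform(0.8, 1.2)       # idiosyncratic growth rate
    crunch.append(crunch_year)
    at_fixed_year.append(metric(rate, crunch_year - 20))  # "at year 20"
    before_crunch.append(metric(rate, 5))  # "5 years before crunch time"

# "Metric at year 20" is heavily entangled with timelines...
assert abs(corr(crunch, at_fixed_year)) > 0.5
# ...while "metric 5 years before crunch" is nearly independent of them.
assert abs(corr(crunch, before_crunch)) < 0.1
```

The refactored factor isolates the part of the uncertainty that is genuinely about the metric, so intuitions about it transfer across different timelines.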