So, part of my argument here is that “limited in breaking down the black boxes” is the wrong way to view the limitation. It’s limited attention, time, etc. That doesn’t necessarily translate into a limitation in breaking down black boxes, especially if you have some general knowledge about how to usefully break down the relevant black boxes.
And that’s where the “a few constraints” part comes in. Like, when Eliezer says “my model doesn’t have strong predictions about X or Y or Z, except that you can’t find settings of all of these variables which solve the alignment problem”, that’s actually a fairly simple constraint on its own. It’s simpler and more useful than a black-box of Eliezer’s predictions or policy suggestions. That’s the sort of thing we want to extract.
But when Paul (or most other alignment researchers) says “in fact you can find settings of those variables which solve the alignment problem”, now we’ve got another high-level claim about the world which is inconsistent with the first one. So if your strategy for operating under multiple worldviews is to build a new worldview by combining claims like these, then you’ll hit a contradiction pretty quickly; and in this case it’s highly nontrivial to figure out how to resolve that contradiction, or to produce a sensible policy from a starting set of contradictory beliefs. Whereas if you calculate the separate policies first, it may well turn out that they’re almost entirely consistent with each other.
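To make the “calculate the separate policies first” move concrete, here’s a toy sketch in Python (my own illustration; the worldview labels and candidate policies are made-up placeholders, not anyone’s actual positions): each worldview judges the candidate policies on its own terms, and only afterwards do we look for overlap.

```python
# Toy sketch: "derive a policy set per worldview, then intersect",
# as opposed to "merge the claims first, then derive one policy".
# Worldview names and policy options below are placeholders.

CANDIDATE_POLICIES = ["interpretability push", "scalable oversight", "compute governance"]

def acceptable_under(worldview: str, policy: str) -> bool:
    """Stand-in for each worldview's own evaluation of a candidate policy."""
    judgments = {
        "doom-by-default": {"interpretability push", "compute governance"},
        "alignment-by-default": {"interpretability push", "scalable oversight"},
    }
    return policy in judgments[worldview]

# Compute each worldview's acceptable set separately...
per_worldview = {
    w: {p for p in CANDIDATE_POLICIES if acceptable_under(w, p)}
    for w in ("doom-by-default", "alignment-by-default")
}
# ...then look for overlap.
overlap = set.intersection(*per_worldview.values())
print(overlap)  # {'interpretability push'}
```

The two worldviews flatly contradict each other as claims about the world, but their recommended-policy sets can still overlap.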
(As an aside: I once heard someone give a definition of “rationalists” as “people who try to form a single coherent worldview”. Feels relevant here.)
Let’s walk through an example a bit more. Eliezer says something like “my model doesn’t have strong predictions about X or Y, except that you can’t have both X and Y at the same time, and that’s what you’d need in order for alignment to be easy”. Then e.g. Rohin comes along and says something like “my model is that Z generally solves most problems most of the time, therefore it will also solve alignment”. Clearly these make incompatible predictions about alignment specifically. But both models say lots of stuff about lots of things other than alignment, and outside of alignment they mostly make predictions about different things, so it’s not easy to test them directly against each other.
The actual strategy I want here is to take both of those constraints and say “here’s one constraint, here’s another constraint, both of them seem to hold across a pretty broad range of situations but they’re making opposite predictions about alignment difficulty, so one of them must not generalize to alignment for some reason”. And if you don’t currently understand which situations the two constraints do and do not generalize to, or where the loopholes are, then that’s the correct epistemic state. It is correct to say “yup, there’s two constraints here which both make lots of correct predictions in mostly-different places and one of them must be wrong in this case but I don’t know which”.
… and this is totally normal! Like, of course people have lots of different heuristics and model-fragments which mostly don’t overlap but do occasionally contradict each other. That’s fine, that’s a very ordinary epistemic state for a human, we have lots of experience with such epistemic states. Humans have lots of intuitive practice with things like “agonize about which of those conflicting heuristics we trust more in this situation” or “look for a policy which satisfies both of these conflicting model-fragments” or “consider the loopholes in each of these heuristics; does one of them have loopholes more likely to apply to this problem?”.
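Here’s a similarly toy sketch of the “there’s a contradiction here and I don’t know which side is wrong” state (again my own illustration; the constraint names and the question string are placeholders): each model-fragment returns “no strong prediction” by default, and when two fragments both speak and disagree, we record the conflict explicitly rather than picking a winner.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Constraint:
    name: str
    verdict: Callable[[str], Optional[bool]]  # None = "no strong prediction here"

constraint_a = Constraint(
    name="can't have X and Y together",
    verdict=lambda q: False if q == "alignment is easy" else None,
)
constraint_b = Constraint(
    name="Z solves most problems",
    verdict=lambda q: True if q == "alignment is easy" else None,
)

def combined_verdict(question: str) -> str:
    verdicts = {c.name: c.verdict(question) for c in (constraint_a, constraint_b)}
    definite = {v for v in verdicts.values() if v is not None}
    if len(definite) > 1:
        # Both constraints speak and they disagree: keep the conflict
        # on the books instead of forcing a resolution.
        return f"unresolved conflict: {verdicts}"
    return str(definite.pop()) if definite else "no strong prediction"

print(combined_verdict("alignment is easy"))
# -> unresolved conflict: {"can't have X and Y together": False, 'Z solves most problems': True}
```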
Of course we still try to make our worldview more coherent over time; a conflict is evidence that something in there is wrong. But throwing away information—whether by abandoning one constraint wholesale, or by black-boxing things in various ways—is not the way to do that. We resolve the contradiction by thinking more about the internal details of how and why and when each model works, not by thinking less about the internal details. And if we don’t want to do the hard work of thinking about how and why and when each model works, then we shouldn’t be trying to force a resolution to the contradiction. Without doing that work, the most accurate epistemic state we can reach is one in which we know there’s a contradiction, we know something is wrong, but we don’t know what’s wrong (and, importantly, we know that we don’t know what’s wrong). Forcing a resolution to the contradiction, without figuring out where the actual problem is, would make our epistemic state less accurate, not more.
I’m curious what part of your comment you think I disagree with. I’m not arguing for “forcing a resolution”, except insofar as you sometimes need to actually make decisions under worldview uncertainty. In fact, “forcing a resolution” by forming “all-things-considered” credences is the thing I’m arguing against in this post.
I also agree that humans have lots of experience weighing up contradictory heuristics and model-fragments. I think all the mechanisms you gave for how humans do this are consistent with the thing I’m advocating. In particular, “choose which heuristics to apply” or “search for a policy consistent with different model-fragments” seem like basically what the policy approach would recommend (e.g. searching for a policy consistent with both the Yudkowsky model-fragment and the Christiano model-fragment). By contrast, I don’t think this is an accurate description of the way most EAs currently think about epistemic deference, which is the concrete example I’m contrasting my approach against.
(My model here is that you see me as missing a mood, like I’m not being sufficiently anti-black-boxes. I also expect that my proposal sounds more extreme than it actually is, because for simplicity I’m focusing on the limiting case of having almost no ability to resolve disagreements between worldviews.)
Huh. Yeah, I got the impression from the post that you wanted to do something like replace “epistemic deference plus black-box tests of predictive accuracy” with “policy deference plus simulated negotiation”. And I do indeed feel like a missing anti-black-box mood is the main important problem with epistemic deference. But it’s plausible we’re just in violent agreement here :).