What is Paul-boxing?
I wish there were a standardized name for Paul Christiano’s proposal that indirect normativity be done using a specific counterfactual human with a computer to aid her. Is there? I’ve heard people calling it Paul-boxing many times, but maybe there is a different one.
https://ordinaryideas.wordpress.com/2012/04/21/indirect-normativity-write-up/
...is, I think, still the reference post.
That post discusses two big ideas. One is putting a human in a box and building a model of their input/output behavior as “the simplest model consistent with the observed input/output behavior.” Nick Bostrom calls this the “crypt,” which is not a very flattering name, but I have no alternative. I think it has been mostly superseded by this kind of thing (and, more explicitly, here), though realistically the box part was never necessary.
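To make the “simplest model consistent with the observed input/output behavior” idea concrete, here is a toy sketch. This is purely illustrative and not from the original post: the candidate models, the `description_length` stand-in, and the example observations are all made up; a real version would search over programs and use a proper complexity measure.

```python
# Toy sketch: among candidate models that reproduce every recorded
# (input, output) pair, pick the one with the shortest description.

def description_length(model) -> int:
    # Stand-in for program length / description complexity.
    return len(model["program"])

def consistent(model, observations) -> bool:
    # A candidate is consistent if it reproduces every observed output.
    return all(model["run"](x) == y for x, y in observations)

def simplest_consistent_model(candidates, observations):
    viable = [m for m in candidates if consistent(m, observations)]
    return min(viable, key=description_length) if viable else None

# Example: two hand-written candidates, only one matches the observations.
observations = [(1, 2), (2, 4), (3, 6)]
candidates = [
    {"program": "lambda x: 2*x", "run": lambda x: 2 * x},
    {"program": "lambda x: x + 1", "run": lambda x: x + 1},
]
print(simplest_consistent_model(candidates, observations)["program"])
# -> lambda x: 2*x
```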
The other part is probably more important, but less colorful: extrapolate by actually seeing what a person would do in a particular “favorable” environment. I have been calling this “explicit” extrapolation.
I’m sorry for never naming this. In my (partial) defense, the actual details have changed so much that it’s not clear exactly what you would want to refer to.
I’m skeptical about the relevance of the fragility of minds here. Yes, if you mess up your simulation slightly it will start to make bad predictions. But that seems to make it easier to specify a person precisely, not harder: the differences in predictions let you quickly rule out alternative models by observing actual people. Indeed, the way in which a human brain is physically represented makes little difference to the kind of predictions someone would make. As another commenter pointed out, if you randomly flip some bits in a computer it will not do anything good, but that has little relevance to your predictions about a computer unless you expect some bits to actually get randomly flipped.
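The argument here is basically just Bayesian elimination. Here is a toy illustration (again my own sketch, with made-up models, a Gaussian noise assumption, and toy observations, none of which come from the original discussion): slightly “messed up” variants of the true behavior make visibly different predictions, so a handful of observations concentrates essentially all of the posterior on the correct model.

```python
# Toy illustration: models that are slightly wrong make different
# predictions, so observations rule them out quickly.
import math

def likelihood(pred, obs, noise=0.1):
    # Gaussian likelihood of the observation given a model's prediction.
    return math.exp(-((obs - pred) ** 2) / (2 * noise ** 2))

# The "true" behavior plus two slightly perturbed variants.
models = {
    "true":        lambda x: 2.0 * x,
    "perturbed_a": lambda x: 2.1 * x,
    "perturbed_b": lambda x: 2.0 * x + 0.5,
}

# Start with a uniform posterior and update on five observations.
posterior = {name: 1.0 / len(models) for name in models}
for x in range(1, 6):
    obs = 2.0 * x  # generated by the true behavior
    for name, model in models.items():
        posterior[name] *= likelihood(model(x), obs)
    total = sum(posterior.values())
    posterior = {name: p / total for name, p in posterior.items()}

print(posterior)  # almost all probability mass ends up on "true"
```

The same logic is why fragility cuts the other way here: the more sensitive the predictions are to errors in the model, the faster the wrong models get eliminated.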
(A more general point: you shouldn’t be specifying values either through a definition of a human mind or through language. You should be using these weak descriptions to give an indirect definition of value, via an edict like “do whatever it is that people want” and recommending actions like “when the time comes, ask people.” (This indirection may be done in hypothetical simulation, in the case of the explicit extrapolation approach.) These targets are much less fragile, so I tend to think that the fragility of value doesn’t bear much on the feasibility of these approaches.)
Replied to here.