I think it might be possible to get somewhere with a model of this type if we formalize the idea that manipulation requires considerable optimization power. For example, we can assume that a random description has a low probability of being manipulative. Or, consider the following stronger assumption: for any algorithm that takes one description as input and produces another description of the same choice as output, if the computing resources used by the algorithm are sufficiently limited, then for most inputs it will not produce a manipulative output.
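To put that stronger assumption slightly more precisely (this is just a sketch of mine, and all the symbols below are introduced purely for illustration): let $D$ be the set of descriptions of the choice, $M \subseteq D$ the manipulative ones, and $\mu$ some suitable measure on $D$ (e.g. uniform over descriptions of a given length). Then the two assumptions read roughly as

$$\mu(M) \leq \delta \qquad \text{and} \qquad \mu\{d \in D : A(d) \in M\} \leq \delta' \ \text{ for every algorithm } A \text{ using at most } R \text{ resources,}$$

for some small $\delta, \delta'$ and a resource bound $R$.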
Those are some of the lines I was thinking along. But it’s not clear whether the peak of the distribution is close to accuracy, human bias and poor understanding being what they are.
I agree that even without manipulation, human reasoning is wildly inaccurate. But perhaps we can use a model where human reasoning asymptotically converges to something accurate unless subjected to some sort of “destructive manipulation,” which is unlikely to happen by chance.
Interesting. How could we formalise that?
The following is one (simplistic) model which might be a useful starting point.
Consider a human and a robot playing a stochastic game like in CIRL. Suppose that each of them is an oracle machine plugged into a reflective oracle, like in the recent paper of Jan, Jessica and Benya. Let the robot have the following prior over the program implemented by the human. The human implements a random program (i.e. a random string of bits for some prefix-free universal oracle machine) conditional on this program being asymptotically optimal in mean for the class of all robot policies that avoid producing some set of “manipulative action sequences.” Here, the “manipulative sequences” can be any set $S$ of action sequences s.t. $\sum_{x \in S} n^{-|x|} < \epsilon$, where $|x|$ is the length of the action sequence $x$, $n$ is the number of possible actions, and $\epsilon$ is a parameter on which the prior depends.
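As a concrete (if trivial) sanity check on this definition, here is a minimal Python sketch of my own, not part of the model itself, that verifies whether a finite candidate set $S$ satisfies the measure bound $\sum_{x \in S} n^{-|x|} < \epsilon$:

```python
def is_valid_manipulative_set(S, n, eps):
    """Check whether a finite set S of action sequences has total measure
    sum_{x in S} n^(-|x|) strictly below the threshold eps.

    S   -- iterable of action sequences (e.g. tuples of action indices)
    n   -- number of possible actions
    eps -- the parameter on which the robot's prior depends
    """
    measure = sum(n ** (-len(x)) for x in S)
    return measure < eps

# Example: with n = 3 actions, the set {(0,), (1, 2)} has measure
# 1/3 + 1/9 = 4/9, so it only qualifies for eps above 4/9.
print(is_valid_manipulative_set({(0,), (1, 2)}, n=3, eps=0.5))  # True
print(is_valid_manipulative_set({(0,), (1, 2)}, n=3, eps=0.4))  # False
```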