If the predictor’s P(diamond looks safe) is higher than the human’s P(diamond looks safe), then it seems like the human simulator will predictably have an anticorrelation between [camera-tampering] and [actually-saving-diamond]. It’s effectively updating on one or the other of them happening more often than it thought, and so conditioned on one it goes back to the (lower) prior for the other. This still seems right to me, despite no explicit dependence on the correlations.
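To make the induced anticorrelation concrete, here is a minimal sketch under assumptions of my own, not a setup taken from the thread: the human prior treats tampering and actually saving the diamond as independent, the diamond looks safe exactly when at least one of them happens, and the human simulator shifts P(diamond looks safe) to the predictor's value with a Jeffrey-style update. The function name and the numbers are hypothetical.

```python
def jeffrey_updated_covariance(p_tamper, p_save, p_looks_safe_new):
    """Cov(tamper, save) after shifting P(looks safe) to p_looks_safe_new.

    Assumptions (illustrative only): tamper and save are independent under
    the human prior, and "looks safe" holds iff at least one of them occurs.
    """
    # Human prior probability that the diamond looks safe.
    q = 1.0 - (1.0 - p_tamper) * (1.0 - p_save)
    # Jeffrey update: rescale all mass inside the "looks safe" event.
    scale = p_looks_safe_new / q
    # Outside "looks safe" neither cause occurred, so only the rescaled
    # branch contributes to these three probabilities.
    new_tamper = scale * p_tamper
    new_save = scale * p_save
    new_both = scale * p_tamper * p_save
    return new_both - new_tamper * new_save


if __name__ == "__main__":
    # Human prior P(looks safe) is 0.28 with these hypothetical numbers.
    for p_new in (0.14, 0.28, 0.56):
        print(p_new, jeffrey_updated_covariance(0.1, 0.2, p_new))
```

With these numbers the covariance is about zero when the predictor agrees with the human prior (0.28), about -0.04 when the predictor's probability is doubled to 0.56, and slightly positive when it is halved. In this toy model the covariance works out to (p/q)(1 - p/q) times P(tamper)P(save), where p is the predictor's probability and q the human's, so the sign depends only on whether p exceeds q.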
I agree the human simulator will predictably have an anticorrelation. But the direct translator might also have an anticorrelation, perhaps a larger one, depending on what reality looks like.
Is the assumption that it’s unlikely that most identifiable X actually imply large anticorrelations?
I do agree that there are examples where the direct translator systematically has anticorrelations and so gets penalized even more than the human simulator. For example, this could happen if there is a consequentialist in the environment who wants it to happen, or if there’s a single big anticorrelation that dominates the sum and happens to go the wrong way.
That said, it at least seems like it should be rare for the direct translator to have a larger anticorrelation (without something funny going on). It should happen only if reality itself is much more anticorrelated than the human expects, by a larger margin than the anticorrelation induced by the update in the human simulator. But on average things should be more anticorrelated than expected about as often as they are more positively correlated than expected (averaging out to ~0), and most pairs probably don’t have any correlation big enough to matter.
Thanks, I consider this fully answered.