The human believes that A’ and B’ are related in a certain way for simple and fundamental reasons.
On the training distribution, all of the functions we are considering reproduce the expected relationship. However, they reproduce it for quite different reasons.
For the intended function, you can verify this relationship by looking only at the link (A --> B) and the coarse-grainings applied to A and B, and checking that the probabilities work out. (That is, you can replace all of the rest of the computational graph with nonsense, or with independent samples, and still get the same relationship.)
For the bad function, you have to look at basically the whole graph. That is, it’s not the case that the human’s beliefs about A’ and B’ have the right relationship for arbitrary Ys; they have the right relationship only for a very particular distribution of Ys. So to see that A’ and B’ are related as expected, we need to simulate the actual underlying dynamics where A --> B, since those dynamics create the correlations in Y that in turn produce the expected correlations between A’ and B’. (The sketch below illustrates the contrast.)
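To make the contrast concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: A is a coin flip, the link A --> B is a noisy copy, Y is a pair of noisy readouts of A and B, and the coarse-grainings are just the identity. The intended reporter reads A’ and B’ off A and B directly, so the expected agreement survives even when Y is replaced with independent noise; the bad reporter reconstructs A’ and B’ from Y, so the agreement only appears when Y is actually generated by the A --> B dynamics.

```python
import random

def sample_world(randomize_y=False):
    """Sample the toy computational graph: A --> B, with observations Y.

    Hypothetical toy model: A is a fair coin, B copies A with 10% noise
    (the link A --> B), and Y = (y1, y2) are noisy readouts of A and B.
    With randomize_y=True, Y is replaced by independent coin flips
    ("nonsense"), leaving the A --> B link itself intact.
    """
    a = random.random() < 0.5
    b = a if random.random() < 0.9 else not a     # the link A --> B
    y1 = a if random.random() < 0.9 else not a    # noisy observation of A
    y2 = b if random.random() < 0.9 else not b    # noisy observation of B
    if randomize_y:
        y1 = random.random() < 0.5
        y2 = random.random() < 0.5
    return a, b, (y1, y2)

def intended_reporter(a, b, y):
    # A' and B' are coarse-grainings of A and B themselves (identity here).
    # The A'-B' relationship flows through the link A --> B alone.
    return a, b

def bad_reporter(a, b, y):
    # A' and B' are each reconstructed from the observations Y.
    # On-distribution this works because Y inherits the A-B correlation.
    y1, y2 = y
    return y1, y2

def agreement(reporter, randomize_y, n=100_000):
    # Estimate P(A' = B'), the stand-in for "the expected relationship".
    hits = sum(
        a_p == b_p
        for a_p, b_p in (reporter(*sample_world(randomize_y)) for _ in range(n))
    )
    return hits / n

for name, rep in [("intended", intended_reporter), ("bad", bad_reporter)]:
    print(name,
          "| real Y:", round(agreement(rep, randomize_y=False), 2),
          "| nonsense Y:", round(agreement(rep, randomize_y=True), 2))
```

Running this, the intended reporter’s agreement rate should stay near 0.9 whether or not Y is randomized, while the bad reporter’s rate should drop from roughly 0.76 to chance (0.5) once Y is severed from the graph: its version of the relationship was a fact about the distribution of Y, not about the link.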
It seems like we believe not only that A’ and B’ are related in a certain way, but that the relationship should hold for simple reasons, and so there’s a real sense in which it’s a bad sign if we need to do a ton of extra compute to verify that relationship. I still don’t have a great handle on that kind of argument. I suspect it won’t ultimately come down to “faster is better,” though as a heuristic that seems to work surprisingly well. This feels a bit more plausible to me as a story for why faster would be better (but only a bit); one way to cash it out is sketched below.
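To gesture at how the “faster is better” heuristic might be operationalized (purely a speculative sketch; the penalty rule and every number below are invented), one could score each candidate function by its fit on the training distribution minus a penalty proportional to the compute needed to verify the expected relationship between A’ and B’:

```python
# Hypothetical scoring rule: penalize each candidate function by the compute
# needed to verify the expected A'-B' relationship (a crude stand-in for a
# speed prior). All numbers below are invented for illustration.
LAMBDA = 0.01  # assumed trade-off between training fit and verification cost

candidates = {
    # name: (log-likelihood on training data, verification cost measured as
    #        graph nodes simulated to check the A'-B' relationship)
    "intended": (-100.0, 2),     # just the link A --> B plus coarse-grainings
    "bad":      (-100.0, 1000),  # essentially the whole graph, including Y
}

def score(loglik, cost):
    return loglik - LAMBDA * cost

best = max(candidates, key=lambda name: score(*candidates[name]))
print(best)  # -> "intended": identical fit, far cheaper verification
```

On this rule the two candidates fit the training distribution equally well, so the verification penalty alone separates them; whether anything like this survives contact with real models is exactly the open question above.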
It’s not always going to be quite this cut and dried: depending on the structure of the human’s beliefs, we may automatically get the desired relationship between A’ and B’. But if that’s the case, then one of the other relationships will be a contingent fact about Y. We can’t reproduce all of the expected relationships for arbitrary Y, since our model presumably makes some substantive predictions about Y, and if those predictions are violated we will break some of our inferences.
This is also a way to think about the proposals in this post and the reply.