Causal structure is an intuitively appealing way to pick out the “intended” translation between an AI’s model of the world and a human’s model. For example, intuitively “There is a dog” causes “There is a barking sound.” If we ask our neural net questions like “Is there a dog?” and it computes its answer by checking “Does a human labeler think there is a dog?” then its answers won’t match the expected causal structure—so maybe we can avoid these kinds of answers.
What does that mean if we apply typical definitions of causality to ML training?
If we define causality in terms of interventions, then this helps iff we have interventions in which the labeler is mistaken. In general, it seems we could just include examples with such interventions in the training set.
Similarly, if we use some kind of closest-possible-world semantics, then we need to be able to train models to answer questions consistently about nearby worlds in which the labeler is mistaken. It’s not clear how to train a system to do that. Probably the easiest approach is to have a human labeler in world X talk about what would happen in some other world Y, where the labeling process is potentially mistaken. (As in “decoupled RL” approaches.) However, in this case it seems liable to learn the “instrumental policy” that asks “What does a human in possible world X think about what would happen in world Y?”, which seems only slightly harder than the original.
We could talk about conditional independencies that we expect to remain robust on new distributions (e.g. in cases where humans are mistaken). I’ll discuss this a bit in a reply.
Here’s an abstract example for thinking about these proposals; it’s just a special case of the example from this post.
Suppose that reality M is described as a causal graph X --> A --> B --> C, and then the observation Y is a function of (A, B, C).
The human’s model M’ of the situation is X --> A’ --> B’ --> C’. Each primed variable is a coarse-graining of the corresponding variable in the real-world model, and the observation Y is still a function of (A’, B’, C’); it’s just a more uncertain function now.
The coarse-grained dynamics (i.e. the human’s model M’) are simpler than the actual coarse-graining f: (A, B, C) --> (A’, B’, C’).
We prepare a dataset by actually sampling (X, A, B, C, Y) from M, having humans look at the results and make inferences about (A’, B’, C’), and thereby getting a dataset of (X, A’, B’, C’, Y) tuples on which to train a model.
The intended question-answering function uses M to sample (A, B, C, Y) and then applies the coarse-graining f to get (A’, B’, C’). But there is also a bad function that produces good answers on the training dataset: use M to sample (A, B, C, Y), then use the human’s model to infer (A’, B’, C’) the way the labelers did, and output those.
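To make the setup concrete, here is a minimal numerical sketch in which every modeling choice is invented for illustration (the linear-Gaussian chain, the sign coarse-graining, and the read-the-signs-off-Y “human inference” are not pinned down by anything above):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Reality M: a chain X --> A --> B --> C (illustrative linear-Gaussian choice).
X = rng.normal(size=N)
A = X + rng.normal(scale=0.5, size=N)
B = A + rng.normal(scale=0.5, size=N)
C = B + rng.normal(scale=0.5, size=N)
# The observation Y is a noisy function of (A, B, C).
Y = np.stack([A, B, C], axis=1) + rng.normal(scale=1.0, size=(N, 3))

# Coarse-graining f: here simply the sign of each variable.
def f(a, b, c):
    return np.sign(a), np.sign(b), np.sign(c)

# "Human inference": the labeler only sees Y and guesses the coarse variables from it.
def human_infer(y):
    return np.sign(y[:, 0]), np.sign(y[:, 1]), np.sign(y[:, 2])

# Training labels are whatever the humans inferred from Y.
labels = np.stack(human_infer(Y), axis=1)

# Intended function: report the actual coarse-grained state f(A, B, C).
intended = np.stack(f(A, B, C), axis=1)
# Bad function: reproduce the human's inference from Y.
bad = np.stack(human_infer(Y), axis=1)

print("intended matches labels:", (intended == labels).mean())  # high, but < 1
print("bad matches labels:     ", (bad == labels).mean())       # exactly 1
```

Because the labels are themselves the output of the human’s inference, the bad function fits the training set at least as well as the intended one (here, exactly), so nothing about training accuracy rules it out.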
We’d like to rule out this bad function by making some kind of assumption about causal structure.
This is also a way to think about the proposals in this post and the reply:
The human believes that A’ and B’ are related in a certain way for simple+fundamental reasons.
On the training distribution, all of the functions we are considering reproduce the expected relationship. However, the reasons they reproduce it are quite different.
For the intended function, you can verify this relationship by looking at the link (A --> B) and the coarse-graining applied to A and B, and verifying that the probabilities work out. (That is, I can replace all of the rest of the computational graph with nonsense, or with independent samples, and get the same relationship.)
For the bad function, you have to look at basically the whole graph. That is, it’s not the case that the human’s beliefs about A’ and B’ have the right relationship for arbitrary Ys, they only have the right relationship for a very particular distribution of Ys. So to see that A’ and B’ have the right relationship, we need to simulate the actual underlying dynamics where A --> B, since that creates the correlations in Y that actually lead to the expected correlations between A’ and B’.
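In the same invented toy model, the locality difference looks roughly like this: the intended function’s A’–B’ relationship can be checked from the single A --> B mechanism plus the coarse-graining, with A’s cause replaced by junk, whereas checking the bad function’s A’–B’ relationship requires simulating the whole chain, the observation Y, and the human inference.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

def ab_link(a):
    """The single causal mechanism A --> B (same illustrative form as above)."""
    return a + rng.normal(scale=0.5, size=a.shape)

# Intended function, local check: feed A a "nonsense" marginal (as if X were
# replaced by junk), push it through the one link, coarse-grain, and the expected
# relationship is already there. Nothing else in the graph was needed.
A_junk = rng.uniform(-3, 3, size=N)
B_junk = ab_link(A_junk)
print("intended, local check: corr(A', B') =",
      np.corrcoef(np.sign(A_junk), np.sign(B_junk))[0, 1])

# Bad function, global check: its A' and B' are the human's guesses from Y, so we
# must simulate the whole chain, generate Y, and run the human inference to see
# the same relationship.
X = rng.normal(size=N)
A = X + rng.normal(scale=0.5, size=N)
B = ab_link(A)
C = B + rng.normal(scale=0.5, size=N)
Y = np.stack([A, B, C], axis=1) + rng.normal(scale=1.0, size=(N, 3))
Ah, Bh = np.sign(Y[:, 0]), np.sign(Y[:, 1])
print("bad, global check:     corr(A', B') =", np.corrcoef(Ah, Bh)[0, 1])
```

The two correlations differ in magnitude only because the bad function’s answers are noisier; the point is what had to be run to see the relationship at all, one link versus the whole graph plus the human.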
It seems like we believe not only that A’ and B’ are related in a certain way, but that the relationship should hold for simple reasons, so there’s a real sense in which it’s a bad sign if we need to do a ton of extra compute to verify that relationship. I still don’t have a great handle on that kind of argument. I suspect it won’t ultimately come down to “faster is better,” though as a heuristic that seems to work surprisingly well. This feels a bit more plausible to me as a story for why faster would be better (but only a bit).
It’s not always going to be quite this cut and dried—depending on the structure of the human’s beliefs, we may automatically get the desired relationship between A’ and B’. But if that’s the case, then one of the other relationships will be a contingent fact about Y—we can’t reproduce all of the expected relationships for arbitrary Y, since our model presumably makes some substantive predictions about Y, and if those predictions are violated we will break some of our inferences.
So are there some facts about conditional independencies that would privilege the intended mapping? Here is one option.
We believe that A’ and C’ should be independent conditioned on B’. One problem is that this isn’t even true, because B’ is a coarse-graining, and so there are in fact correlations between A’ and C’ that the human doesn’t understand. That said, I think that the bad map introduces further conditional correlations, even assuming B = B’. For example, if Y preserves some facts about A’ and C’, and the human’s inferred B’ sometimes differs from the true B, then we will introduce extra correlations between the human’s beliefs about A’ and C’ (conditioned on B’).
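A tiny Monte Carlo of the “even assuming B = B’” case, again with invented numbers: take a binary chain with the identity coarse-graining, so that A and C really are independent given B under the intended map, and let the human read all three variables off a noisy observation. Conditioning on the human’s belief about B then fails to screen the human’s beliefs about A and C off from each other:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000

def flip(v, p):
    """Flip a ±1 variable with probability p."""
    return v * np.where(rng.random(v.shape) < p, -1, 1)

# Binary chain A --> B --> C; the coarse-graining is taken to be the identity here.
A = np.where(rng.random(N) < 0.5, -1, 1)
B = flip(A, 0.1)
C = flip(B, 0.1)
# The observation Y is a noisy readout of each variable; the human's beliefs
# (Ah, Bh, Ch) are just those readouts.
Ah, Bh, Ch = flip(A, 0.2), flip(B, 0.2), flip(C, 0.2)

def corr_given(u, v, cond):
    return np.corrcoef(u[cond], v[cond])[0, 1]

# Intended map: conditioning on the true B screens A off from C
# (zero up to Monte Carlo noise, since this is a Markov chain).
print("corr(A , C  | B  = +1):", corr_given(A, C, B == 1))
# Bad map: conditioning on the human's belief about B does not screen off the
# human's beliefs about A and C; an extra conditional correlation appears.
print("corr(A', C' | B' = +1):", corr_given(Ah, Ch, Bh == 1))
```

So even with a perfect coarse-graining of B, the bad map’s answers pick up an extra conditional correlation that the intended map lacks.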
I think it’s pretty plausible that there are necessarily some “new” correlations in any case where the human’s inference is imperfect, but I’d like to understand that better.
So I think the biggest problem is that none of the human’s believed conditional independencies actually hold—they are at best approximate, and (more problematically) they may themselves only hold “on distribution” in some appropriate sense.
This problem seems pretty approachable though and so I’m excited to spend some time thinking about it.
Actually, if A --> B --> C and I observe some function of (A, B, C), it’s just not generally the case that my beliefs about A and C are conditionally independent given my beliefs about B (e.g. suppose I observe A+C). This just makes it even easier to avoid the bad function in this case, but it means I want to be more careful about the definition of the case, to ensure that it’s actually difficult, before concluding that this kind of conditional independence structure is potentially useful.
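For instance, in a Gaussian chain like the one in the earlier sketches (with X folded into A for brevity), suppose the only observation is S = A + C. Restricting samples to a narrow bin of S approximates the observer’s posterior, and even after further conditioning on B the beliefs about A and C remain strongly (anti-)correlated:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2_000_000

# Chain A --> B --> C (X folded into A for brevity); the observation is S = A + C.
A = rng.normal(size=N)
B = A + rng.normal(scale=0.5, size=N)
C = B + rng.normal(scale=0.5, size=N)
S = A + C

# Rejection-sample the posterior: keep draws whose observation S lands in a narrow
# bin, and additionally condition on B landing in a narrow bin.
sl = (np.abs(S - 1.0) < 0.05) & (np.abs(B - 0.5) < 0.05)
print("samples in slice:", sl.sum())
print("corr(A, C | B ~ 0.5, S ~ 1.0):", np.corrcoef(A[sl], C[sl])[0, 1])  # close to -1
```

The observation ties A and C together directly (C is roughly S - A), so no amount of conditioning on B restores the independence.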
Sometimes we figure out conditional (in)dependence by looking at the data. It may not match common-sense intuition, but if a model that takes it into account gives better results, then the conditional independence stays in the model. You can only work with what you have; a lack of attributes may force you to rely on other dependencies for better predictions.
Conditional probabilities should be reflected in the data, given enough data points. When you introduce human labeling into the equation, you are adding another source of uncertainty: the accuracy of the human doing the labeling, regardless of whether the inaccuracy comes from their own false sense of conditional independence. Usually human labeling doesn’t directly take any conditional probabilities into account, so as not to mess with the conditionals that already exist within the data set. That’s why the more data the better, which also means the more labelers you have, the less dependent you are on the inaccuracy of any individual human.
The conditional probability assumed in the real world carries over to the data-representation world simply because it’s trying to model the same phenomenon in the real world, despite its coarse-grained nature. Without the conditional probability, we wouldn’t be able to make the same strong inferences that match up to the real world. The causality is part of the data. If you use a different causal relationship, the end model would be different, and you would be solving a very different problem than if you applied the real-world causal relationship.