On page 22 of Probabilistic Reasoning in Intelligent Systems, Pearl writes:
Raw experiential data is not amenable to reasoning activities such as prediction and planning; these require that data be abstracted into a representation with a coarser grain. Probabilities are summaries of details lost in this abstraction...
An agent observes a sequence of images displaying either a red or a blue ball. The balls are drawn according to some deterministic rule that depends only on the time step. Reasoning directly from the experiential data leads to ~Solomonoff induction. What might Pearl’s “coarser grain” look like for a real agent?
Imagine an RNN trained with gradient descent and a binary cross-entropy loss (“given the data so far, did it correctly predict the next draw?”), and suppose the learned predictive accuracy is good. How might this happen?
1) The network learns to classify whether the most recent input image contains a red or blue ball, for instrumental predictive reasons, and
2) A recurrent state records salient information about the observed sequence, which could be arbitrarily long. The RNN + learned weights form a low-complexity function approximator in the space of functions on arbitrary-length sequences. My impression is that gradient descent has simplicity as an inductive bias (cf. the double descent debate). A minimal sketch of this setup follows below.
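To make the two points above concrete, here is a minimal sketch in PyTorch. It assumes a toy deterministic rule (the ball is blue exactly when the time step is divisible by 3) and stands in 2-d one-hot color features for the images; the rule, names, and hyperparameters are mine, chosen only for illustration. Note that this toy version trivializes point 1): the color label is handed to the network directly, whereas a real agent would need something like a conv encoder to extract it from the image.

```python
# Minimal sketch (PyTorch), assuming a toy deterministic rule of the time step.
import torch
import torch.nn as nn

def ball_color(t: int) -> int:
    """Assumed rule for illustration: 1 = blue when t is divisible by 3, else 0 = red."""
    return int(t % 3 == 0)

class NextColorRNN(nn.Module):
    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # logit for "next ball is blue"

    def forward(self, colors: torch.Tensor) -> torch.Tensor:
        # colors: (batch, T) of 0/1 labels; one-hot them as stand-ins for image features
        x = nn.functional.one_hot(colors.long(), num_classes=2).float()
        h, _ = self.rnn(x)               # recurrent state summarizes the sequence so far
        return self.head(h).squeeze(-1)  # (batch, T) logits for the next color

model = NextColorRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

T = 64
seq = torch.tensor([[ball_color(t) for t in range(T + 1)]]).float()
inputs, targets = seq[:, :-1], seq[:, 1:]  # predict the color at t+1 from colors up to t

for step in range(500):
    opt.zero_grad()
    logits = model(inputs)
    loss = loss_fn(logits, targets)  # "given the data so far, did it predict the next draw?"
    loss.backward()
    opt.step()
```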
Being an approximation of some function over arbitrary-length sequences, the network outputs a prediction for the next color, a specific feature of the next image in the sequence. Can this prediction be viewed as nontrivially probabilistic? In other words, could we use the output to learn about the network’s “beliefs” over hypotheses which generate the sequence of balls?
The RNN probably isn’t approximating the true (deterministic) hypothesis which explains the sequence of balls. Since it’s trained to minimize cross-entropy loss, it learns to hedge, which in effect makes it approximate the prediction of a distribution over hypotheses. This implicitly defines its “posterior probability distribution”.
Under this interpretation, the output is just the measure of hypotheses predicting blue versus the measure predicting red.
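To see why the output can be read this way, here is a toy calculation with a hypothesis class and prior that I am inventing purely for illustration. With deterministic hypotheses, the observed prefix either refutes a hypothesis or leaves it untouched, so the posterior is just the renormalized prior on the surviving hypotheses, and the cross-entropy-optimal output at the next step is the posterior measure of hypotheses that say “blue”.

```python
# Toy illustration of the "measure of hypotheses" reading. The hypothesis
# class (deterministic rules h(t) -> color, 1 = blue, 0 = red) and the
# simplicity-weighted prior are assumptions made for this example.
import numpy as np

hypotheses = [
    (lambda t: int(t % 2 == 0)),  # blue on even steps
    (lambda t: int(t % 3 == 0)),  # blue every third step
    (lambda t: int(t % 4 == 0)),  # blue every fourth step
    (lambda t: 1),                # always blue
]
prior = np.array([0.4, 0.3, 0.2, 0.1])  # assumed simplicity weighting

observed = [1, 0, 0]  # a prefix of the sequence

# A deterministic hypothesis is either consistent with the prefix (likelihood 1)
# or refuted by it (likelihood 0), so the posterior is the renormalized prior
# restricted to the consistent hypotheses.
consistent = np.array([
    all(h(t) == c for t, c in enumerate(observed)) for h in hypotheses
], dtype=float)
posterior = prior * consistent
posterior /= posterior.sum()

# The cross-entropy-optimal output for the next step is the posterior measure
# of hypotheses that predict blue at t = len(observed).
t_next = len(observed)
predicts_blue = np.array([h(t_next) for h in hypotheses], dtype=float)
p_blue = float(posterior @ predicts_blue)
print(p_blue)  # 0.6 here: two hypotheses survive and they disagree about the next step
```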
In particular, the coarser grain is what I mentioned in 1): beliefs are easier to manage with respect to a fixed featurization of the observation space.
This relates only to the first part of your post, but I suspect Pearl!2020 would say the coarse-grained model should be some sort of causal model on which we can do counterfactual reasoning.