Since I’m not there yet, if I just take an intuitive amateur guess at how I might expect this to work, it seems pretty plausible that we’re going to want the category node to be especially sensitive to cheap-to-observe features that correlate with goal-relevant features? Like, yes, we ultimately just want to know as much as possible about the decision-relevant variables, but if some observations are more expensive to make than others, that seems like the sort of thing the network should be able to take into account, right?
I think the mathematically correct thing here is to use something like the expectation maximization algorithm. Let’s say you have a dataset that is a list of elements, each of which has some subset of its attributes known to you, and the others unknown. EM does the following:
1. Start with some parameters (parameters tell you things like what the cluster means/covariance matrices are; it’s different depending on the probabilistic model)
2. Use your parameters, plus the observed variables, to infer the unobserved variables (and cluster assignments) and put Bayesian distributions over them
3. Do something mathematically equivalent to generating a bunch of “virtual” datasets by sampling the unobserved variables from these distributions, then setting the parameters to assign high probability to the union of these virtual datasets (EM isn’t usually described this way, but it’s easier to think about IMO)
4. Repeat starting from step 2
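As a rough sketch of what this loop looks like in practice (a toy model I made up for illustration: two spherical unit-variance Gaussian clusters in 2-D, equal priors, with one "hard-to-observe" feature randomly missing; all variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two 2-D Gaussian clusters; feature 1 plays the role of the
# "hard-to-observe" attribute and is missing (NaN) in about half the rows.
n = 200
z = rng.integers(0, 2, n)
true_means = np.array([[0.0, 0.0], [5.0, 5.0]])
x = true_means[z] + rng.normal(0.0, 1.0, (n, 2))
x[rng.random(n) < 0.5, 1] = np.nan  # hide the hard-to-observe feature

def em_step(x, means, var=1.0):
    """One EM iteration for a 2-cluster spherical Gaussian mixture
    with missing attributes (NaN). Step 2: infer distributions over
    the unobserved variables; step 3: refit parameters, here done in
    expectation rather than by literally sampling virtual datasets."""
    n_rows, k = x.shape[0], means.shape[0]
    # E-step: cluster responsibilities from the *observed* coordinates only
    # (a missing coordinate contributes nothing to the log-likelihood).
    log_r = np.zeros((n_rows, k))
    for j in range(k):
        diff = x - means[j]
        diff = np.where(np.isnan(diff), 0.0, diff)
        log_r[:, j] = -0.5 * (diff ** 2).sum(axis=1) / var
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: replace each missing value with its cluster-conditional mean
    # (the expected value over the "virtual datasets"), then take
    # responsibility-weighted averages to get the new cluster means.
    new_means = np.zeros_like(means)
    for j in range(k):
        filled = np.where(np.isnan(x), means[j], x)
        new_means[j] = (r[:, [j]] * filled).sum(axis=0) / r[:, j].sum()
    return new_means

means = np.array([[1.0, 1.0], [4.0, 4.0]])  # rough initialization
for _ in range(20):
    means = em_step(x, means)
```

Note that the update treats the observed and imputed values identically once the expectations are taken, which is exactly the "no special importance to observed features" property discussed below.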
This doesn’t assign any special importance to observed features. Since step 3 is just a function of the virtual datasets (not taking into account additional info about which variables are easy to observe), the parameter updates are going to take all the features, observable or not, into account. However, the hard-to-observe features are going to carry more uncertainty, which affects the virtual datasets. With enough data, this shouldn’t matter that much, but the argument for this is a little complicated.
Another way to solve this problem (which is easier to reason about) is by fully observing a sufficiently large number of samples. Then there isn’t a need for EM; you can just do clustering (or whatever other parameter fitting) on the dataset (clustering can be framed in terms of EM, but doesn’t have to be). Of course, this assigns no special importance to easy-to-observe features. (After learning the parameters, we can use them to infer the unobserved variables probabilistically.)
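A minimal sketch of this alternative, under the same toy assumptions as before (two unit-variance clusters, equal priors; the specific numbers and names are mine): fit on fully observed samples with plain k-means, then use the learned parameters to infer a missing attribute probabilistically.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fully observed training samples from two well-separated clusters.
true_means = np.array([[0.0, 0.0], [5.0, 5.0]])
z = rng.integers(0, 2, 300)
train = true_means[z] + rng.normal(0.0, 1.0, (300, 2))

# Plain k-means: no EM over missing values needed once everything is
# observed. Initialize with the extreme points along the first axis so
# each center starts in a different cluster.
means = train[[train[:, 0].argmin(), train[:, 0].argmax()]].copy()
for _ in range(20):
    d = ((train[:, None, :] - means[None]) ** 2).sum(axis=2)
    assign = d.argmin(axis=1)
    means = np.array([train[assign == j].mean(axis=0) for j in range(2)])

# Inference for a new, partially observed point: feature 1 unobserved.
x_obs = np.array([4.8, np.nan])
# Posterior over clusters from the observed coordinate (unit variance,
# equal priors), then the posterior mean of the missing coordinate.
logp = -0.5 * (x_obs[0] - means[:, 0]) ** 2
p = np.exp(logp - logp.max())
p /= p.sum()
x_missing = p @ means[:, 1]
```

The parameter fitting never looks at which features are easy to observe; observability only enters afterward, at inference time, when we condition on whatever happened to be observed.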
Philosophically, “functions of easily-observed features” seem more like percepts than concepts (this post describes the distinction). These are still useful, and neural nets are automatically going to learn high-level percepts (i.e. functions of observed features), since that’s what the intermediate layers are optimized for. However, a Bayesian inference method isn’t going to assign special importance to observed features, as it treats the observations as causally downstream of the ontological reality rather than causally upstream of it.