Feed the outputs of all these heuristics into the inputs of region W. Loosely couple region W to the rest of your world model. Region W will eventually learn to trigger in response to the abstract concept of a woman. Region W will even draw on other information in the broader world model when deciding whether to fire.
I am not saying the theory is wrong, but I was reading about something similar before, and I still don’t understand why such a system (“region W” in this case) would learn something more general than the basic heuristics connected to it. It seems like it would incur less surprise if it just copy-pasted the behavior of its inputs.
The first explanation that comes to mind is: “it would work better and have less surprise as a whole, since other regions could use region W’s output for their predictions.” But again, I don’t think I understand that. Region W doesn’t “know” about other regions’ surprise rates and hence cannot care about them, so why would it learn something more general, and thus contradictory to the heuristics in some cases?
A predictive processor doesn’t just minimize prediction error. It minimizes free energy. Prediction error is only half of the free energy equation. Here is a partial explanation that I hope will begin to help clear up some confusion.
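As a gloss (this is the standard variational decomposition, which I’m assuming is the formulation meant here; the exchange itself doesn’t write it out): free energy splits into a complexity term plus an inaccuracy term, and “prediction error” corresponds only to the second one.

```latex
% Standard variational free energy decomposition (assumed formulation):
% complexity (divergence of beliefs from the prior) plus inaccuracy
% (expected prediction error). Minimizing F trades the two off.
F = \underbrace{D_{\mathrm{KL}}\big[\,q(z)\,\|\,p(z)\,\big]}_{\text{complexity}}
    \;\underbrace{-\;\mathbb{E}_{q(z)}\big[\ln p(x \mid z)\big]}_{\text{inaccuracy (prediction error)}}
```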
Consider the simplest example. All the heuristics agree. Then region W will just copy and paste the inputs.
But now suppose the heuristics disagree. Suppose 7 heuristics output a value of 1 and 2 heuristics output a value of −1. Region W could just copy the heuristics, but if it did then 2⁄9 of the region would have a value of −1 and 7⁄9 of the region would have a value of 1. A small region holding two contradictory values has high free energy. If the region is tightly coupled to itself, it will instead round the whole thing to 1.
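Here is a minimal sketch of that rounding behavior, assuming we model region W as a small Hopfield-style network with all-to-all positive internal coupling (my choice of model for illustration, not anything from the original exchange). The nine units start at the heuristics’ outputs and settle into the least-surprising joint state:

```python
import numpy as np

# Hypothetical model: "region W" as 9 tightly self-coupled units, each
# initialized to one heuristic's output (7 vote +1, 2 vote -1).
state = np.array([1] * 7 + [-1] * 2, dtype=float)

# Tight internal coupling: symmetric all-to-all positive weights,
# with no self-connection.
n = len(state)
weights = np.ones((n, n)) - np.eye(n)

for _ in range(10):
    # Each unit snaps to the sign of its neighbors' summed activity,
    # pulling the region toward a low-energy (least-surprising) consensus.
    new_state = np.sign(weights @ state)
    if np.array_equal(new_state, state):
        break  # settled into a fixed point
    state = new_state

print(state)  # all +1: the region "rounds" the 7-vs-2 disagreement to 1
```

After a single update every unit reads +1, which is exactly the “round the whole thing to 1” step described above: the two dissenting units flip because agreeing with the majority costs less free energy than preserving the contradiction.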