Finding the variables
In a previous post on bridging syntax and semantics, I mentioned how to empirically establish that the internal symbols Xi represented the variables xi in the environment: if the Xi have high mutual information with the xi. This basically asks whether you can find out the values of the xi by knowing the Xi. See also Luke Muehlhauser’s mention of “representation” and the articles linked therein.
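As a rough illustration of that empirical check, here is a minimal sketch in Python (the binary variables and the 90% accuracy figure are invented for the example) that estimates the mutual information between paired samples of an internal variable and an environment variable:

```python
import numpy as np
from collections import Counter

def empirical_mutual_information(X_samples, x_samples):
    """Estimate I(X; x) in bits from paired samples of two discrete variables."""
    n = len(X_samples)
    joint = Counter(zip(X_samples, x_samples))
    marg_X = Counter(X_samples)
    marg_x = Counter(x_samples)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * np.log2(p_ab / ((marg_X[a] / n) * (marg_x[b] / n)))
    return mi

# Toy example: a binary environment variable x, and an internal flag X
# that tracks it correctly 90% of the time.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)
X = np.where(rng.random(10_000) < 0.9, x, 1 - x)

print(empirical_mutual_information(X.tolist(), x.tolist()))  # roughly 0.5 bits
```

If knowing X tells you nothing about x, the estimate will be close to zero bits; the better the internal variable tracks the external one, the closer it gets to the entropy of x.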
At the end of that post, I mentioned the problem of finding the variables Xi if they were not given. This post will briefly look over that problem, and the related problem of finding the xi.
Waterfall and variables in the world
Given the internal variable Xi, it is almost certainly possible to find a variable y in the outside world that correlates with it (even if we assume a Cartesian separation between the agent and the world, so we can’t just do the lazy thing and set y=Xi).
In the example of detecting an intruder in a greenhouse, look at Xg, the internal variable of a guard who peers into the greenhouse to check for an intruder.
Then we can certainly come up with a variable y that correlates with Xg. This could be a variable that correlates with whether there is an intruder in the greenhouse in situations where the guard can see it, and then correlates with all the issues that might fool the guard: mannequins, delusion-inducing gases, intruders disguised as tables, etc...
But we don’t even need y to be anything like the variables that Xg was ‘supposed’ to measure. If we have a chaotic system in the vicinity—say a nearby waterfall—then we can just list all the states of that system that happen when Xg=0 vs those that happen when Xg=1, and set y to be 0 or 1 in those states.
That is a variant of Scott Aaronson’s waterfall argument: if you have a large enough variety of states, and you can construct definitions of arbitrary complexity, then you can “ground” any model in these definitions. To avoid this, we have to penalise this definitional complexity: the definition is doing all the work here, and is itself a highly complicated algorithm to implement.
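To make the waterfall point concrete, here is a toy sketch (the “waterfall” is just a list of distinct micro-states, and the crude count of “entries needed to define the variable” is my own stand-in for definitional complexity): a lookup table over the waterfall’s states matches Xg perfectly, but the table that defines y is as large as the data itself, so any reasonable complexity penalty rules it out.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1_000

# The guard's internal variable over time, and the intruder variable it is
# 'supposed' to track (the guard is right 90% of the time).
x_intruder = rng.integers(0, 2, size=T)
Xg = np.where(rng.random(T) < 0.9, x_intruder, 1 - x_intruder)

# A stand-in 'waterfall': T distinct micro-states, visited in an arbitrary order.
waterfall_state = rng.permutation(T)

# The waterfall-based variable y: a giant lookup table recording, for each
# micro-state, whatever value Xg happened to take at that moment.
lookup = {int(s): int(b) for s, b in zip(waterfall_state, Xg)}
y = np.array([lookup[int(s)] for s in waterfall_state])

print("y matches Xg exactly:", bool((y == Xg).all()))             # True by construction
print("intruder matches Xg: ", float((x_intruder == Xg).mean()))  # about 0.9

# A crude proxy for definitional complexity: how much you have to write down
# to define each candidate variable.
print("entries defining y:  ", len(lookup))  # grows with the data: T entries
print("entries defining x:  ", 1)            # "is there an intruder in the greenhouse?"
```

The intruder variable, by contrast, has a short definition that makes sense without ever looking at Xg.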
So pick the xi so that:
the complexity of defining the xi is low, and
the xi have intrinsically relevant definitions, definitions that make sense without direct or indirect knowledge of Xi.
There are some edge cases, of course. If a human’s Xi is their estimate of whether a swan is around, it might be useful to distinguish between xi={there is a swan} and x′i={there is a white swan}, as this tells us whether the human was conceptualising black swans as swans. But in general, the xi should be defined by concepts that make sense on their own and don’t take Xi into account.
Variables in the mind
Now assume that the xi are something reasonable. What of the Xi? Well, imagine a superintelligence that had access to an agent’s entire sensory input. If the superintelligence had a decent world model, it could use that input to construct a best estimate of the value of xi, and call that estimate, which is a function of the agent’s internal state, Xi. Even if we limited the superintelligence to accessing only some parts of the agent (maybe just the short-term memory, or the conscious states), it could still construct an Xi that would likely be a far better correlate of xi than anything the agent naturally has access to or could construct.
For example, if xi were temperature (as in this post), then an AI could deduce temperature information from human sensory data much better than our subjective “it feels kinda hot/cold in here”.
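As a toy illustration of why raw correlation would over-select such decoded variables, here is a sketch with made-up numbers: a least-squares decoder with access to all the (simulated) sensory channels tracks the true temperature far better than the agent’s own crude “feels hot” flag.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical setup: a true ambient temperature, plus fifty noisy 'sensory'
# channels that each carry a little information about it.
temperature = rng.normal(20.0, 5.0, size=n)                        # the external x
sensory = temperature[:, None] + rng.normal(0.0, 8.0, size=(n, 50))

# The agent's own crude internal variable: a binary "feels hot" flag,
# based on only a few of those channels.
feels_hot = (sensory[:, :3].mean(axis=1) > 22.0).astype(float)

# An outside decoder with access to *all* the sensory data: a least-squares fit.
A = np.column_stack([sensory, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, temperature, rcond=None)
decoded = A @ coef

print("corr(feels hot, temperature):", np.corrcoef(feels_hot, temperature)[0, 1])
print("corr(decoded,   temperature):", np.corrcoef(decoded, temperature)[0, 1])
```

If we picked Xi purely by correlation with xi, we would pick the decoder’s output rather than anything the agent itself uses, which is exactly the failure mode to avoid.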
So the Xi should be selected according to criteria other than correlation with xi. For algorithms, we could look at named variables within them. For humans, we could also look at variables that correspond to names or labels (for example, when you ask a human “are you feeling hot?”, which parts of the brain are triggered when the question is asked, and which parts correspond to the articulated answer being “yes”).
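For the algorithmic case, a minimal illustration (the thermostat and all its variable names are invented for the example): the natural candidate for Xi is the variable the program itself names and acts on, not whatever an outside decoder could reconstruct from its full memory state.

```python
# A toy thermostat. The natural candidate for an internal variable Xi here is the
# *named* state the program itself maintains and acts on: `estimated_temp`.
class Thermostat:
    def __init__(self, setpoint: float):
        self.setpoint = setpoint
        self.estimated_temp = None   # the named internal variable: a candidate Xi

    def update(self, sensor_reading: float) -> bool:
        # Exponential smoothing of the raw sensor reading.
        if self.estimated_temp is None:
            self.estimated_temp = sensor_reading
        else:
            self.estimated_temp = 0.8 * self.estimated_temp + 0.2 * sensor_reading
        return self.estimated_temp > self.setpoint   # "too hot": switch cooling on

thermostat = Thermostat(setpoint=24.0)
for reading in [22.1, 25.3, 26.0, 23.8]:
    cooling_on = thermostat.update(reading)
print(thermostat.estimated_temp, cooling_on)
```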
Unless we are specifically interested in speech acts, we can’t just say “Xi corresponds to the human answering ‘yes’ when asked how hot they feel”. Nevertheless, when attempting to define a “feeling of hotness” variable, we should define it with all our knowledge (and the human’s knowledge) of what that means: for example, the fact that humans often answer ‘yes’ to that question when they do indeed feel hot.
So the Xi should be defined by taking some concept and seeking to formalise how humans use it/implement it, not by correlating it with the xi.
We can sometimes justify a more correlated Xi, if the concept is natural for the human in question. For example, we could take a human and train them to estimate temperature. After a while, they will develop an internal temperature estimator X′′i which is more highly correlated with the temperature xi, but which still corresponds naturally to something the human can consciously access; we could check this by, for example, getting the human to write down their temperature estimate.
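That check could be as simple as the following sketch (the paired readings are invented): compare the written-down estimates against thermometer readings and see how strongly they track each other.

```python
import numpy as np

# Hypothetical paired data: thermometer readings (the xi) and the trained
# human's written-down estimates (the X''i) for the same moments.
thermometer = np.array([18.2, 21.5, 24.0, 27.3, 30.1, 16.8, 22.9, 25.5])
written_estimate = np.array([18.0, 22.0, 23.5, 28.0, 29.0, 17.5, 22.0, 26.0])

# Pearson correlation as a crude measure of how well the estimates track reality.
r = np.corrcoef(written_estimate, thermometer)[0, 1]
print(f"correlation between written estimates and thermometer: {r:.2f}")
```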
We can also imagine the variable X′i, an untrained human’s explicit estimate of temperature; we’d expect this to be a bit better than Xi, just because the human can explicitly take into account things like fever or temperature acclimatisation. But it’s not clear whether X′i is really an intrinsic variable in the brain, or just something the human constructs specifically to answer that question in the moment.
Things can get more murky if we allow for unconscious feelings. Suppose someone has a relatively accurate gut instinct as to whether other people are trustworthy, but barely makes use of that instinct consciously. Then it’s tricky to decide whether that instinct is a natural internal variable (which is highly correlated with trustworthiness), or an input into the human’s conscious estimate (which is weakly correlated with trustworthiness).
Investigation, not optimisation
So this method is well suited to checking the correlations between internal variables and external ones, variables that we have defined through some other process. It can answer questions like:
“Is a human’s subjective feeling of heat a good estimate of temperature?” (not really).
“Is a trained human’s temperature guess a good estimate of temperature?” (somewhat).
“Is a human’s subjective feeling of there being someone else in the room a good estimate of the presence of an intruder?” (yes, very much so).
“Does this brain activity mean that the human detects an intruder?” (possibly).
But it all falls apart if we try to use the correlation as an optimisation measure, shifting Xi to better measure xi, or vice versa.