Defining Optimization in a Deeper Way Part 2

We have successfully eliminated the concepts of null actions and nonexistence from our definition of optimization. We have also eliminated the concept of repeated action. We are halfway there, and now have to eliminate uncertainty and absolute time. Then we will have achieved the goal of being able to wrap a 3D hypersurface boundary around a 4D chunk of relativistic spacetime and ask ourselves “Is this an optimizer?” in a meaningful way.
I’m going to tackle uncertainty next.
TL;DR I have allowed for a mor
We’ve already defined, for a deterministic system, that a joint probability distribution $P^t_{AB}(s_A, s_B)$ has a numerical optimizing-ness, in terms of entropy. Now I want to extend that to a non-joint probability distribution of the form $P_A(s_A)P_B(s_B)$. We can do this by defining $P^{t-1}_A(s_A)$ and $P^{t-1}_B(s_B)$ for the previous timestep.
We can then define $P^t_{AB}(s_A, s_B)$ by stepping forwards from $t-1$ to $t$ as before, according to the dynamics of the system.
A question we might want to ask is: for a given $P^{t-1}_A(s_A)$ and $P^{t-1}_B(s_B)$, how “optimizing” is the distribution $P^t_{AB}(s_A, s_B)$?
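In code, this setup might look like the following minimal sketch. It assumes finite state spaces with states represented as `(s_A, s_B)` pairs; `step` and `decorrelate` are my own hypothetical helper names, not anything defined in the original posts.

```python
def step(dist, dynamics):
    """Push a distribution forward one timestep through
    deterministic dynamics (a map state -> next state)."""
    out = {}
    for state, p in dist.items():
        nxt = dynamics[state]
        out[nxt] = out.get(nxt, 0.0) + p
    return out

def decorrelate(dist):
    """Replace a joint distribution over (s_A, s_B) with the
    product of its marginals, P_A(s_A) * P_B(s_B)."""
    p_a, p_b = {}, {}
    for (a, b), p in dist.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}
```

Starting from the product $P^{t-1}_A P^{t-1}_B$, repeated calls to `step` give the joint distributions at later times, and `decorrelate` gives the comparison distribution we need.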
The Dumb Thermostat
Let’s apply our new idea to the previous models, the two thermostats. Let’s begin with uncorrelated, maximum-entropy distributions.
For thermostat 1 we have the dynamic matrix:
| | hot | warm | cold |
|---|---|---|---|
| high | (hot, off) | (hot, low) | (warm, high) |
| low | (hot, off) | (warm, low) | (cold, high) |
| off | (warm, off) | (cold, low) | (cold, high) |

(In this matrix, the entry for a cell represents the state at time $t+1$, given that the coordinates of that cell represent the state at time $t$. Columns are the room temperature and rows are the thermostat setting.)
With the $P^{t-1}_{RT}$ distribution:

| | hot | warm | cold |
|---|---|---|---|
| high | 1/9 | 1/9 | 1/9 |
| low | 1/9 | 1/9 | 1/9 |
| off | 1/9 | 1/9 | 1/9 |

As an aside, this has 3.2 bits of entropy.
Leading to the $P^t_{RT}$ distribution:

| | hot | warm | cold |
|---|---|---|---|
| high | 0 | 1/9 | 2/9 |
| low | 1/9 | 1/9 | 1/9 |
| off | 2/9 | 1/9 | 0 |
This gives us the “standard” $P^{t+1}_{RT}$ distribution of:

| | hot | warm | cold |
|---|---|---|---|
| high | 0 | 2/9 | 1/9 |
| low | 1/9 | 1/9 | 1/9 |
| off | 1/9 | 2/9 | 0 |
And the “decorrelated” $P'^{t+1}_{RT}$ distribution is actually just the same as $P^t_{RT}$! When we decorrelate the probabilities for $s_R$ and $s_T$ we just get back to the maximum-entropy distribution, and so $P'^{t+1}_{RT} = P^t_{RT}$.
It’s clear by inspection that the distributions $P^{t+1}_{RT}$ and $P'^{t+1}_{RT}$ have the same entropy, so the decorrelated maximum-entropy $P^{t-1}_R P^{t-1}_T$ does not produce an “optimizing” distribution at $P^t_{RT}$.
If we actually consider the dynamics of this system, we can see that this makes sense! The system either stays at (warm, low) or falls into the cycle:
(hot, low) → (hot, off) → (warm, off) → (cold, low) → (cold, high) → (warm, high) → (hot, low)
So there’s no compression of futures into a smaller number of trajectories.
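We can confirm this numerically. The following is a minimal sketch (my own helper code, not from the original post) that pushes the uniform distribution through the dumb thermostat’s dynamics and compares the two entropies:

```python
from math import log2

# Dumb thermostat dynamics from the matrix above: (room, thermostat) -> next state.
DYN = {
    ("hot", "high"): ("hot", "off"), ("warm", "high"): ("hot", "low"),
    ("cold", "high"): ("warm", "high"),
    ("hot", "low"): ("hot", "off"), ("warm", "low"): ("warm", "low"),
    ("cold", "low"): ("cold", "high"),
    ("hot", "off"): ("warm", "off"), ("warm", "off"): ("cold", "low"),
    ("cold", "off"): ("cold", "high"),
}

def step(dist):
    """One deterministic timestep applied to a distribution."""
    out = {}
    for s, p in dist.items():
        out[DYN[s]] = out.get(DYN[s], 0.0) + p
    return out

def decorrelate(dist):
    """Product of the marginals over s_R and s_T."""
    p_r, p_t = {}, {}
    for (r, t), p in dist.items():
        p_r[r] = p_r.get(r, 0.0) + p
        p_t[t] = p_t.get(t, 0.0) + p
    return {(r, t): p_r[r] * p_t[t] for r in p_r for t in p_t}

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

uniform = {s: 1 / 9 for s in DYN}        # P^{t-1}_RT
p_t = step(uniform)                      # P^t_RT
p_t1 = step(p_t)                         # "standard" P^{t+1}_RT
p_t1_dec = step(decorrelate(p_t))        # "decorrelated" P'^{t+1}_RT
print(round(entropy(p_t1), 2), round(entropy(p_t1_dec), 2))  # 2.73 2.73
```

Both entropies come out equal (about 2.73 bits), matching the claim that the dumb thermostat shows no optimizing-ness here.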
The Smart Thermostat
What about our “smarter” thermostat? This one has the dynamic matrix:
| | hot | warm | cold |
|---|---|---|---|
| high | (hot, off) | (hot, low) | (warm, low) |
| low | (hot, off) | (warm, low) | (cold, high) |
| off | (warm, low) | (cold, low) | (cold, high) |
Well now our $P^t_{RT}$ distribution looks like this:

| | hot | warm | cold |
|---|---|---|---|
| high | 0 | 0 | 2/9 |
| low | 1/9 | 1/3 | 1/9 |
| off | 2/9 | 0 | 0 |
Giving a “standard” $P^{t+1}_{RT}$ of:

| | hot | warm | cold |
|---|---|---|---|
| high | 0 | 0 | 1/9 |
| low | 0 | 7/9 | 0 |
| off | 1/9 | 0 | 0 |
And a “decorrelated” $P'^{t}_{RT}$ of:

| | hot | warm | cold |
|---|---|---|---|
| high | 2/27 | 2/27 | 2/27 |
| low | 5/27 | 5/27 | 5/27 |
| off | 2/27 | 2/27 | 2/27 |
Giving the decorrelated $P'^{t+1}_{RT}$:

| | hot | warm | cold |
|---|---|---|---|
| high | 0 | 0 | 7/27 |
| low | 2/27 | 1/3 | 2/27 |
| off | 7/27 | 0 | 0 |
Now in this case, these two do have different entropies: $P^{t+1}_{RT}$ has an entropy of 1.0 bits, and $P'^{t+1}_{RT}$ has an entropy of 2.1 bits, a difference of 1.1 bits. This is the optimizing-ness we defined in the last post, but I think it’s actually somewhat incomplete.
Let’s also consider the initial difference between $P^t_{RT}$ and $P'^t_{RT}$. Decorrelating $P^t_{RT}$ takes it from 2.2 to 3.0 bits of entropy, so the entropy difference started off at 0.8 bits. Therefore the difference of the differences in entropy is 0.3 bits.
The value associated with $P^{t-1}_{RT}$ is equal to $(S[P'^{t+1}_{RT}] - S[P^{t+1}_{RT}]) - (S[P'^{t}_{RT}] - S[P^{t}_{RT}])$, which can also be expressed as $S[P'^{t+1}_{RT}] + S[P^{t}_{RT}] - S[P^{t+1}_{RT}] - S[P'^{t}_{RT}]$. We might call this quantity the adjusted optimizing-ness.
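As a check, a short sketch (again my own code, with hypothetical helper names) reproduces these numbers for the smart thermostat:

```python
from math import log2

# Smart thermostat dynamics: (room, thermostat) -> next state.
DYN = {
    ("hot", "high"): ("hot", "off"), ("warm", "high"): ("hot", "low"),
    ("cold", "high"): ("warm", "low"),
    ("hot", "low"): ("hot", "off"), ("warm", "low"): ("warm", "low"),
    ("cold", "low"): ("cold", "high"),
    ("hot", "off"): ("warm", "low"), ("warm", "off"): ("cold", "low"),
    ("cold", "off"): ("cold", "high"),
}

def step(dist):
    """One deterministic timestep applied to a distribution."""
    out = {}
    for s, p in dist.items():
        out[DYN[s]] = out.get(DYN[s], 0.0) + p
    return out

def decorrelate(dist):
    """Product of the marginals over s_R and s_T."""
    p_r, p_t = {}, {}
    for (r, t), p in dist.items():
        p_r[r] = p_r.get(r, 0.0) + p
        p_t[t] = p_t.get(t, 0.0) + p
    return {(r, t): p_r[r] * p_t[t] for r in p_r for t in p_t}

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

p_t = step({s: 1 / 9 for s in DYN})  # P^t_RT from the uniform prior
adjusted = (entropy(step(decorrelate(p_t))) + entropy(p_t)
            - entropy(step(p_t)) - entropy(decorrelate(p_t)))
print(round(adjusted, 2))  # 0.28 bits, i.e. the 0.3 quoted from rounded values
```

The unrounded value is about 0.28 bits; the 0.3 in the text comes from subtracting the already-rounded 1.1 and 0.8.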
Quantitative Data
The motivation for this was that a maximum entropy distribution is “natural” in some sense. This moves us towards not needing uncertainty. If we have a given state of a system, we might be able to “naturally” define a probability distribution around that state. Then we can measure the optimizing-ness of the next step’s distribution.
What happens with a different $P^{t-1}_{RT}$ condition? What if we have a distribution like this, for some small $\epsilon$:

| | hot | warm | cold |
|---|---|---|---|
| high | $\epsilon^2/4$ | $\epsilon(1-\epsilon)/2$ | $\epsilon^2/4$ |
| low | $\epsilon(1-\epsilon)/2$ | $(1-\epsilon)^2$ | $\epsilon(1-\epsilon)/2$ |
| off | $\epsilon^2/4$ | $\epsilon(1-\epsilon)/2$ | $\epsilon^2/4$ |
Now $P^t_{RT}$ is like this:

| | hot | warm | cold |
|---|---|---|---|
| high | 0 | 0 | $\epsilon(1-\epsilon)/2+\epsilon^2/4$ |
| low | $\epsilon(1-\epsilon)/2$ | $(1-\epsilon)^2+\epsilon^2/2$ | $\epsilon(1-\epsilon)/2$ |
| off | $\epsilon(1-\epsilon)/2+\epsilon^2/4$ | 0 | 0 |
So $P^{t+1}_{RT}$ is:

| | hot | warm | cold |
|---|---|---|---|
| high | 0 | 0 | $\epsilon(1-\epsilon)/2$ |
| low | 0 | $1-\epsilon(1-\epsilon)$ | 0 |
| off | $\epsilon(1-\epsilon)/2$ | 0 | 0 |
While it is theoretically possible to decorrelate everything, calculate the next set of distributions, and keep going analytically, it’s a huge mess. Using values of $\epsilon$ between $0.1$ and $10^{-10}$, we can make the following plot of our previously defined adjusted optimizing-ness against the entropy of $P^{t-1}_{RT}$.
It looks linear on the log/log plot, particularly in the region where $\epsilon$ is very small. Fitting to the leftmost five points gives a simple linear relation: the adjusted optimizing-ness approaches half of the entropy of $P^{t-1}_{RT}$.
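This relation can be probed numerically. Below is a sketch (my own code; the `initial` builder mirrors the product form of the $\epsilon$ table above) that computes the ratio of the adjusted optimizing-ness to $S[P^{t-1}_{RT}]$ as $\epsilon$ shrinks:

```python
from math import log2

# Smart thermostat dynamics: (room, thermostat) -> next state.
DYN = {
    ("hot", "high"): ("hot", "off"), ("warm", "high"): ("hot", "low"),
    ("cold", "high"): ("warm", "low"),
    ("hot", "low"): ("hot", "off"), ("warm", "low"): ("warm", "low"),
    ("cold", "low"): ("cold", "high"),
    ("hot", "off"): ("warm", "low"), ("warm", "off"): ("cold", "low"),
    ("cold", "off"): ("cold", "high"),
}

def step(dist):
    out = {}
    for s, p in dist.items():
        out[DYN[s]] = out.get(DYN[s], 0.0) + p
    return out

def decorrelate(dist):
    p_r, p_t = {}, {}
    for (r, t), p in dist.items():
        p_r[r] = p_r.get(r, 0.0) + p
        p_t[t] = p_t.get(t, 0.0) + p
    return {(r, t): p_r[r] * p_t[t] for r in p_r for t in p_t}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def initial(eps):
    """The near-delta product distribution from the epsilon table."""
    m_r = {"hot": eps / 2, "warm": 1 - eps, "cold": eps / 2}
    m_t = {"high": eps / 2, "low": 1 - eps, "off": eps / 2}
    return {(r, t): m_r[r] * m_t[t] for r in m_r for t in m_t}

def ratio(eps):
    """Adjusted optimizing-ness divided by the entropy of P^{t-1}_RT."""
    p_prev = initial(eps)
    p_t = step(p_prev)
    adjusted = (entropy(step(decorrelate(p_t))) + entropy(p_t)
                - entropy(step(p_t)) - entropy(decorrelate(p_t)))
    return adjusted / entropy(p_prev)

for eps in (1e-2, 1e-4, 1e-6):
    print(eps, round(ratio(eps), 3))  # the ratio tends towards 0.5
```

A small-$\epsilon$ expansion agrees: to first order in $\epsilon$, the adjusted optimizing-ness is exactly half of $S[P^{t-1}_{RT}]$.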
This is kind of weird. This might not be an ideal system to study, so let’s look at another toy example: a more realistic model of a thermostat.
The Continuous Thermostat
The temperature of the room is $S_R \in \mathbb{R}$, and the activity of the thermostat is $S_T \in \mathbb{R}$. Each timestep, we have the following updates:
$S^{t+1}_T = S^t_R$
$S^{t+1}_R = S^t_R - k S^t_T$
Consider the following distributions:
$P^{t-1}_R(s_R) \sim U(10-\epsilon/2,\ 10+\epsilon/2)$
$P^{t-1}_T(s_T) \sim U(10-\epsilon/2,\ 10+\epsilon/2)$
Where $U(a,b)$ refers to a uniform distribution between $a$ and $b$. $P^{t-1}_{RT}$ can be thought of as a square of side length $\epsilon$ centered on the point $(10,10)$. $P^t_{RT}$ turns out to be a parallelogram. The corners transform like this:

| Time = $t-1$ | Time = $t$ |
|---|---|
| $(10+\epsilon/2,\ 10+\epsilon/2)$ | $(10(1-k)+\tfrac{\epsilon}{2}(1-k),\ 10+\epsilon/2)$ |
| $(10+\epsilon/2,\ 10-\epsilon/2)$ | $(10(1-k)+\tfrac{\epsilon}{2}(1+k),\ 10+\epsilon/2)$ |
| $(10-\epsilon/2,\ 10+\epsilon/2)$ | $(10(1-k)-\tfrac{\epsilon}{2}(1+k),\ 10-\epsilon/2)$ |
| $(10-\epsilon/2,\ 10-\epsilon/2)$ | $(10(1-k)-\tfrac{\epsilon}{2}(1-k),\ 10-\epsilon/2)$ |
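As a quick sanity check on the corner arithmetic, here is a tiny sketch (my own code) applying one step of the update rule to the square’s corners, with $\epsilon = 0.1$ and $k = 0.3$:

```python
def step(s_r, s_t, k=0.3):
    """One timestep of the continuous thermostat:
    S_R' = S_R - k * S_T and S_T' = S_R."""
    return s_r - k * s_t, s_r

eps, k = 0.1, 0.3
# Corners of the square of side eps centered on (10, 10).
corners = [(10 + a * eps / 2, 10 + b * eps / 2)
           for a in (1, -1) for b in (1, -1)]
for corner in corners:
    print(corner, "->", step(*corner, k=k))
```

Each corner $(s_R, s_T)$ lands at $(s_R - k s_T,\ s_R)$, tracing out the parallelogram at time $t$.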
For $\epsilon=0.1,\ k=0.3$ the whole sequence looks like the following:
So we clearly have some sort of optimization going on here. Estimating or calculating the entropy of these distributions is not easy. And when we use the entropy of a continuous distribution, we get results which depend on the choice of coordinates (or alternatively the choice of some weighting function). Entropies of continuous distributions may also be negative, which is quite annoying.
Perhaps calculating the variance will leave us better off? Sadly not. I tried it for Gaussians of decreasing variance and didn’t get much. The equivalent to our adjusted optimizing-ness, which we might define as $\log(V[P'^{t+1}_{AB}]) + \log(V[P^{t}_{AB}]) - \log(V[P'^{t}_{AB}]) - \log(V[P^{t+1}_{AB}])$, is always zero for this system. The non-adjusted version $\log(V[P'^{t+1}_{AB}]) - \log(V[P^{t+1}_{AB}])$ fluctuates a lot.
Where does this leave us?
We can define whether something is an optimizer based on a probability distribution which need not be joint over A and B. This means we can define whether something is an optimizer for an arbitrarily narrow probability distribution, meaning we can take the limit as the probability distribution approaches a delta. We found an interesting relation between quantities in our simplified system but failed to extend it to a continuous system.