Defining Optimization in a Deeper Way Part 1
My aim is to define optimization without making reference to the following things:

- A “null” action or “nonexistence” of the optimizer. This is generally poorly defined, and different choices of null action give different answers.
- Repeated action. An optimizer should still count even if it only takes a single action.
- Uncertainty. We should be able to define an optimizer in a fully deterministic universe.
- Absolute time. This will be the hardest, but it would be nice to define optimization without reference to the “state” of the universe at “time t”.
Attempt one
First let’s just eliminate the concept of a null action. Imagine the state of the universe at a time $t$.
Let’s divide the universe into two sections and call these $A$ and $B$. They have states $S_A$ and $S_B$. If we want to use continuous states we’ll need some metric $D(s_1, s_2)$ which applies to these states, so we can calculate things like the variance and entropy of probability distributions over them.
Treat $S_A$ and $S_B$ as part of a Read-Eval-Print-Loop: each $s_A^t$ produces some output $O_A^t$ which acts like a function mapping $s_B^t \to s_B^{t+1}$, and vice versa. $O_A^t$ can be thought of as the things which cross the Markov blanket between the two sections.
Sadly we still have to introduce probability distributions. Let’s consider a joint probability distribution $P_{AB}^t(s_A, s_B)$, along with the two marginal distributions $P_A^t(s_A)$ and $P_B^t(s_B)$.
By defining distributions over the outputs $O$ based on the distribution $P_{AB}^t$, we can define $P_{AB}^{t+1}(s_A, s_B)$ in the “normal” way. This looks like integrating over the space of $s_A$ and $s_B$ like so:
$$P_{AB}^{t+1}(s_A^{t+1}, s_B^{t+1}) = \int P_{AB}^{t}(s_A^{t}, s_B^{t})\, \delta\!\left(O_A^{t}(s_B^{t}) - s_B^{t+1}\right) \delta\!\left(O_B^{t}(s_A^{t}) - s_A^{t+1}\right) \mathrm{d}s_A^{t}\, \mathrm{d}s_B^{t}$$
What this is basically saying is: to get the probability of the pair $(s_A^{t+1}, s_B^{t+1})$, we integrate over all states $s_A^t$ and $s_B^t$, summing up the probability mass of those pairs where the output $O_A^t$ corresponding to $s_A^t$ maps $s_B^t$ to the given $s_B^{t+1}$, and likewise where $O_B^t$ maps $s_A^t$ to the given $s_A^{t+1}$.
Now let’s define an “uncorrelated” version of $P_{AB}^{t+1}$, which we will refer to as $P'^{\,t+1}_{AB}$.
$$P'^{\,t+1}_{AB}(s_A^{t+1}, s_B^{t+1}) = \int P_{A}^{t}(s_A^{t})\, P_{B}^{t}(s_B^{t})\, \delta\!\left(O_A^{t}(s_B^{t}) - s_B^{t+1}\right) \delta\!\left(O_B^{t}(s_A^{t}) - s_A^{t+1}\right) \mathrm{d}s_A^{t}\, \mathrm{d}s_B^{t}$$
This loosely represents what happens if we decorrelate $s_A$ and $s_B$. In the language of humans, this is like an agent taking a move at random from its available options, rather than basing it on what it observes.
We can refer to a probability distribution $P_{AB}^t$ as an “optimizing” probability distribution if $P'^{\,t+1}_{AB}$ has higher entropy than $P_{AB}^{t+1}$.
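Since everything in the running example below is finite, the integrals collapse into sums, and the whole construction is easy to make concrete. Here is a minimal sketch in Python; the representation (nested dicts for the outputs $O$, a dict keyed by state pairs for the joint distribution) and the function names `step_joint`, `decorrelate`, `entropy`, and `optimizing_strength` are illustrative choices of mine, not anything from the definitions above.

```python
from collections import defaultdict
from math import log2

def step_joint(P, O_A, O_B):
    """Map P^t to P^{t+1}: each state pair (sA, sB) flows to
    (O_B[sB][sA], O_A[sA][sB]), carrying its probability mass."""
    P_next = defaultdict(float)
    for (sA, sB), p in P.items():
        P_next[(O_B[sB][sA], O_A[sA][sB])] += p
    return dict(P_next)

def decorrelate(P):
    """Replace a joint distribution with the product of its marginals,
    i.e. the distribution used to build P'."""
    PA, PB = defaultdict(float), defaultdict(float)
    for (sA, sB), p in P.items():
        PA[sA] += p
        PB[sB] += p
    return {(sA, sB): pa * pb
            for sA, pa in PA.items()
            for sB, pb in PB.items()}

def entropy(P):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in P.values() if p > 0)

def optimizing_strength(P, O_A, O_B):
    """Entropy of the decorrelated successor minus entropy of the
    actual successor; positive means P is optimizing."""
    return (entropy(step_joint(decorrelate(P), O_A, O_B))
            - entropy(step_joint(P, O_A, O_B)))
```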
For an example, imagine the universe is divided into two parts: a room $R$ and a thermostat $T$. The room can have states $s_R \in \{\text{hot}, \text{warm}, \text{cold}\}$, and the thermostat can have states $s_T \in \{\text{high}, \text{low}, \text{off}\}$. Imagine that $O_R$ and $O_T$ are defined as follows:
$$O_T[\text{high}]: \begin{cases} \text{hot} \to \text{hot} \\ \text{warm} \to \text{hot} \\ \text{cold} \to \text{warm} \end{cases} \qquad O_T[\text{low}]: \begin{cases} \text{hot} \to \text{hot} \\ \text{warm} \to \text{warm} \\ \text{cold} \to \text{cold} \end{cases} \qquad O_T[\text{off}]: \begin{cases} \text{hot} \to \text{warm} \\ \text{warm} \to \text{cold} \\ \text{cold} \to \text{cold} \end{cases}$$

$$O_R[\text{hot}]: \begin{cases} \text{high} \to \text{off} \\ \text{low} \to \text{off} \\ \text{off} \to \text{off} \end{cases} \qquad O_R[\text{warm}]: \begin{cases} \text{high} \to \text{low} \\ \text{low} \to \text{low} \\ \text{off} \to \text{low} \end{cases} \qquad O_R[\text{cold}]: \begin{cases} \text{high} \to \text{high} \\ \text{low} \to \text{high} \\ \text{off} \to \text{high} \end{cases}$$
Basically the thermostat decides whether the room gets warmer, stays the same, or gets colder, and the room decides whether the thermostat switches to high, low, or off.
We can also consider the probability mass flowing from each of the nine states to the next. Rows are the current thermostat state, columns are the current room state, and each cell gives the next (room, thermostat) pair:

|      | hot         | warm        | cold         |
|------|-------------|-------------|--------------|
| high | (hot, off)  | (hot, low)  | (warm, high) |
| low  | (hot, off)  | (warm, low) | (cold, high) |
| off  | (warm, off) | (cold, low) | (cold, high) |
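As a sanity check, the definitions of $O_T$ and $O_R$ can be encoded as nested dicts and the flow table above reproduced mechanically (the encoding follows the sketch’s conventions, with $T$ playing the role of $A$ and $R$ the role of $B$):

```python
# O_T[sT] maps the room state forward; O_R[sR] maps the thermostat
# state forward. These encode the definitions above.
O_T = {
    "high": {"hot": "hot",  "warm": "hot",  "cold": "warm"},
    "low":  {"hot": "hot",  "warm": "warm", "cold": "cold"},
    "off":  {"hot": "warm", "warm": "cold", "cold": "cold"},
}
O_R = {
    "hot":  {"high": "off",  "low": "off",  "off": "off"},
    "warm": {"high": "low",  "low": "low",  "off": "low"},
    "cold": {"high": "high", "low": "high", "off": "high"},
}

# Reproduce the nine-state flow table: (sT, sR) -> (sR', sT').
for sT in ["high", "low", "off"]:
    for sR in ["hot", "warm", "cold"]:
        print(f"({sT}, {sR}) -> ({O_T[sT][sR]}, {O_R[sR][sT]})")
```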
Imagine the following $P_{TR}^0(s_T, s_R)$:

|      | hot | warm | cold |
|------|-----|------|------|
| high | 0   | 0    | 1/3  |
| low  | 0   | 1/3  | 0    |
| off  | 1/3 | 0    | 0    |
This will give us the following $P_{TR}^1(s_T, s_R)$:

|      | hot | warm | cold |
|------|-----|------|------|
| high | 0   | 1/3  | 0    |
| low  | 0   | 1/3  | 0    |
| off  | 0   | 1/3  | 0    |
Which has $\log_2 3 \approx 1.6$ bits of entropy.
And the following $P'^{\,1}_{TR}(s_T, s_R)$:

|      | hot | warm | cold |
|------|-----|------|------|
| high | 0   | 1/9  | 2/9  |
| low  | 1/9 | 1/9  | 1/9  |
| off  | 2/9 | 1/9  | 0    |
Which has 2.7 bits of entropy.
This means that the joint-ness of the probability distribution $P_{TR}^0$ has removed $2.7 - 1.6 = 1.1$ bits of entropy from the system. We say that our choice of $P_{TR}^0$ is optimizing, with an optimizing strength of 1.1 bits.
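Continuing the sketch, all three numbers can be verified mechanically:

```python
# The initial joint distribution P_TR^0 from the table above,
# keyed by (sT, sR) pairs.
P0 = {("high", "cold"): 1/3, ("low", "warm"): 1/3, ("off", "hot"): 1/3}

P1 = step_joint(P0, O_T, O_R)                     # the correlated successor
P1_prime = step_joint(decorrelate(P0), O_T, O_R)  # the decorrelated successor

print(entropy(P1))                        # ~1.58 bits
print(entropy(P1_prime))                  # ~2.73 bits
print(optimizing_strength(P0, O_T, O_R))  # ~1.14 bits
```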
But what if we consider a “smarter” thermostat, which turns off just before the temperature changes? Only $O_R$ changes:
$$O_R[\text{hot}]: \begin{cases} \text{high} \to \text{off} \\ \text{low} \to \text{off} \\ \text{off} \to \text{low} \end{cases} \qquad O_R[\text{warm}]: \begin{cases} \text{high} \to \text{low} \\ \text{low} \to \text{low} \\ \text{off} \to \text{low} \end{cases} \qquad O_R[\text{cold}]: \begin{cases} \text{high} \to \text{low} \\ \text{low} \to \text{high} \\ \text{off} \to \text{high} \end{cases}$$
With the same choice of $P_{TR}^0(s_T, s_R)$ as before, this will give us the following $P_{TR}^1(s_T, s_R)$:

|      | hot | warm | cold |
|------|-----|------|------|
| high | 0   | 0    | 0    |
| low  | 0   | 1    | 0    |
| off  | 0   | 0    | 0    |
With an entropy of zero.
And the following $P'^{\,1}_{TR}(s_T, s_R)$:

|      | hot | warm | cold |
|------|-----|------|------|
| high | 0   | 0    | 2/9  |
| low  | 1/9 | 3/9  | 1/9  |
| off  | 2/9 | 1/9  | 0    |
Which has 2.2 bits of entropy.
In the new system, $P_{TR}^0$ has an optimizing strength of $2.2 - 0 = 2.2$ bits, approximately twice as much as before. This indicates that the latter system is “better” at optimizing the distribution $P_{TR}^0$ in some way.
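Reusing `P0`, `O_T`, and the helper functions from the sketches above, the new numbers check out as well:

```python
# The "smarter" O_R: the thermostat switches off (or down) just before
# the temperature would change. O_T is unchanged.
O_R_smart = {
    "hot":  {"high": "off", "low": "off",  "off": "low"},
    "warm": {"high": "low", "low": "low",  "off": "low"},
    "cold": {"high": "low", "low": "high", "off": "high"},
}

print(entropy(step_joint(P0, O_T, O_R_smart)))  # 0.0: all mass on (low, warm)
print(optimizing_strength(P0, O_T, O_R_smart))  # ~2.20 bits
```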
So we have eliminated the idea of needing the optimizer to have clearly-defined existence/nonexistence cases, or needing some “null” action to compare its outputs to. This is good. We have also eliminated the concept of repeated action.
Next I will attempt to eliminate the need to start with a probability distribution. In both of the examples above, our choice of $P_{TR}^0$ was important. I want to find a more “natural” way of defining probability distributions.