Defining Optimization in a Deeper Way Part 1

My aim is to define optimization without making reference to the following things:

  1. A “null” action or “nonexistence” of the optimizer. This is generally poorly defined, and choices of different null actions give different answers.

  2. Repeated action. An optimizer should still count even if it only does a single action.

  3. Uncertainty. We should be able to define an optimizer in a fully deterministic universe.

  4. Absolute time at all. This will be the hardest, but it would be nice to define optimization without reference to the “state” of the universe at “time t”.

Attempt one

First let’s just eliminate the concept of a null action. Imagine the state of the universe at a time .

Let’s divide the universe into two sections and call these and . They have states and . If we want to use continuous states we’ll need to have some metric which applies to these states, so we can calculate things like the variance and entropy of probability distributions over them.

Treat and as part of a Read-Eval-Print-Loop. Each produces some output which acts like a function mapping , and vice versa. can be thought of as things which cross the Markov blanket.

Sadly we still have to introduce probability distributions. Let’s consider a joint probability distribution , and also the two individual probability distributions and .

By defining distributions over outputs based on the distribution , we can define in the “normal” way. This looks like integrating over the space of and like so:

What this is basically saying is that to define the probability distribution of states and , we integrate over all states and and sum up the states where the corresponding to maps to the given .

Now lets define an “uncorrelated” version of , which we will refer to as .

This loosely represents what happens if we decorrelate and . In the language of humans, this is like an agent taking a random move from a selection.

We can refer to a probability distribution as an “optimizing” probability distribution if is higher entropy than .


For an example, imagine the universe is divided into two parts: a room and a thermostat . The room can have states in the set , and the thermostat can have states . Imagine that and are defined as follows:





Basically the thermostat decides whether the room gets warmer, stays the same, or gets colder, and the thermostat.

We can also consider the probability mass flowing from each of the nine states to another one:

Imagine the following :

00
00
00

This will give us the following :

00
00
00

Which has 1.6 bits of entropy.

And the following :

Which has 2.7 bits of entropy.

This means that the joint-ness of the probability distribution has removed 1.1 bits of entropy from the system. We say that our choice of is optimizing, with an optimizing strength of 1.1 bits.


But what if we consider a “smarter” thermostat, which turns off just before the temperature changes.



With the same choice of :

00
00
00

This will give us the following :

000
010
000

With an entropy of zero.

And the following :

Which has 2.2 bits of entropy.

In the new system, has an optimizing strength of 2.2 bits, approximately twice as much. This indicates that the latter system is “better” at optimizing the distribution in some way.


So we have eliminated the idea of needing the optimizer to have clearly-defined existence/​nonexistence cases, or needing some “null” action to compare its outputs to. This is good. We have also eliminated the concept of repeated action.

Next I will attempt to eliminate the need to start with a probability distribution. In both of the examples above, our choice of was important. I want to find a more “natural” way of defining probability distributions.