Embedded Agents are Quines
Is entropy objective or subjective?
In one sense it's subjective, if you believe in subjective probabilities. I.e. if we use Shannon's entropy formula, $-\sum_i p_i \log p_i$, then the formula itself is objective, but Bayesians would have subjective probabilities $p_i$, which might vary from person to person.
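Here's a quick Python sketch of that point (the two distributions are invented for illustration): the formula is the same for everyone, but the $p_i$ you plug into it can differ from person to person.

```python
import math

def shannon_entropy(p):
    """Shannon entropy -sum_i p_i log p_i, in bits."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

# Two observers assign different (subjective) probabilities to the same
# four-outcome event; the formula itself is the same for both of them.
alice = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain: 2 bits
bob   = [0.97, 0.01, 0.01, 0.01]   # nearly certain: ~0.24 bits

print(shannon_entropy(alice), shannon_entropy(bob))
```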
On the other hand, it does seem to be the case that there are ways in which entropy is very clearly objective. Like the entropy of a box of gas isn’t going to vary from person to person, and it’s not possible to avert the heat death of the universe by simply changing your subjective beliefs in such a way that entropy goes down.
It’s a strange tension to have in your worldview, that entropy should be both subjective and somehow also not. I have this meme lying around somewhere, let me see if I can pull it up...
“The second law of thermodynamics from various perspectives”:
Anyway, the joke is to ask what the second law of thermodynamics looks like for people in various states of knowledge. For us humans, there is much we don't know; we're surrounded by gas (and other things) made from atoms bouncing around in ways we haven't observed. And entropy goes up all the time, whenever there is friction or any other dissipative process. Then, Laplace's demon is a stand-in for what we'd call a "logically omniscient being". The demon isn't all-knowing, but it has infinite memory and computational power. Since the laws of physics are reversible, entropy never increases for such a being. Apparent increases in entropy for us mortals are just the result of probability distributions becoming too complicated and tangled up for us to track. And then "God" would be a stand-in for true omniscience, that actually does know the position of every atom. For God, for whom everything is certain, entropy is 0. (The rest of the meme is just a comment on the fact that Conway's Game of Life is not reversible, left as an exercise for the reader.)
You seem not to have fallen for my trick question. Often, the people I talk to assume that entropy is necessarily an objective quantity. After all, entropy $S$ appears in thermodynamic equations like $dS = \frac{\delta Q}{T}$. But what I find most interesting about your comment is the phrase "if you believe in subjective probabilities".
Are probabilities subjective? Litany of Tarski.
If probabilities are subjective,
I desire to believe that probabilities are subjective;
If probabilities are not subjective,
I desire to believe that probabilities are not subjective;
Let me not become attached to beliefs I may not want.
Okay, to me, the Litany of Tarski makes the most sense in the context of questions that are clearly empirical. Like whether or not sterile neutrinos are a thing or whatever. In this case, what do I expect to observe if probabilities are fundamentally subjective, or fundamentally objective? It’s just not an empirical question.
In another way, though, I think that Tarski's litany provides an answer to the question. The rough outline of the answer, so as not to keep you in suspense, is that while probabilities can be subjective, the laws of probability, the rules governing how probabilities can be manipulated, are immutable.
Probabilities can be generalized from logic, where "true" becomes "1" and "false" becomes "0". So if we generalize the Litany of Tarski to a probabilistic setting, we get something like:
If X happens, I wish to have assigned it probability 1.
If X does not happen, I wish to have assigned it probability 0.
And if I assign a probability $p$ in between, the score is $\log p_i$, where $i$ is the actual outcome.
As I’m sure you know, this is actually a description of the loss function in machine learning algorithms that have to classify things into categories. Probabilities are subjective in the sense that different people might have different probability estimates for various events, but the log probability score itself is not up for grabs.
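For concreteness, here's a minimal sketch of that scoring rule, written out by hand rather than taken from any particular ML library; it's just the negative-log-likelihood / cross-entropy idea.

```python
import math

def log_score(p, actual_outcome):
    """Return log p_i for the outcome i that actually happened.

    Perfect confidence in the right answer scores 0 (= log 1); any
    probability below 1 on the true outcome scores negative, and
    assigning it probability 0 scores negative infinity. The usual
    classification loss is just the negation of this score.
    """
    return math.log(p[actual_outcome])

forecast = {"rain": 0.7, "no rain": 0.3}
print(log_score(forecast, "rain"))     # ~ -0.357
print(log_score(forecast, "no rain"))  # ~ -1.204
```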
Well stated. Are probabilities observable phenomena?
The probabilities I have in my head are encoded as ways my neurons are connected, and maybe neurotransmitter concentrations and that kind of thing. That’s clearly observable in principle. Most people will flinch if you throw a ball at their head. Why did they flinch? Because they assigned a high probability to the ball striking them. So that’s certainly an observable consequence of a probability.
The question is easy if we’re talking about subjective probabilities. It’s only if one wishes to have objective probabilities that the question of whether or not we can observe them becomes difficult.
I think I asked that question wrong. Instead of “observable phenomena”, I should have written “material phenomena”.
You and I are both material reductionists. The point I'm trying to elucidate isn't an attack on physical materialism. Rather, I'm trying to draw attention to a certain facet of the Bayesian interpretation of probability.
I call this problem the Belief-Value Uncertainty Principle. Biological neuron connections are a black box. They’re non-interpretable. All we can observe are the inputs and the outputs. We can’t observe beliefs or values directly.
If we knew inputs, outputs and values, then we could infer beliefs. If we knew inputs, outputs and beliefs, then we could infer values. But for every possible belief, there is a corresponding value that could match our observations. And for every corresponding value, there is a belief that could match our observations.
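Here's a toy illustration of that underdetermination (all the numbers are invented): for any alternative belief you pick, there's a value function that reproduces exactly the same expected utilities, and hence the same observable choices.

```python
# Toy demonstration that behaviour underdetermines (belief, value) pairs.
# An agent picks the action with the highest expected utility:
#   EU(a) = sum_s P(s) * U(s, a)
# Given choices consistent with (P1, U1), the pair (P2, U2) with
# U2(s, a) = U1(s, a) * P1(s) / P2(s) yields identical expected utilities,
# so watching the agent's actions can never tell the two apart.

states, actions = ["sunny", "rainy"], ["picnic", "museum"]

P1 = {"sunny": 0.9, "rainy": 0.1}
U1 = {("sunny", "picnic"): 10, ("rainy", "picnic"): -5,
      ("sunny", "museum"): 3,  ("rainy", "museum"): 4}

P2 = {"sunny": 0.5, "rainy": 0.5}   # very different beliefs
U2 = {(s, a): U1[s, a] * P1[s] / P2[s] for s in states for a in actions}

def expected_utility(P, U, a):
    return sum(P[s] * U[s, a] for s in states)

for a in actions:
    print(a, expected_utility(P1, U1, a), expected_utility(P2, U2, a))
# Both pairs give 8.5 for "picnic" and 3.1 for "museum", so both predict
# the same observed behaviour.
```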
Yeah, it’s kind of like if we’re space aliens watching some Greek mathematicians doing geometry with straightedge and compass. They’re drawing things that are approximately circles and triangles, but only approximately. The true reality is that they’re working on paper and drawing in charcoal (or whatever was actually used?) and the lines have thickness and the circles aren’t quite round and so zooming in you see all kinds of problems with how this system isn’t perfectly imitating the mathematical ideal. And maybe one of the guys is being really sloppy in his work and just drawing freehand for everything. So there’s a philosophical question of “at what point are these not even circles anymore”? And we could in a similar way ask: human neurons are just approximating probability, and some people are actually really bad at reasoning, but nobody is coming close to following the ideal mathematical laws of probability. At what point does this “probability” idea stop being a good description of how someone is thinking? What if it’s not a good description for how literally anybody in the world is thinking?
In his book (which I haven't read all of), E.T. Jaynes made quite a good decision about how to present probability. He didn't introduce it as the science of how humans think. He presented it in terms of "let's suppose we were going to build an ideal reasoning robot. What would be the best way to design it?"
Indeed. Instead of human beings, let’s consider an ideal reasoning robot with a legible world module inside its positronic brain.
Something interesting to me about such a being is that its potential world models and its potential value functions are not orthogonal.
Not to go in for a full-throated defence of the orthogonality thesis just yet, but looking at your article there, it seems like you’re predicting that the world-model would have a tendency to try and manipulate the values/search parts of the agent. I don’t see how that’s necessarily true? Like if I imagine training a feed-forward neural network to make “what comes next?” predictions for a rules-based environment, then it seems like it just tries to get better and better at making those predictions. If I vary the difficulty of the predictions I ask it to make, how do we end up with that network attempting to manipulate me into giving it the easiest predictions?
What kind of neural network? Is it purely functional (as in "functional programming") or does it have a legible representation of the Universe? That is, is it stateful?
By the way, you don’t need to defend the orthogonality thesis from all possible attacks. I’m just claiming that legible embedded world optimizers have at least one belief-value constraint. I’m not taking any other issues with the orthogonality thesis (in this dialogue).
For the world model, I’m imagining a feed-forward network, but GPT-4 is feed-forward, so it could still be quite sophisticated, with attention layers and that kind of thing. GPT-4 also gets a scratch pad when doing auto-regressive generation, so we can imagine that the world model could also get a scratch pad to work out intermediate states when making its predictions. The scratchpad is stateful, so I guess it in some sense has a stateful representation of the universe. Maybe not a legible one, though.
Can we make it legible by defining a fixed function F : scratchpad → universe that legibly maps from the scratchpad to universe macrostates?
I guess the problem I see with that idea would be that we have some limited number of macrostates, based on the size of our scratchpad. If the function F is fixed, then that kills a lot of flexibility. Like if we're talking about programming computers, and the function F coarse-grains out details like the exact positions of electrons, then we're building an agent that can't see the point of programming. As far as it could tell, it's not changing the macrostate by rearranging all these electrons. Also, when humans program, we don't usually think of ourselves as rearranging electrons, though that is what we're in fact doing. We think in terms of this separate ontology of "I'm writing a computer program".
So in other words, we have various abstractions we use to navigate the universe, pulling on different ones at different times. If F could be a learned function instead of a fixed one, then we might be getting somewhere.
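To make that concrete, here's a toy sketch (the microstate representation is invented): if the fixed F throws away electron-level detail, then an action that only rearranges electrons can't change the macrostate, so it can't change anything a macrostate-level value function cares about.

```python
# Toy coarse-graining: a "microstate" is (room layout, electron arrangement).
# The fixed map F keeps only the room layout.

def F(microstate):
    room_layout, _electron_arrangement = microstate
    return room_layout   # macrostate: electron-level detail is discarded

before = ("desk, computer", "electrons idle")
after  = ("desk, computer", "electrons arranged into a working program")

# To an agent whose values are defined on F's macrostates, "writing the
# program" looks like a no-op: the macrostate did not change.
print(F(before) == F(after))   # True
```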
It doesn’t matter to me whether F is a learned function, but it is important that F is fixed. (This will matter later.)
I acknowledge that F's mapping of single scratchpad states to infinitely many universe states creates an implicit fungibility between universe states that constrains what it's possible to express as a value function. If $F^{-1}$ maps "Lsusr eats breakfast" and "Lsusr doesn't eat breakfast" to the same scratchpad state, then it's impossible to define a value function that causes me to eat breakfast. That's not where I think the orthogonality thesis breaks.
Okay, interesting. Let’s say the following then: The scratchpad consists of two sections: one holds the source code for the function F and the other holds some long string of binary data, D. And then the statement is that F maps from microstates of the universe to some corresponding value of D.
I think this gets around the issue of flexibility, while still hopefully preserving the basic idea of having some legible mapping between states of the universe and the data in the scratchpad.
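Something like this, as a very rough Python sketch with invented names (not meant as a real design, just to pin down the layout):

```python
from dataclasses import dataclass

@dataclass
class Scratchpad:
    f_source: str   # human-readable source code for the map F
    data: bytes     # D: the description of the universe produced by F

    def run_F(self, universe_microstate):
        """Evaluate F on a universe microstate and store the result in D."""
        namespace = {}
        exec(self.f_source, namespace)        # defines a function named F
        self.data = namespace["F"](universe_microstate)

pad = Scratchpad(
    f_source="def F(microstate):\n    return bytes(microstate[:8])",
    data=b"",
)
pad.run_F(list(range(100)))   # a stand-in for a universe microstate
print(pad.data)               # b'\x00\x01\x02\x03\x04\x05\x06\x07'
```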
Is F’s source code human-readable?
We might as well specify that it is, no?
I guess what actually matters is that F is human-writable. Anyway…go on. :)
Okay, so that's the world model handled. The robot should also presumably have some values that it pursues. Let's say we have some learned function U that assigns utilities to states of the universe. (We'll limit the possible functions F above to just those functions that preserve enough information that the value of U can be inferred from D.)
So anyway, we have this function U and then we’d like to search across actions a and functions F. The query to the world model is: suppose we’ve taken the action a, what is the expected value of U?
We can try different functions F (which determine the precision of the world model's predictions; some functions F will show programming as a potentially valuable strategy, and others won't, depending on whether or not F keeps track of electrons). The agent picks the action with the highest expected utility across all choices of F.
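Schematically, the search loop might look like this; every function here is a hypothetical stub standing in for the real world model, utility function, and so on:

```python
# Schematic decision procedure: search over actions a and coarse-grainings F,
# ask the world model "if I take a, what does the world look like under F?",
# score the predicted D with U, and pick the best action found anywhere.

def choose_action(actions, coarse_grainings, world_model, U):
    best_action, best_value = None, float("-inf")
    for F in coarse_grainings:
        for a in actions:
            predicted_D = world_model(a, F)   # prediction written into D
            value = U(predicted_D)            # utility inferred from D
            if value > best_value:
                best_action, best_value = a, value
    return best_action

# Tiny stand-ins, just so the function runs:
actions = ["write program", "do nothing"]
coarse_grainings = ["track electrons", "ignore electrons"]

def world_model(a, F):
    if F == "ignore electrons":
        return "unchanged room"               # programming is invisible here
    return "working program" if a == "write program" else "unchanged room"

def U(D):
    return {"working program": 10, "unchanged room": 0}[D]

print(choose_action(actions, coarse_grainings, world_model, U))  # write program
```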
(There's an issue of how the agent is to predict its own future actions. One simplifying assumption that I sometimes like to make in this kind of very abstract "let's think of an agent design" discussion is to suppose that the agent only needs to make a single (possibly very complex) decision in the entire history of the universe. The rationale is that the agent can decide in advance "if A happens, then I'm going to do X", and "if B happens then I'm going to do Y". So we pack all those future potential observations and actions into a single decision made right at the start. This works so long as we're very unconstrained in terms of compute (i.e. we have exponential amounts of compute).)
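Pictured very crudely (the observation and action labels are made up), that single decision is just an entire contingency table chosen up front:

```python
# The "single decision" is a complete contingent policy fixed at time zero:
# for every observation the agent might ever make, an action is chosen in advance.
policy = {
    "A happens": "do X",
    "B happens": "do Y",
}

# Acting later is just a lookup, with no further deliberation, which is why
# this simplification only works with essentially unlimited compute up front.
observation = "B happens"
print(policy[observation])   # do Y
```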
It sounds like D is a quine.
That depends on which choice of F we're working with. If it ignores the exact positions of electrons (or even just the positions of the electrons inside the robot), then it's not a quine. If F does map information about the inner workings of the bot into D, then there would be at least some self-reference. In an extreme case where F's output is just the contents of the D data store, that's quite a lot like a quine. Except that the contents of D change over time as the robot considers different actions and uses different functions F to try and analyze the world. The contents of D after the robot has already decided on an action are almost uninteresting; it's just whatever possibility the robot last searched.
I guess it’s not a perfect quine. The important thing is that D sometimes (especially for smart, powerful robots) must reference itself, without contradiction. There exist some functions F such that this self-reference imposes a constraint on D.
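Just to make the self-reference idea concrete, here's a classic Python quine: the two lines below print themselves verbatim, i.e. the program references its own source without contradiction.

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```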
Okay, so if F takes the whole world-history (and predicted future) as input, then we can consider some particular recursive $F_r$ that looks in particular at the robot's thoughts at the time when it was analyzing the universe using $F_r$, and then just dumps the data it finds in D to its output. (We can also consider the paradoxical $F_r'$, which inverts all the bits before dumping them. Which I guess means that the world model can't be powerful enough to model itself too well.)
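And the paradoxical $F_r'$ is just a bit-flipping diagonalization; here's a toy check (for short bitstrings) that no D can satisfy it:

```python
from itertools import product

def invert(bits):
    return tuple(1 - b for b in bits)

# The paradoxical F_r' demands a D satisfying D == invert(D). No bitstring
# does, so a world model that could perfectly predict its own scratchpad and
# then invert it would contradict itself.
for n in range(1, 9):
    assert not any(D == invert(D) for D in product((0, 1), repeat=n))
print("no fixed point of bit-inversion exists for any length up to 8")
```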
Correct. When a world model becomes too powerful, it cannot model itself. If a powerful world model cannot model itself, then it cannot model the world. If a powerful world model cannot model the world…then the orthogonality thesis shatters under its own weight.
I don't remember exactly where, but Eliezer wrote about that view of entropy in the Quantum Physics Sequence.
And he wrote about this view of probability in Probability is Subjectively Objective.
Are you thinking of The Second Law of Thermodynamics and Engines of Cognition?
Yes, I think so! Thanks :)
It turns out that the Second Law of Thermodynamics is a special case of a more general result called Noether's Theorem, which tells us that a differentiable symmetry of a system with conservative forces automatically generates its own conservation law, and the conservation of energy/the Second Law is a special case of this when the symmetry is time.
Here, a symmetry is a situation where a feature is preserved under transformations. The fact that physics is time-symmetric, meaning that it doesn't matter whether an identical physical process happens now, in the past, or in the future, means that the Second Law of Thermodynamics/Conservation of Energy automatically pops out.
It is not a probabilistic statement, as long as we accept time-symmetry.
https://en.wikipedia.org/wiki/Noether's_theorem
There is another derivation of the Second Law that is more specific to quantum mechanics, and this is also explanatory, but it’s pretty specific to our universe:
https://www.quantamagazine.org/physicists-trace-the-rise-in-entropy-to-quantum-information-20220526/
Conservation of energy is not the same thing as the second law of thermodynamics.
Are you talking about the difference between the First and Second Laws of thermodynamics?
Yes.