So there’s this thing where a system can perform more bits of optimization on its environment by observing some bits of information from its environment. Conjecture: observing an additional N bits of information can allow a system to perform at most N additional bits of optimization. I want a proof or disproof of this conjecture.
I’ll operationalize “bits of optimization” in a similar way to channel capacity, so in more precise information-theoretic language, the conjecture can be stated as: if the sender (but NOT the receiver) observes N bits of information about the noise in a noisy channel, they can use that information to increase the bit-rate by at most N bits per usage.
For once, I’m pretty confident that the operationalization is correct, so this is a concrete math question.
Toy Example
We have three variables, each one bit: Action (A), Observable (O), and outcome (Y). Our “environment” takes in the action and observable, and spits out the outcome, in this case via an xor function:
Y = A ⊕ O
We’ll assume the observable bit has a 50⁄50 distribution.
If the action is independent of the observable, then the distribution of outcome is the same no matter what action is taken: it’s just 50⁄50. The actions can perform zero bits of optimization; they can’t change the distribution of outcomes at all.
On the other hand, if the actions can be a function of O, then we can take either A = O or A = Ō (i.e. not-O), in which case Y will be deterministically 0 (if we take A = O), or deterministically 1 (for A = Ō). So, the actions can apply 1 bit of optimization to Y, steering Y deterministically into one half of its state space or the other half. By making the actions a function of observable O, i.e. by “observing 1 bit”, 1 additional bit of optimization can be performed via the actions.
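Here’s a minimal brute-force sketch of this toy example (the code and helper names like mutual_info and outcome are just illustrative, not part of the setup above):

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_info(joint):
    """I(X;Y) in bits, from a dict mapping (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return entropy(px.values()) + entropy(py.values()) - entropy(joint.values())

def outcome(a, o):
    return a ^ o                        # the environment: Y = A xor O

P_O = {0: 0.5, 1: 0.5}                  # the observable is a 50/50 bit

# Case 1: A independent of O (here A is itself a 50/50 coin). For any fixed a,
# Y is 50/50, so the action carries no information about the outcome.
joint_ay = {}
for a in (0, 1):
    for o, p_o in P_O.items():
        key = (a, outcome(a, o))
        joint_ay[key] = joint_ay.get(key, 0) + 0.5 * p_o
print("I(A;Y) without observation:", mutual_info(joint_ay))    # -> 0.0

# Case 2: A = pi(O); put 50/50 weight on the identity policy and the negation
# policy, and measure how much the choice of policy tells us about Y.
policies = {"identity": lambda o: o, "negate": lambda o: 1 - o}
joint_piy = {}
for name, pi in policies.items():
    for o, p_o in P_O.items():
        key = (name, outcome(pi(o), o))
        joint_piy[key] = joint_piy.get(key, 0) + 0.5 * p_o
print("I(pi;Y) with observation:", mutual_info(joint_piy))     # -> 1.0
```

The first case prints 0 bits and the second prints 1 bit, matching the story above.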
Operationalization
Operationalizing this problem is surprisingly tricky; at first glance the problem pattern-matches to various standard info-theoretic things, and those pattern-matches turn out to be misleading. (In particular, it’s not just conditional mutual information, since only the sender—not the receiver—observes the observable.) We have to start from relatively basic principles.
The natural starting point is to operationalize “bits of optimization” in a similar way to info-theoretic channel capacity. We have 4 random variables:
“Goal” G
“Action” A
“Observable” O
“Outcome” Y
Structurally:
G → A ← O, and (A, O) → Y
(This diagram is a Bayes net; it says that G and O are independent, A is calculated from G and O and maybe some additional noise, and Y is calculated from A and O and maybe some additional noise. So, P[G,O,A,Y] = P[G] P[O] P[A|G,O] P[Y|A,O].) The generalized “channel capacity” is the maximum value of the mutual information I(G;Y), over distributions P[A|G,O].
Intuitive story: the system will be assigned a random goal G, and then take actions A (as a function of observations O) to steer the outcome Y. The “number of bits of optimization” applied to Y is the amount of information one could gain about the goal G by observing the outcome Y.
In information-theoretic language:
G is the original message to be sent
A is the encoded message sent into the channel
O is noise on the channel
Y is the output of the channel
Then the generalized “channel capacity” is found by choosing the encoding to maximize I(G;Y).
I’ll also import one more assumption from the standard info-theoretic setup: G is represented as an arbitrarily long string of independent 50⁄50 bits.
So, fully written out, the conjecture says:
Let G be an arbitrarily long string of independent 50⁄50 bits. Let A, O, and Y be finite random variables satisfying
P[G,O,A,Y] = P[G] P[O] P[A|G,O] P[Y|A,O]
and define
Δ := (max_{P[A|G,O]} I(G;Y)) − (max_{P[A|G]} I(G;Y))
Then
Δ ≤ H(O)
Also, one slightly stronger bonus conjecture: Δ is at most I(A;O) under the unconstrained maximal P[A|G,O].
(Feel free to give answers that are only partial progress, and use this space to think out loud. I will also post some partial progress below. Also, thank you to Alex Mennen for some help with a couple conjectures along the path to formulating this one.)
Alright, I think we have an answer! The conjecture is false.
Counterexample: suppose I have a very-high-capacity information channel (N bit capacity), but it’s guarded by a uniform random n-bit password. O is the password, A is an N-bit message and a guess at the n-bit password. Y is the N-bit message part of A if the password guess matches O; otherwise, Y is 0.
Let’s say the password is 50 bits and the message is 1M bits. If A is independent of the password, then there’s a 2^-50 chance of guessing the password, so the bitrate will be about 2^-50 * 1M ≈ 2^-30, or about one-billionth of a bit in expectation.
If A “knows” the password, then the capacity is about 1M bits. So, the delta from knowing the password is a lot more than 50 bits. It’s a multiplier of 2^50, rather than an addition of 50 bits.
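To spell out that arithmetic (a rough back-of-the-envelope, treating a successful password guess as simply delivering the full message, and writing 1M ≈ 2^20):

\[
\text{no observation:}\quad \text{capacity}\;\approx\;2^{-50}\cdot 2^{20}\;=\;2^{-30}\ \text{bits},
\qquad
\text{observing }O:\quad \text{capacity}\;\approx\;2^{20}\ \text{bits},
\]

so the gain from observing the 50-bit password is roughly \(2^{20}\) bits, vastly more than \(H(O)=50\).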
This is really cool! It means that bits of observation can give a really ridiculously large boost to a system’s optimization power. Making actions depend on observations is a potentially-very-big game, even with just a few bits of observation.
Credit to Yair Halberstadt in the comments for the attempted-counterexamples which provided stepping stones to this one.
This is interesting, but it also somehow feels a bit like a “cheat” compared to the more “real” version of this problem (namely: if I know something about the world and can think intelligently about it, how much leverage can I get out of it?).
The kind of system in which you can pack so much information into an action, and in which a small bit of information buys you so much leverage, feels like it ought to be artificial. Trivially, this is actually what makes a lock (real or virtual) work: if you have one simple key/password, you get to do whatever with the contents. But the world as a whole doesn’t seem to work as a locked system (if it did, we would have magic: just a tiny, specific formula or gesture and we get massive results down the line).
I wonder if the key here isn’t in the entropy. Your knowing O here allows you to significantly reduce the entropy of the world as a whole. This feels akin to being a Maxwell demon. In the physical world, though, there are bounds on that sort of observation and action, exactly because being able to do them would allow you to violate the second law of thermodynamics. So I wonder if the conjecture may be true under some additional constraints which also include these common properties of macroscopic closed physical systems (while it remains false in artificial subsystems that we can build for the purpose, in which we only care about certain bits and not all the ones defining the underlying physical microstates).
I’m not sure about that; it seems like there’s lots of instances where just a few bits of knowledge gets you lots of optimization power. Knowing Maxwell’s equations lets you do electronics, and knowing which catalyst to use for the Haber process lets you make lots of food and bombs. If I encoded the instructions for making a nanofactory, that would probably be few bits compared to the amount of optimization you could do with that knowledge.
The important thing is that your relevant information isn’t about the state of the world, it’s about the laws. That’s the evolution map f, not the region O (going by the nomenclature I used in my other comment). Your knowledge about O when using the Haber process is actually roughly proportional to the output: you need to know that inside tank X there is such-and-such precursor, and it’s pure to a certain degree. That’s like knowing that a certain region of the bit string is prepared purely with 1s. But the laws are an interesting thing because they can have regularities (in fact, we do know they have them), so that they can be represented in compressed form, and you can exploit that knowledge. But also, to actually represent that knowledge in bits of world-knowledge you’d need to represent the state of all the experiments that were performed and from which that knowledge was inferred and generalized. Though volume wise, that’s still less than the applications… unless you count each application also as a further validation of the model that updates your confidence in it, at which point by definition the bits of knowledge backing the model are always more than the bits of order you got out of it.
Ah, interesting. If I were going down that path, I’d probably aim to use a Landauer-style argument. Something like, “here’s a bound on mutual information between the policy and the whole world, including the agent itself”. And then a lock/password could give us a lot more optimization power over one particular part of the world, but not over the world as a whole.
… I’m not sure how to make something like that nontrivial, though. Problem is, the policy itself would then presumably be embedded in the world, so I(π; world) is just H(π).
Here’s my immediate thought on it: you define a single world bit string W, and A, O and Y are just designated subsections of it. You are able to know only the contents of O, and can set the contents of A (this feels like it’s reducing the entropy of the whole world btw, so you could also postulate that you can only do so by drawing free energy from some other region, your fuel F: for each bit of A you set deterministically, you need to randomize two of F, so that the overall entropy increases). After this, some kind of map W→f(W) is applied repeatedly, evolving the system until such time comes to check that the region Y is indeed as close as possible to your goal configuration G. I think at this point the properties of the result will depend on the properties of the map—is it a “lock” map like your suggested one (compare a region of A with O, and if they’re identical, clone the rest of A into Y, possibly using up F to keep the entropy increase positive?). Is it reversible, is it chaotic?
Yeah, not sure, I need to think about it. Reversibility (even acting as if these were qubits and not simple bits) might be the key here. In general I think there can’t be any hard rule against lock-like maps, because the real world allows building locks. But maybe there’s some rule about how if you define the map itself randomly enough, it probably won’t be a lock-map (for example, you could define a map as a series of operations on two bits writing to a third one op(i,j)→k; decide a region of your world for it, encode bit indices and operators as bit strings, and you can make the map’s program itself a part of the world, and then define what makes a map a lock-like map and how probable that occurrence is).
The new question is: what is the upper bound on bits of optimization gained from a bit of observation? What’s the best-case asymptotic scaling? The counterexample suggests it’s roughly exponential, i.e. one bit of observation can double the number of bits of optimization. On the other hand, it’s not just multiplicative, because our xor example at the top of this post showed a jump from 0 bits of optimization to 1 bit from observing 1 bit.
I think it depends on the size of the world model. Imagine an agent maintaining a branch due to uncertainty between two world models. It can construct these models in parallel but doesn’t know which one is true. Every observation it makes has two interpretations. A single observation which conclusively determines which world model was correct could, I think, produce an arbitrarily large but resource-bounded update.
Isn’t it unbounded?
In an absolute sense, yes, but I expect it can be bounded as a function of bits of optimization without observation. For instance, if we could only at-most double the number of bits of opt by observing one bit, then that would bound bit-gain as a function of bits of optimization without observation, even though it’s unbounded in an absolute sense.
Unless you’re seeing some stronger argument which I have not yet seen?
The scaling would also be unbounded, at least that would be my default assumption without solid proof otherwise.
In other words I don’t see any reason to assume there must be any hard cap, whether at 2x or 10x or 100x, etc...
Here are two intuitive arguments:
If we can’t observe O, we could always just guess a particular value of O and then do whatever’s optimal for that value. Then with probability P[O = our guess], we’ll be right, and our performance is lower bounded by P[O = our guess]*(whatever optimization pressure we’re able to apply if we guess correctly).
The log-number of different policies bounds the log-number of different outcome-distributions we can achieve. And observing one additional bit doubles the log-number of different policies (sketched a bit more formally below).
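One way to cash out the second argument (my own sketch, writing |A| and |O| for the number of possible action and observation values, and π for a deterministic policy as in the “Eliminating G” section below):

\[
\#\{\text{deterministic policies}\} \;=\; |A|^{|O|},
\qquad
I(\pi;Y)\;\le\;H(\pi)\;\le\;\log_2|A|^{|O|}\;=\;|O|\,\log_2|A|,
\]

and observing one extra bit replaces the observation space with O × {0,1}, doubling |O| and hence doubling this upper bound.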
It’s possible to construct a counterexample where there’s a step from guessing at random to perfect knowledge after an arbitrary number of observed bits; n-1 bits of evidence are worthless alone and the nth bit lets you perfectly predict the next bit and all future bits.
Consider for example shifting bits in one at a time into the input of a known hash function that’s been initialized with an unknown value (and known width) and I ask you to guess a specified bit from the output; in the idealized case, you know nothing about the output of the function until you learn the final bit in the input (all unknown bits have shifted out) b/c they’re perfectly mixed, and after that you’ll guess every future bit correctly.
Seems like the pathological cases can be arbitrarily messy.
I don’t think that matters, because knowing all but the last bit, I can simply take two actions—action assuming last bit is true, and action assuming it’s false.
Not sure I’m following the setup and notation quite close enough to argue that one way or the other, as far as the order we’re saying the agent receives evidence and has to commit to actions. Above I was considering the simplest case of 1 bit evidence in, 1 bit action out, repeat.
I’m pretty sure that could be extended to get that “one small key/update that unlocks the whole puzzle” sort of effect and have the model click all at once. As you say though, not sure that gets to the heart of the matter regarding the bound; it may show that no such bound exists on the margin (the last piece can be much more valuable on the margin than all the prior pieces of evidence), but not necessarily in a way that violates the proposed bound overall. Maybe we have to see that last piece as unlocking some bounded amount of value from your prior observations.
Eliminating G
The standard definition of channel capacity makes no explicit reference to the original message G; it can be eliminated from the problem. We can do the same thing here, but it’s trickier. First, let’s walk through it for the standard channel capacity setup.
Standard Channel Capacity Setup
In the standard setup, A cannot depend on O, so our graph looks like
G → A → Y ← O
… and we can further remove O entirely by absorbing it into the stochasticity of Y.
Now, there are two key steps. First step: if A is not a deterministic function of G, then we can make A a deterministic function of G without reducing I(G;Y). Anywhere A is stochastic, we just read the random bits from some independent part of G instead; Y will have the same joint distribution with any parts of G which A was reading before, but will also potentially get some information about the newly-read bits of G as well.
Second step: note from the graphical structure that A mediates between G and Y. Since A is a deterministic function of G and A mediates between G and Y, we have I(G;Y)=I(A;Y).
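Spelled out (my own expansion, using the chain rule): since A is a deterministic function of G, the pair (G,A) carries exactly the same information as G, and since A mediates, I(G;Y|A) = 0, so

\[
I(G;Y) \;=\; I(G,A\,;Y) \;=\; I(A;Y) + I(G;Y\mid A) \;=\; I(A;Y).
\]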
Furthermore, we can achieve any distribution P[A] (to arbitrary precision) by choosing a suitable function A(G).
So, for the standard channel capacity problem, we have P[G,A,Y] = P[G] P[A|G] P[Y|A], and we can simplify the optimization problem:
(max_{P[A|G]} I(G;Y)) = (max_{P[A]} I(A;Y))
Note that this all applies directly to our conjecture, for the part where actions do not depend on observations.
That’s how we get the standard expression for channel capacity. It would be potentially helpful to do something similar in our problem, allowing for observation of O.
Our Problem
The step about determinism of A carries over easily: if A is not a deterministic function of G and O, then we can change A to read random bits from an independent part of G. That will make A a deterministic function of G and O without reducing I(G;Y).
The second step fails: A does not mediate between G and Y.
However, we can define a “Policy” variable
π:=(o↦A(G,o))
π is also a deterministic function of G, and π does mediate between G and Y. And we can achieve any distribution over policies (to arbitrary precision) by choosing a suitable function A(G,O).
So, we can rewrite our problem as
(max_{P[A|G,O]} I(G;Y)) = (max_{P[π]} I(π;Y))
In the context of our toy example: π has two possible values, (o↦o) and (o↦¯o). If π takes the first value, then Y is deterministically 0; if π takes the second value, then Y is deterministically 1. So, taking the distribution P[π] to be 50⁄50 over those two values, our generalized “channel capacity” is at least 1 bit. (Note that we haven’t shown that no P[π] achieves higher value in the maximization problem, which is why I say “at least”.)
Back to the general case: our conjecture can be expressed as
Δ = (max_{P[π]} I(π;Y)) − (max_{P[A]} I(A;Y)) ≤ H(O)
where the first optimization problem uses the factorization
P[π,O,Y] = P[π] P[O] P[Y|A=π(O),O]
and the second optimization problem uses the factorization
P[A,O,Y] = P[A] P[O] P[Y|A,O]
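As a numerical sanity check on all of this (my own code, not part of the original argument; the function and parameter names are just illustrative), here’s a sketch that evaluates both sides of Δ for a toy version of the password counterexample, with a 2-bit password and a 3-bit message, using the Blahut-Arimoto algorithm for the no-observation capacity. If I’ve set it up right, the gap comes out around 2.3 bits, already exceeding H(O) = 2 bits at this tiny scale.

```python
import numpy as np
from itertools import product

def blahut_arimoto(p_y_given_x, iters=500):
    """Capacity (in bits) of a discrete memoryless channel; rows index inputs."""
    n_x, _ = p_y_given_x.shape
    r = np.full(n_x, 1.0 / n_x)                   # input distribution over x
    for _ in range(iters):
        joint = r[:, None] * p_y_given_x          # p(x, y)
        p_y = joint.sum(axis=0)
        q = np.divide(joint, p_y, out=np.zeros_like(joint), where=p_y > 0)  # q(x|y)
        log_q = np.where(q > 0, np.log2(q), 0.0)
        r = np.exp2((p_y_given_x * log_q).sum(axis=1))
        r /= r.sum()
    joint = r[:, None] * p_y_given_x
    p_y = joint.sum(axis=0)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / np.outer(r, p_y)[mask])).sum())

n_pw, n_msg = 2, 3                                # password bits, message bits
passwords = list(range(2 ** n_pw))
messages = list(range(2 ** n_msg))
actions = list(product(passwords, messages))      # A = (password guess, message)

def env(a, o):
    guess, msg = a
    return msg if guess == o else 0               # Y = message iff the guess matches O

# No observation: channel A -> Y with the uniform password marginalized out.
p_y_given_a = np.zeros((len(actions), len(messages)))
for i, a in enumerate(actions):
    for o in passwords:
        p_y_given_a[i, env(a, o)] += 1.0 / len(passwords)
cap_no_obs = blahut_arimoto(p_y_given_a)

# With observation: channel pi -> Y. The policies "echo the observed password,
# send fixed message m", weighted uniformly, make Y uniform and fully determined
# by pi, so the capacity hits its ceiling H(Y) = n_msg bits exactly.
cap_with_obs = float(n_msg)

print(f"capacity without observation ~ {cap_no_obs:.3f} bits")
print(f"capacity with observation    = {cap_with_obs:.3f} bits")
print(f"Delta ~ {cap_with_obs - cap_no_obs:.3f} bits, vs H(O) = {n_pw} bits")
```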
I didn’t think much about the mathematical problem, but I think that the conjecture is at least wrong in spirit, and that LLMs are a good counterexample to the spirit. An LLM on its own is not very good at being an assistant, but you need pretty small amounts of optimization to steer the existing capabilities toward being a good assistant. I think about it as “the assistant was already there, with very small but not negligible probability”, so in a sense “the optimization was already there”, but not in a sense that is easy to capture mathematically.