The Ammann Hypothesis: Free Will as a Failure of Self-Prediction
A fox chases a hare. The hare evades the fox. The fox tries to predict where the hare is going—the hare tries to make itself as hard to predict as possible.
Q: Who needs the larger brain?
A: The fox.
This is a little animal tale meant to illustrate the following phenomenon:
Generative complexity can be much smaller than predictive complexity under partial observability. In other words, when we partially observe a black box, simple internal mechanisms can create complex patterns that require very large predictors to predict well.
Consider the following simple 2-state HMM, with hidden states A and B: from state A it emits a 0 and either stays in A or moves to B (with probability p); from state B it either emits a 0 and stays in B, or emits a 1 and moves back to A (with probability q).
Note that the symbol 0 is output in three different ways: A → A, A → B, and B → B. This means that if we see the symbol 0, we don’t know where we are. We can use Bayesian updating to guess where we are, but starting from the stationary distribution our belief states can become extremely complicated—in fact, the data sequence generated by this simple nonunifilar source has an optimal predictor HMM that requires infinitely many states.
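To make this concrete, here is a minimal Python sketch of such a generator, assuming the transition structure described above (p is the probability of switching A → B, q the probability of switching B → A; the parameter values and function name are only illustrative):

```python
import random

def generate(p, q, n_steps, seed=None):
    """Simulate the 2-state source described above.
    From A: emit 0, then stay in A (prob 1-p) or move to B (prob p).
    From B: emit 0 and stay in B (prob 1-q), or emit 1 and move to A (prob q).
    """
    rng = random.Random(seed)
    state = "A"
    symbols = []
    for _ in range(n_steps):
        if state == "A":
            symbols.append(0)          # both A -> A and A -> B emit a 0
            if rng.random() < p:
                state = "B"
        else:
            if rng.random() < q:
                symbols.append(1)      # B -> A is the only transition that emits a 1
                state = "A"
            else:
                symbols.append(0)      # B -> B emits a 0
    return symbols

print("".join(str(s) for s in generate(p=0.3, q=0.4, n_steps=40, seed=0)))
```

The entire generative mechanism is two states and two probabilities; all of the complexity shows up on the predictor’s side.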
This simple example illustrates the gap between generative complexity and predictive complexity, a generative-predictive gap.
I note that in this case the generative-predictive gap is intrinsic. The gap appears even (especially!) in the ideal limit of perfect prediction!
Free Will as generative-predictive gap
The brain is a predictive engine. That much is generally accepted. Now imagine an organism/agent endowed with a brain predicting the external world. To do well, it may be helpful to predict its own actions. What if this process has a predictive-generative gap? The brain will ascribe an inherent uncertainty [‘entropy’] to its own actions!
An agent with a generative-predictive gap in predicting its own actions would experience a mysterious force ‘choosing’ its actions. It may even decide to call this irreducible uncertainty of self-prediction “Free Will”.
************************************************************
[Nora Ammann initially suggested this idea to me. Similar ideas have been expressed by Steven Byrnes.]
In my opinion, this is a poor choice of problem for demonstrating the generator/predictor simplicity gap.
If we are not restricted to Markov-model-based predictors, we can do a lot better simplicity-wise.
A simple Bayesian predictor tracks one real-valued probability B in the range 0...1, the probability of being in state B; the probability of state A is implicitly 1-B.
This is initialized to B = p/(p+q), the prior given by the equilibrium probabilities of the A/B states after many time steps.
P("1") = q*B is our prediction, with P("0") = 1-P("1") implicitly.
Then update the usual Bayesian way:
if “1”: B := 0 (a known state transition to A, since only B → A emits a 1)
if “0”: A, B := (A*(1-p), A*p + B*(1-q)), then normalise by dividing both by their sum (the standard Bayesian update, discarding the falsified B → A transition)
In one step, after simplification: B := (p + B*(1-p-q)) / (1 - B*q)
That’s a lot more practical than needing infinitely many states. Numerical stability and acceptable accuracy for a real, implementable predictor are straightforward but not trivial to achieve. A near-perfect predictor is only slightly larger than the generator.
A perfect predictor can use 1 bit (have we ever observed a 1?) plus ceil(log2(n)) bits to count n, the number of zeroes observed in the current run, and from these compute the exactly correct prediction. Technically, as n → infinity this requires infinitely many bits, but the scaling is logarithmic, so a practical predictor will never need more than ~500 bits given known physics.
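For concreteness, here is a minimal sketch of the single-number Bayesian predictor described in this comment, under the same p/q parameterisation as above (the helper names and parameter values are only illustrative):

```python
def make_predictor(p, q):
    """Bayesian predictor that tracks a single real number:
    B = P(hidden state is B | observations so far)."""
    b = p / (p + q)                    # stationary prior on state B

    def predict():
        return q * b                   # P(next symbol is 1); P(0) = 1 - this

    def update(symbol):
        nonlocal b
        if symbol == 1:
            b = 0.0                    # a 1 only comes from B -> A, so we are now in A
        else:
            # unnormalised update (A, B) := (A*(1-p), A*p + B*(1-q)) with A = 1 - B,
            # then renormalise; in one step: B := (p + B*(1-p-q)) / (1 - B*q)
            b = (p + b * (1 - p - q)) / (1 - b * q)

    return predict, update

predict, update = make_predictor(p=0.3, q=0.4)
for sym in [0, 0, 0, 1, 0, 0]:        # a short illustrative observation run
    print(f"P(next = 1) = {predict():.3f}, then observed {sym}")
    update(sym)
```

Since observing a 1 resets the belief to 0 and every subsequent 0 applies the same deterministic map, the belief, and hence the prediction, depends only on the number of zeroes seen since the last 1; this is why the 1-bit-plus-counter predictor sketched above is exact.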
Yes—this is specifically staying within the framework of hidden Markov chains.
Even if you go outside that framework, though, it seems you agree there is a generative-predictive gap—you’re just saying it’s not infinite.
Eggsyntax below gives the canonical example of hash functions, where prediction is harder than generation; this holds for general computable processes.
See Can a Finite-State Fox Catch a Markov Mouse? for more details.
Eliezer made that point nicely with respect to LLMs here:
Consider that somewhere on the internet is probably a list of thruples: <product of 2 prime numbers, first prime, second prime>.
GPT obviously isn’t going to predict that successfully for significantly-sized primes, but it illustrates the basic point:
There is no law saying that a predictor only needs to be as intelligent as the generator, in order to predict the generator’s next token.
Indeed, in general, you’ve got to be more intelligent to predict particular X, than to generate realistic X. GPTs are being trained to a much harder task than GANs.
Same spirit: <Hash, plaintext> pairs, which you can’t predict without cracking the hash algorithm, but which you could far more easily generate typical instances of if you were trying to pass a GAN’s discriminator about it (assuming a discriminator that had learned to compute hash functions).
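As a minimal sketch of the <hash, plaintext> version of the point, with sha256 as a stand-in hash (the helper name is only illustrative):

```python
import hashlib
import secrets

def generate_pair():
    """Generating a typical <hash, plaintext> pair is cheap."""
    plaintext = secrets.token_hex(8)   # 16 hex characters of random plaintext
    digest = hashlib.sha256(plaintext.encode()).hexdigest()
    return digest, plaintext

digest, plaintext = generate_pair()
print(digest, "<-", plaintext)

# "Prediction" runs the other way: given only the digest, output the plaintext.
# Short of brute-forcing ~2**64 candidates, a predictor of such pairs has to
# effectively invert sha256, which is far harder than generating the pairs.
```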
I first heard this idea from Joscha Bach, and it is my favorite explanation of free will. I have not heard it called a ‘predictive-generative gap’ before, though; that framing is very well formulated, imo.