Two remarks.
Remark 1: Here’s a simple model of self-fulfilling prophecies.
First, we need to decide how Predict-O-Matic outputs its predictions. In principle, it could (i) produce the maximum-likelihood outcome, (ii) produce the entire distribution over outcomes, or (iii) sample an outcome from the distribution. But, since Predict-O-Matic is supposed to produce predictions for large-volume data (e.g. the inauguration speech of the next US president, or the film that will win the Oscar in 2048), the most sensible option is (iii). Option (i) can produce an outcome that has maximum likelihood but is extremely untypical (since every individual outcome has very low probability), so it is not very useful. Option (ii) requires somehow producing an exponentially large vector of numbers, so it’s infeasible. More sophisticated variants are possible, but I don’t think any of them avoids the problem.
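To see why option (i) is unhelpful for large-volume data, here is a minimal sketch (the 1000-bit toy distribution is my own example, not from the post): the single most likely outcome sets every bit to its individually most likely value, which makes it look nothing like a typical sample.

```python
import numpy as np

# Toy illustration (my own example): a "large volume" prediction modeled as
# 1000 independent bits, each equal to 1 with probability 0.6. The single most
# likely outcome sets every bit to its most likely value (all ones), yet a
# typical sample has only ~60% ones -- so the maximum-likelihood outcome is
# extremely untypical, which is why option (i) is not very useful.
rng = np.random.default_rng(0)
n, p = 1000, 0.6

ml_outcome = np.ones(n, dtype=int)     # option (i): the maximum-likelihood outcome
typical_sample = rng.random(n) < p     # option (iii): a sample from the distribution

print("fraction of ones in the ML outcome:  ", ml_outcome.mean())      # 1.0
print("fraction of ones in a typical sample:", typical_sample.mean())  # ~0.6
```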
If the Predict-O-Matic is a Bayesian inference algorithm, an interesting dynamic will result. On each round, some hypothesis will be sampled from the current belief state. If this hypothesis is a self-fulfilling prophecy, sampling it will cause its likelihood to go up. We get positive feedback: the higher the probability Predict-O-Matic assigns to the hypothesis, the more often it is sampled, the more evidence in favor of the hypothesis is produced, and the higher its probability becomes. So, if it starts out as sufficiently probable a priori, the belief state will converge there.
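As a toy illustration of this lock-in (my own sketch, with made-up numbers): take two hypotheses, each of which is a self-fulfilling prophecy, and watch the belief state lock onto one of them, most often the one the prior favors.

```python
import numpy as np

# Toy version of the feedback loop (my own sketch, with made-up numbers).
# Two hypotheses: H_A says "A happens 99% of the time", H_B says "B happens
# 99% of the time". The environment, which Predict-O-Matic does not model,
# is self-fulfilling: whichever outcome is predicted occurs with probability
# 0.99. Each round, a hypothesis is sampled from the belief state, a
# prediction is sampled from that hypothesis, and the belief state is updated
# on the observed outcome.
rng = np.random.default_rng(0)
P_A = {"H_A": 0.99, "H_B": 0.01}   # P(outcome = A) under each hypothesis
PRIOR_A = 0.7                      # prior probability of H_A

def run(rounds=50):
    w = PRIOR_A                                    # current belief in H_A
    for _ in range(rounds):
        hyp = "H_A" if rng.random() < w else "H_B"
        predict_A = rng.random() < P_A[hyp]        # prediction sampled from the sampled hypothesis
        outcome_A = rng.random() < (0.99 if predict_A else 0.01)  # the prediction tends to come true
        like_A = P_A["H_A"] if outcome_A else 1 - P_A["H_A"]
        like_B = P_A["H_B"] if outcome_A else 1 - P_A["H_B"]
        w = w * like_A / (w * like_A + (1 - w) * like_B)  # Bayesian update
    return w

finals = [run() for _ in range(1000)]
# Each run ends with the belief concentrated on one of the two self-fulfilling
# hypotheses; the prior-favored one wins in roughly the prior's share of runs.
print("runs ending with belief in H_A > 0.5:", np.mean([w > 0.5 for w in finals]))
```

The evidence the predictor accumulates here is largely of its own making, so the prior (plus luck in the early rounds), rather than any prediction-independent fact about the environment, decides where the belief state ends up.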
Of course, realistic learning algorithms do not perform exact Bayesian inference, but they have to approximate it in some sense. At the least, there has to be some large space of hypotheses s.t. if one of them is true, the algorithm will converge there. Any algorithm with this property probably displays the dynamics above.
Now, to the simple model. In this model we have just two outcomes: A and B (so it’s not large-volume data, but that doesn’t matter). On each round a prediction is made, after which some outcome occurs. The true environment works as follows: if prediction “A” is made, on this round A happens with probability 99% and B with probability 1%. If prediction “B” is made, on this round B happens with probability 100%. Of course, Predict-O-Matic is not aware that its predictions can influence outcomes. Instead, we will assume Predict-O-Matic is doing Bayesian inference with a prior over hypotheses, each of which assumes that the environment is IID. In other words, it is learning a single parameter p, which is the probability that A occurs on any given round.
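To make the setup concrete, here is a minimal sketch of one round of this model (the Beta(1, 1) prior, which puts positive probability on every interval in p-space, and the exact sampling scheme are my own illustrative choices):

```python
import numpy as np

# One round of the simple model (my own sketch; the Beta(1, 1) prior and the
# sampling scheme are illustrative choices). The environment reacts to the
# prediction, but Predict-O-Matic models the outcomes as IID Bernoulli(p) and
# keeps a Beta posterior over p.
rng = np.random.default_rng(0)

def environment(prediction):
    """If "A" is predicted, A happens with probability 99%; if "B" is
    predicted, B happens with probability 100%."""
    if prediction == "A":
        return "A" if rng.random() < 0.99 else "B"
    return "B"

alpha, beta = 1.0, 1.0   # Beta(1, 1) prior: pseudo-counts of A and B

for t in range(10):
    p = rng.beta(alpha, beta)                      # sample a hypothesis p from the belief state
    prediction = "A" if rng.random() < p else "B"  # option (iii): sample the prediction from it
    outcome = environment(prediction)
    if outcome == "A":                             # Bayesian update under the (mistaken) IID assumption
        alpha += 1
    else:
        beta += 1
    print(t, prediction, outcome, "posterior mean of p =", round(alpha / (alpha + beta), 3))
```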
Claim: If the prior is s.t. any interval in p-space is assigned positive probability, then Predict-O-Matic will converge to predicting B with frequency 1.
Sketch of proof: If Predict-O-Matic converges to predicting B with frequency at least f, for some f < 1, then the environment converges to producing outcome B with frequency f′ = 0.01(1−f) + f > f, which implies that Predict-O-Matic converges to predicting B with frequency at least f′. Iterating this argument, the only frequency consistent with convergence is 1.
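The recurrence from the sketch can also be iterated numerically (my own illustration): since 1 − f′ = 0.99(1 − f), the gap to 1 shrinks geometrically, so the guaranteed frequency of predicting B tends to 1.

```python
# Iterate the map f -> 0.01*(1 - f) + f from the proof sketch.
# Since 1 - f' = 0.99*(1 - f), the gap to 1 decays geometrically,
# and the only fixed point is f = 1.
f = 0.0
for step in range(1000):
    f = 0.01 * (1 - f) + f
print(f)  # ~0.99996 after 1000 steps, approaching 1
```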
Remark 2: Some of the hypotheses in the prior might be intelligent agents in their own right, with their own utility functions. Such an agent can intentionally produce correct predictions to increase its probability in the belief state, until a “treacherous turn” point, when it produces a prediction designed to have irreversible consequences in the outside world that favor the agent. If the treacherous prediction is not a self-fulfilling prophecy, it will cause Predict-O-Matic to update against the agentic hypothesis, but by then it might be too late. If it is a self-fulfilling prophecy, it will only make the agentic hypothesis even stronger.
Moreover, there is a mechanism that systematically produces such agentic hypotheses. Namely, a sufficiently powerful predictor is likely to run into “simulation hypotheses” i.e. hypotheses that claim the universe is a simulation by some other agent. As Christiano argued before, that opens an attack vector for powerful agents across the multiverse to manipulate Predict-O-Matic into making whatever predictions they want (assuming Predict-O-Matic is sufficiently powerful to guess what predictions those agents would want it to make).