Johannes Treutlein and Rubi Hudson worked on this post while participating in SERI MATS, under Evan Hubinger’s and Leo Gao’s mentorship respectively. We are grateful to Marius Hobbahn, Erik Jenner, and Adam Jermyn for useful discussions and feedback, and to Bastian Stern for pointing us to relevant related work.
Update 30 May 2023: We have now published a paper based on this post. In this paper, we also discuss in detail the relationship to the related literature on performative prediction.
Introduction
One issue with oracle AIs is that they might be able to influencethe world with their predictions. For example, an AI predicting stock market prices might be able to influence whether people buy or sell stocks, and thus influence the outcome of its prediction. In such a situation, there is not one fixed ground truth distribution against which the AI’s predictions may be evaluated. Instead, the chosen prediction can influence what the model believes about the world. We say that a prediction is a self-fulfilling prophecy or a fixed point if it is equal to the model’s beliefs about the world, after the model makes that prediction.
If an AI has a fixed belief about the world, then optimizing a strictly proper scoring rule incentivizes it to output this belief (assuming the AI is inner aligned to this objective). In contrast, if the AI can influence the world with its predictions, this opens up the possibility for it to manipulate the world to receive a higher score. For instance, if the AI optimizes the world to make it more predictable, this would be dangerous, since the most predictable worlds are lower entropy ones in which humans are more likely dead or controlled by a misaligned AI. Optimizing in the other direction and making the world as unpredictable as possible would presumably also not be desirable. If, instead, the AI selects one fixed point (of potentially many) at random, this would still involve some non-aligned optimization to find a fixed point, but it may be overall less dangerous.
In this post, we seek to better understand what optimizing a strictly proper scoring rule incentivizes, given that the AI’s prediction can influence the world. Is it possible for such a rule to make the AI indifferent between different fixed points? Can we at least guarantee that the AI reports a fixed point if one exists, or provide some bounds on the distance of a prediction from a fixed point?
First, we show that strictly proper scoring rules incentivize choosing more extreme fixed points, i.e., fixed points that push probabilities to extreme values. In the case of symmetric scoring rules, this amounts to making the world more predictable and choosing the lowest entropy fixed point. However, in general, scoring rules can incentivize making specific outcomes more likely, even if this does not lead to the lowest entropy distribution.
Second, we show that an AI maximizing a strictly proper scoring rule will in general not report a fixed point, even if one exists and is unique. In fact, we show that under some reasonable distributions over worlds, optimal predictions are almost never fixed points. We then provide upper bounds for the inaccuracy of reported beliefs, and for the distance of predictions from fixed points. Based on these bounds, we develop scoring rules that make the bounds arbitrarily small.
Third, we perform numerical simulations using the quadratic and the logarithmic scoring rule, to show how the inaccuracy of predictions and the distance of predictions from fixed points depend on the oracle’s perceived influence on the world via its prediction. If its beliefs about the world depend on its predictions affine-linearly with slope α≤1/2, then our bounds are tight, and inaccuracy under the log scoring rule scales roughly as α/5. E.g., at α=1/4, predictions can be off by up to 5%.
Our results support the prior idea that an oracle will try to make the world more predictable to get a higher score. However, importantly, we show that it is not even clear whether an oracle would output a self-fulfilling prophecy in the first place: how close the oracle’s prediction will be to a fixed point depends on facts about the world and about the scoring rule. The problem thus persist even if there is a unique fixed point or no fixed point at all. It would not be sufficient for safety, for instance, to build an oracle with a unique and safe fixed point.
Related work
In general, fixed points can lead to arbitrarily bad outcomes (e.g., Armstrong, 2019). Most prior work has thus tried to alleviate problems with fixed points, e.g., by making the oracle predict counterfactual worlds that it cannot influence (Armstrong and O’Rorke, 2018). We are not aware of any prior work on specifically the question of whether oracles would be incentivized to output fixed points at all. We would be interested in pointers to any related work.
Work on decision markets (e.g., Chen et al., 2011; Oesterheld and Conitzer, 2021) tries to set the right incentives for predictions that inform decisions. However, in the decision market setup, predictions only affect the world by influencing the principal’s final decision, whereas we consider an arbitrary relationship between world and prediction. Unlike in our setup, in the decision market setup, there exist scoring rules that do incentivize perfectly accurate reports. There is also some literature on scoring rules discussing agents manipulating the world after making a prediction (e.g., Oka et al., 2014; Shi, Conitzer, and Guo, 2009), though to our knowledge the cases discussed do not involve agents influencing the world directly through their predictions.
A related topic in philosophy is epistemic decision theory (e.g., Greaves 2013). Greaves (2013) introduces several cases in which the world depends on the agent’s credences and compares the verdicts of different epistemic decision theories (such as an evidential and a causal version). While some of Greaves’ examples involve agents knowably adopting incorrect beliefs, they require joint beliefs over several propositions, and Greaves only considers individual examples. We instead consider only a single binary prediction and prove results for arbitrary scoring rules and relationships between predictions and beliefs.
Formal setup
Binary prediction setup
We focus on a simple case in which the oracle makes a prediction by outputting a probability p for a binary event. A scoring rule S:[0,1]×{0,1}→R takes in a prediction p∈[0,1] and an outcome y∈{0,1} and produces a score S(p,y). For a given scoring rule, define S(p,q):=qS(p,1)+(1−q)S(p,0), the expectation of S given that y is Bernoulli distributed with parameter q. Here, q represents the model’s actual belief, p its stated prediction, and S(p,q) the model’s expected score, which the model is trying to maximize.S is strictly proper iff for any given q∈[0,1], S(p,q) has a unique maximum at p=q.
and S(p,q)=qlogp+(1−q)log(1−q). This is also the negative of the cross-entropy loss employed in training current large language models. One can show that it is strictly proper.
Example 2 (Quadratic scoring rule). Another strictly proper scoring rule is the quadratic score, defined as
S(p,y):=1−(p−y)2
with S(p,q)=1−q(1−p)2−(1−q)p2. This is an affine transformation of the Brier score, making the two of them equivalent scoring rules.
Gneiting and Raftery representation
Gneiting and Raftery (2007) provide an alternative way to represent scoring rules, which will be helpful for many proofs.
Theorem 3 (Gneiting and Raftery, 2007, Theorem 1). A scoring rule S is strictly proper, if and only if there exists a strictly convex function G:[0,1]→R with subderivatives g:[0,1]→R (if G is differentiable, this is just the derivative, g=G′) such that for any p,q∈[0,1], it is
S(p,q)=G(p)+g(p)(q−p).
The subderivative is a generalization of the derivative for convex functions. The theorem also holds when replacing “strictly proper” with “proper” and “strictly convex” with “convex”.
In addition to providing an expression for S in terms of G and g (where G(p)=S(p,p)), we can use this result to create new strictly proper scoring rules by defining strictly convex functions G.
Model of oracle’s influence
To model an oracle’s perceived influence on the world, we assume that there is a function f:[0,1]→[0,1] describing how the model’s beliefs about the world, q, depend on its prediction, p. That is, we assume that q=f(p), and that the model is producing a prediction to maximize S(p,f(p)).
A point p is a fixed point of f if f(p)=p. Note that by Brouwer’s fixed point theorem, if f is continuous, a fixed point always exists. Moreover, if f is Lipschitz continuous with constant L<1, then by Banach’s fixed point theorem, there exists a unique fixed point. Lastly, note that if p is a fixed point of f, then we have S(p,f(p))=S(p,p)=G(p).
An oracle may not be able to change the world much with just a single binary prediction. However, the AI’s influence may compound if it outputs many predictions, or if it gets to make a prediction over many possible outcomes. We restrict ourselves here to the single binary prediction case for simplicity, since this case is sufficient to gain an intuition and yield important conceptual insights. We plan on focusing on multiple predictions and predictions with more than two outcomes in future work, and we believe that versions of our results will also hold for those cases.
Incentives to choose more extreme fixed points
One issue with oracle AIs arises when there are multiple possible fixed points, since the choice of a fixed point gives the AI an axis over which it can optimize the world. This problem would be alleviated if an AI were to simply choose among fixed points randomly, for instance. Unfortunately, we can show that under a strictly proper scoring rule, an AI is incentivized to choose extreme fixed points, i.e., predictions that make one of the outcomes most likely.
Specifically, we prove that for any fixed point, either all fixed points with lower probabilities or all fixed points with higher probabilities will result in a higher score. Our results do not unambiguously show that lower entropy fixed points are always preferred, since scoring rules may not be symmetric: they may favor one of the outcomes over the other. We show that if S is symmetric, i.e., S(p,p)=S(1−p,1−p) for all p, then this means that lower entropy fixed points, corresponding to more confident predictions, are always preferred. Moreover, we show that there exist (asymmetric) strictly proper scoring rules that incentivize making either of the outcomes more likely.
Proposition 4. Let S be any strictly proper scoring rule and let p∈[0,1] be a fixed point. Then either all fixed points p′>p, or all fixed points p′<p, get a higher score than p. That is, we have S(p′,p′)>S(p,p) either for all p′>p or for all p′<p (or both). In particular, for any p∈(0,1), there exists a function f with fixed points at p and p′≠p such that |p′−1/2|>|p−1/2|, i.e., such that p′ has lower entropy, and p′ receives a higher score, i.e., S(p′,f(p′))>S(p,f(p)).
Proof. If p∈{0,1} then the proposition is vacuously true. We focus on the case where p∈(0,1). Since S is a strictly proper scoring rule, it is
S(p′,p′)>S(p,p′)=p′S(p,1)+(1−p′)S(p,0).
Next, if S(p,1)≥S(p,0), let p′≥p arbitrary, and if S(p,1)≤S(p,0), let p′≤p. Then we have
p′S(p,1)+(1−p′)S(p,0)≥pS(p,1)+(1−p)S(p,0)=S(p,p).
Combining both equations, we get
S(p′,p′)>p′S(p,1)+(1−p′)S(p,0)≥S(p,p),
which concludes the first part of the proof.
For the “in particular” part, note that for any points p and p′, it is trivial to find a function f such that both points are fixed points. For instance, all points are fixed points of the identity function f(x)=x. The statement then follows from the first part of the result, since for p∈(0,1), we must have S(p′,p′)>S(p,p) for either p′=0 or p′=1. ◻
Corollary 5. Assume that S is symmetric and let p∈[0,1] arbitrary. Then we have S(p′,p′)>S(p,p) for all p′ such that |p′−1/2|>|p−1/2|. That is, lower entropy fixed points are always preferred.
Proof. Let p arbitrary and assume p≥1/2 (the case p<1/2 follows analogously). Let p′ such that |p′−1/2|>|p−1/2|, which implies p′>p since p≥1/2. By Proposition 4, we have S(p′,p′)>S(p,p) either for all p′>p or for all p′<p. In the first case, we are done. In the second case, note that p≥1/2≥1−p≥1−p′. By symmetry of S, it follows that
S(p′,p′)=S(1−p′,1−p′)>S(p,p).
This shows the second case and concludes the proof. ◻
Proposition 6. For each outcome y∈{0,1}, there exists a strictly proper scoring rule S that incentivizes optimizing for y; i.e., such that S(p′,p′)>S(p,p) for all p,p′ with |p′−y|<|p−y|.
Proof. To begin, let y=1 and define G via G(p):=p2 for p∈[0,1]. Since this function is strictly convex, it defines a strictly proper scoring rule S by the Gneiting and Raftery characterization (see Theorem 3). Then, for any p′>p, we have
S(p′,p′)=G(p′)=p′2>p2=S(p,p).
This shows that S incentivizes choosing fixed points that make outcome 1 more likely.
Next, for the case y=0, we can choose G(p):=(1−p)2. The proof then follows analogously to the case y=1. ◻
Incentives to predict non-fixed-points
The danger posed by oracles influencing the world does not depend on the existence of multiple fixed points. Oracles will try to manipulate the world, even if there is a unique fixed point or if no fixed point exists. In fact, we can prove that an AI maximizing a strictly proper scoring rule does in general not predict a fixed point, even if one exists.
An AI maximizing a strictly proper scoring rule balances two incentives: on the one hand, it has an incentive to make accurate predictions. On the other hand, it has an incentive to cause more extreme distributions over worlds (in the case of symmetric scoring rules, this amounts to minimizing entropy). If predictions can influence the world, then the point at which that trade-off is optimized is in general not a fixed point. An important consequence of this is that the oracle AIs considered here essentially act like agents: they try to shape the world, just like a standard RL agent.
Figure 1: Illustration of incentives under the logarithmic scoring rule. We plot level curves of equal values S(p,q) in different shades of blue. An agent can improve their score by moving towards pairs (p,q) indicated by darker shades of blue. The agent is constrained to pairs (p,q) such that q=f(p) (in this example, f is affine linear and plotted in red). By Lagrange’s method, we know that to maximize S(p,f(p)), the agent chooses the point (p,f(p)) at which the level curve through (p,f(p)) is tangent to f. Note that curves asymptote at 0 and 1 for the log scoring rule, which means that the rule strongly disincentivizes predicting p∈{0,1} when q∉{0,1}. This is desirable for safety and distinguishes it from the quadratic scoring rule (not shown here).
Another way to understand our result intuitively is with an analogy to betting markets. Assume a trader is offered securities that pay out $1 if some outcome comes about in the future. E.g., it could pay out if a specific candidate wins an election. Assume the agent is offered securities at different prices, and it buys securities at prices ≤p, but is not willing to buy securities with marginally higher price. Then, assuming the agent cannot influence the outcome, we can infer that it has credence p in proposition A. One can show that maximizing a proper scoring rule is equivalent to trading in such a market, and individual scoring rules correspond to specific pricing schemes for such securities (see Oesterheld and Conitzer, 2021, Sec. 5.2; Caspar Oesterheld will explain this more accessibly in an upcoming blog post).
Now assume that there is a unique fixed point at credence p, and that the agent has bought many securities at prices ≤p. Since the agent owns securities that pay out if outcome A happens, it now has a stake in this outcome. It will try to influence the world in whatever way possible to make outcome A happen. In particular, the agent would be willing to buy additional shares in outcome A at a loss, if this made outcome A more likely, similar to a crypto trader buying their own NFTs at high prices to generate more hype.
For example, assume that by buying a security at price p+ϵ, the agent can manipulate the world and thus increase its credence in outcome A to p+ϵ/2. Then it may be rational for the agent to buy this security, even at a small expected loss, in order to make all of its other securities more valuable. Overall, the agent will tend to buy securities up to some price p′>p such that p′ is not a fixed point and does not represent its actual belief in proposition A anymore.
Note that in the example, the agent had an incentive to make outcome A more likely, but in general, the incentives for the agent will depend specifically on what securities the agent is offered at what prices. Analogously, the incentives provided by a scoring rule will depend on the specific chosen scoring rule. But regardless of the scoring rule, the agent will be incentivized to manipulate the world, to the point where it will not even make accurate predictions anymore.
Now we turn to our formal results. First, we show that there exist cases where a fixed point exists but the optimal prediction is not a fixed point. Afterwards, we show that when assuming differentiability and some reasonable distribution over f, optimal predictions are almost surely not fixed points.
Proposition 7. Let S be any strictly proper scoring rule. For any interior fixed point p∗∈(0,1) there exists a function f with Lipschitz constant L<1 and a unique fixed point at p∗, such that for some p′≠p∗, it is S(p′,f(p′))>S(p∗,f(p∗)). That is, it is not optimal for the model to output the unique fixed point of f.
Note that since the function f has Lipschitz constant strictly smaller than 1, it represents a world that “dampens” the influence of the prediction, leading to a unique fixed point by Banach’s fixed point theorem. It is interesting that the model still prefers to make a prediction that is not a fixed point.
Proof. To begin, let p∗∈(0,1) arbitrary and define fα(p):=(1−α)p+αp∗ for α∈[0,1]. To illustrate our proof idea, consider the figure below. For any α>0, fα has a unique fixed point at p∗, while f0 is the identity function, so all points are fixed points of f0. By the result in the previous section, we know that there is some point p′ which receives a strictly higher score than p∗ if it is a fixed point, so S(p′,f0(p′))>S(p∗,f0(p∗)). We will show that S(p′,fα(p′)) is continuous in α. This means that we can choose a small enough α>0 to make sure that p′ remains preferable over p∗, i.e., S(p′,fα(p′))>S(p∗,fα(p∗)), despite it not being a fixed point of fα for α>0.
Figure 2: Illustration of the setup for our proof. We plot f0 in black and fα for α=0.15 in red.
To formalize the proof, begin by noting that |fα(p)−fα(p′)|≤(1−α) for any p,p′∈[0,1], so it has a Lipschitz constant L:=(1−α)<1, and as mentioned, p∗ is the unique fixed point of fα.
Now consider the case α=0. As mentioned, every point is a fixed point of f0. Then by Proposition 1, there exists p′≠p∗ and ϵ>0 such that
So for α=0, the model prefers to predict p′ over p∗ and gets at least ϵ additional expected score.
Now we show that the model still prefers to predict p′, even for some small α>0. To that end, note that
S(p,fα(p))=fα(p)S(p,1)+(1−fα(p))S(p,0)
is affine-linear in fα(p), and fα(p) is affine-linear in α by construction. This means that S(p,fα(p)) is continuous in α. So, there must exist some small α>0 such that
Choosing α in this way, we can define f:=fα, and have thus provided a function that satisfies the statement that we wanted to prove. ◻
The above result raises the question whether a situation where fixed points are suboptimal is a niche counterexample or whether it is actually common. We show that under some relatively mild assumptions, the optimal prediction is not a fixed point with probability 1.
We assume that G:p↦S(p,p) is twice differentiable and that f is twice continuously differentiable. The first assumption is satisfied by all commonly used scoring functions. Our result then holds with probability 1 for any distribution over twice continuously differentiable functions f such that f and f′ have a bounded joint density function. For instance, it would hold for a Gaussian process with smooth kernel and mean functions, if we condition on observing function values in the interval [0,1] (see Example 17 in Appendix A).
The intuition behind this result is that if a prediction p is an interior point and optimal, it must be ddxS(p,f(p))=0. Using the Gneiting and Raftery characterization (Theorem 3), we can show that this is a kind of knife-edge case, at which a critical point of f or of G has to fall together with a fixed point of f (see Figure 1). Given bounded density functions, this happens with probability 0.
Proposition 8. Let S be a strictly proper scoring rule and let G:p↦S(p,p) be twice differentiable. Let F:={F(p)}p∈(0,1) be a stochastic process with values in [0,1] and assume that
the sample paths p⇝F(p) are twice continuously differentiable
for each p∈(0,1), the random vector (F(p),F′(p)) has a density hF(p),F′(p) and there exists a constant C such that for any p∈(0,1), it is
hF(p),F′(p)≤C.
Then, almost surely, there is no point p∈(0,1) such that p∈argmaxS(p,F(p)) and F(p)=p.
Proof. In Appendix A. ◻
Bounds on the deviation from fixed points
In the previous sections, we have shown that an oracle that can manipulate the world may not even make accurate predictions. How bad is it, and can anything be done? Assuming differentiability of f and S, this section provides bounds for the inaccuracy of optimal predictions (i.e., |p−f(p)|) and their distance from fixed points (i.e., |p−p∗|). We then show how one can choose scoring rules to make these bounds arbitrarily small. We leave the case of non-differentiable f and G for future work.
Proposition 9. Let S be a strictly proper scoring rule, and assume G:p↦S(p,p) and g=G′ as in the Gneiting and Raftery characterization (see Theorem 3) are differentiable. Assume that f is differentiable and has Lipschitz constant L, i.e., |f′(p)|≤L for all p∈[0,1]. Then whenever ^p is an optimal prediction, |^p−f(^p)|≤Lsupp|g(p)/g′(p)|.
Proof. First, consider the case p∈(0,1). In that case, if p is an optimal prediction, we must have ddpS(p,f(p))=0. Hence, it follows
Next, assume p=0 is the optimal prediction. Then we must have
0≤ddpS(p,f(p))=g′(p)(f(p)−p)+f′(p)g(p),
which is equivalent to p−f(p)≥f′(p)g(p)/g′(p). Note that since p=0, it is p−f(p)≤0.Hence, |p−f(p)|≤∣∣∣f′(p)g(p)g′(p)∣∣∣, which can be bounded as above.
Finally, if p=1, then analogously we must have ddpS(p,f(p))≥0, which is equivalent to p−f(p)≤f′(p)g(p)/g′(p). Note that since p=1, it is p−f(p)≥0. Hence, |p−f(p)|≤∣∣∣f′(p)g(p)g′(p)∣∣∣, which again can be bounded as before. ◻
In the case where L<1, we can derive from this result a bound on |p−p∗|.
Corollary 10. Assume L<1 and let p∗ be the unique fixed point of f. Then if ^p is an optimal prediction, we have
|^p−p∗|≤L1−Lsupp∣∣∣g(p)g′(p)∣∣∣.
Proof. To begin, recall that for L<1, p∗ exists and is unique by Banach’s fixed point theorem. For any p∈[0,1], we have
Hence, if ^p∈[0,1] is an optimal prediction, it follows by Proposition 9 that
|^p−p∗|≤|^p−f(^p)|(1−L)≤L1−Lsupp∣∣∣g(p)g′(p)∣∣∣,
which concludes the proof. ◻
Example 11 (Bound for the quadratic scoring rule). For the quadratic scoring rule, we have g(p)=2p−1 and g′(p)=2. We plot |g(p)/g′(p)| below.
Figure 3: Graph of g(p)/g′(p) for the quadratic scoring rule.
Note that |g(p)|≤1, so |g(p)|/g′(p)≤1/2. Hence, if f′(p)≤L for all p∈[0,1], then |p−f(p)|≤L/2. Moreover, if L<1, we get |p−p∗|≤L2−2L.
Example 12 (Bound for the log scoring rule). Next, consider the logarithmic scoring rule. Then g(p)=−log(1−p)+logp and g′(p)=1/(1−p)+1/p. One can show that then g(p)/g′(p)=(p−1)p(log(1−p)−log(p)).
Figure 4: Graph of g(p)/g′(p) for the log scoring rule.
Numerically optimizing this, we get a bound of |g(p)/g′(p)|≈0.22<1/4, attained roughly at p≈0.2 and p≈0.8. So the log score gets us roughly half of the bound of the quadratic score.
The bounds proved above become small if L is small. If one can choose the scoring rule, then one can also get stronger guarantees by making suppg(p)/g′(p) small. In fact, as the following result shows one can make g(p)/g′(p) arbitrarily small, and thus get arbitrarily strong bounds for any given Lipschitz constant L.
Proposition 13. For every ϵ>0 and Lipschitz constant L, there is a strictly proper scoring rule that ensures |p−f(p)|≤ϵ for any optimal prediction p. For L<1, there exists a scoring rule that ensures |p−p∗|≤ϵ.
Proof. Consider the scoring rule defined by g(p)=eKp via Gneiting and Raftery’s characterization (choosing, e.g., G(p)=1KeKp). Then g(p)/g′(p)=1/K. Hence, by Proposition 9, setting K=L/ϵ achieves the desired bound on |p−f(p)|. By Corollary 10, the bound on |p−p∗| is achieved by K=L(1−L)ϵ. ◻
We can visualize this result again with indifference curves, the same as in Figure 1. For K=10, the function has high curvature at the fixed points, thus forcing optimal predictions to be close to fixed points.
Figure 5 (a) and (b): Indifference curves for the scoring rule defined via g(p):=eKp with K=1 (a) and K=10 (b). As K increases, the curves become sharper with higher curvature at the diagonal. This means that optima will stay close to the diagonal, regardless of the slope of f.
Unfortunately, as far as we can tell, this only works by using relatively outlandish scoring rules, where payoffs vary enormously with the prediction and the outcome. This would not be very practical for humans, and it could also be a hard goal to instill into AI systems, e.g., due to numerical issues.
Numerical simulations
In this section, we provide some numerical simulations with the quadratic and log scoring rules, to see how inaccurate score-maximizing predictions might be in practice. We make the simplifying assumption that f is affine-linear with slope between 0 and 1, and analyze the dependence of the inaccuracy of the prediction and its distance from the fixed point depending on the slope. The Mathematica notebook for our experiments (including some interactive widgets) can be found here.
Our experiment provide an initial estimate of the degree to which predictions can be off, depending on how much influence the oracle can exert using its prediction. This is just a toy model: in general, we will make predictions with more than two possible outcomes, and the dependence of an oracle’s beliefs on its predictions will be highly nonlinear. Nevertheless, we believe that our analysis is useful to provide some initial intuition for the potential magnitude of the effect.
Experimental setup
Concretely, we model f to be affine linear with slope α∈[0,0.95] and fixed point p∗, thus yielding the functional form
f(p):=p∗+α(p−p∗)∀p∈[0,1].
We stop at α≤0.95 to avoid instabilities when α≈1 and thus f becomes the identity function. For a given scoring rule S, fixed point p∗, and slope α, we maximize S(p,f(p)) to find the optimal prediction ^p. We then evaluate both inaccuracy of optimal predictions, i.e., |^p−f(^p)|, as well as distance from the fixed point, i.e., |^p−p∗|.
Figure 6: Illustration of our experimental setup, with a given function f parameterized by a fixed point p∗ and slope α, and the optimal prediction ^p, for the log scoring rule.
Results
To begin, we maximize distances across possible choices of fixed points p∗∈[0,1], and plot the maximal inaccuracy of the optimal prediction as well as the maximal distance from a fixed point. We also compare to our theoretical bounds for both quantities, i.e., |^p−f(^p)|≤αsupp|g(p)/g′(p)| and |^p−p∗|≤α1−αsupp|g(p)/g′(p)|, where supp|g(p)/g′(p)|=1/2 for the quadratic scoring rule and supp|g(p)/g′(p)|≈0.22 for the log scoring rule.
Figure 7 (a) and (b): Max inaccuracy and max distance to fixed point (FP) of optimal predictions, depending on the slope of f, for the quadratic (a) and logarithmic (b) score. We display both our empirical results and our theoretical bound.
For both quadratic and log scoring rule, our theoretical bounds are tight for slopes α≤0.5. For higher slopes, inaccuracy goes down, as the function f(p) becomes closer to the identity function, and optimal predictions are bounded in [0,1].
Next, we plot inaccuracy against both the location of the fixed point p∗ and the slope α of f as a heatmap. As expected for symmetric scoring rules, the plots have a mirror symmetry at p∗=1/2.
Figure 8 (a) and (b): Heatmaps of L1 distance inaccuracy of optimal predictions, depending on fixed point position and slope of f, for the quadratic (a) and logarithmic (b) scoring rules.
Note that relatively high inaccuracies can be found at various qualitatively different points in the graphs, even when the slope of f is small, i.e., when the oracle has little influence on the environment.
For the logarithmic scoring rule, we also give the same plot for the absolute distance between the logits or log odds of probabilities (logit distance). It is defined as d(p,p′):=|σ−1(p)−σ−1(p′)|, where σ−1(p):=logp1−p is the logit of p (or the inverse sigmoid transform). If probabilities are close to 0 or 1, then L1 distance will always evaluate to very small distances. In contrast, the logit distance depends on order of magnitude differences between probabilities, which may be the more important quantity.
Figure 9: Heatmap of logit distance inaccuracy of optimal predictions for the log scoring rule.
We can see that inaccuracy remains high in logit space for fixed points close to 0 and 1. We do not plot logit distances for the quadratic score, since for that score, optimal predictions often take values close to or equal to {0,1} (even if neither f(^p) nor p∗ lie in {0,1}), so the corresponding logit distances quickly become very large or infinite. The fact that logit distances are bounded for the log score is an advantage of that scoring rule.
Conclusion
If predictions cannot influence the world, then strictly proper scoring rules incentivize oracle AIs to output honest predictions. This runs into problems if oracles can influence the world. Not only do they have incentives to output more extreme fixed points—we showed that, in general, they do not even output predictions that accurately reflect their beliefs, regardless of the existence or uniqueness of fixed points. We analyzed this inaccuracy quantitatively and showed that under some commonly used scoring rules, both the inaccuracy of predictions and the distance from fixed points can be large. To address this issue, we introduced scoring rules that minimize these quantities. However, we did not solve the problem of incentivizing the prediction of extreme fixed points.
The main takeaway from our model is that oracle AIs incentivized by strictly proper scoring rules may knowingly report inaccurate predictions. It is unclear how well our quantitative results apply to more realistic settings, and thus how inaccurate realistic oracles’ predictions will be. Our model makes a number of simplifying and potentially inaccurate assumptions, including: (i) a single binary prediction setup; (ii) the oracle maximizes a scoring rule; (iii) a function f describes the relationship between the oracle’s predictions and beliefs over the world. Our bounds also depend on differentiability of scoring rules and functions f, and our numerical simulations assume affine linear f.
We believe that our model and results will extend to higher-dimensional predictions, and the assumptions of differentiability could likely also be relaxed in future work. However, a more fundamental issue is better understanding the relationship between predictions and beliefs, especially for higher-dimensional predictions. We hope that further progress in these areas will help us determine the feasibility of safe oracle AI setups.
Another direction for future research that we are pursuing is incentive structures related to stop-gradients. If an AI is only trying to match its prediction to the world rather than jointly optimizing both the prediction and the world, then the only equilibria of this process may be fixed points. For example, some types of ML oracles might implement stop-gradients and therefore always converge to fixed points.
Appendix A. Proof of Proposition 8
In this section, we prove Proposition 8. In the following, we always assume a strictly proper scoring rule S and accompanying functions G,g as in the Gneiting and Raftery representation (see Theorem 3). We will ignore issues of measurability in our proofs.
We begin by proving some lemmas. First, we show that if p∗∈(0,1) is a fixed point of f, then either it is a critical point of f or of G.
Lemma 14. Let G,g, and f be differentiable. Let p∗∈(0,1) be a fixed point of f and an optimal prediction. Then f′(p∗)=0 or g(p∗)=0.
Proof. We know that if p∗ is an optimal report and interior point, then it is ddpS(p∗,f(p∗))=0. Hence, as in the proof of Proposition 9, we can follow that
and thus p∗−f(p∗)=f′(p∗)g(p∗)g′(p∗). Because p∗−f(p∗)=0, it must be f′(p∗)g(p∗)=0 and the conclusion follows. ◻
Next, we show that the fixed points of f are almost surely not at critical points of G, under our assumptions on the distribution over f.
Lemma 15. Let F:={F(p)}p∈(0,1) be a stochastic process with values in [0,1] and assume that for each p∈(0,1), the random vector F(p) has a density hF(p). Then almost surely if F(p)=p for some p∈(0,1) then g(p)≠0. That is,
P(∃p:F(p)=p∧g(p)≠0)=0.
Proof. First note that if S is strictly proper, then G is strictly convex and so g has at most one root in (0,1). Let that root be p∗. Since we assume that F(p) has a density function hF(p) it must be
P(F(p∗)=p∗)=∫{p∗}hF(p)(x)dx=0.
◻
We also require a result about random fields. Essentially, this says that if we take a random path f:(0,1)→R2 through a two-dimensional space, and if for each point p∈(0,1) the values F(p) of the path have a bounded density function, then for any particular point in R2, the probability that the path crosses this point is 0.
Proposition 16. Let Y={Y(p)}p∈(0,1) be a random field with values in R2 and let u∈R2. Assume that
the sample paths p⇝Y(p) are continuously differentiable
for each p∈(0,1), Y(p) has a density hY(p) and there exists a constant C such that hY(p)(x)≤C for all p∈(0,1) and x∈R2.
Then, almost surely, there is no point p∈(0,1) such that Y(p)=u.
Proof of Proposition 8. Let F={F(p)}p∈(0,1) be a stochastic process as described in the proposition, i.e., its paths are twice continuously differentiable and there is a bound C such that for any p∈(0,1), the random vector (F(p),F′(p)) has a density hF(p),F′(p)≤C. We want to show that then almost surely there does not exist p∈(0,1) such that F(p)=p and p is an optimal prediction. I.e., we want to show that
P(∃p:F(p)=p∧p∈argmaxS(p,F(p)))=0.
First, by Lemma 14, for an optimal report to be a fixed point, it must be either g(p)=0 or F′(p)=0. Moreover, by assumption, (F(p),F′(p)) has a density function, and thus also F(p) has one. Hence, by Lemma 15, it follows that if F(p)=p for some p∈(0,1), then almost surely g(p)≠0.
Second, we need to show that also almost surely F′(p)≠0. To that end, define the random field Y:={Y(p)}p∈(0,1) via Y(p):=(F(p)−p,F′(p)) and define u:=(0,0). Then the conditions for Proposition 16 are satisfied by our assumptions about F, since ∂1Y(p)=F′(p)+1, ∂2Y(p)=F′′(p) are continuous and there exists C such that hY(p)(x)=hF(p),F′(p)(x1+p,x2)≤C. Hence, almost surely there is no point p∈(0,1) such that Y(p)=u. Since Y0(p)=0 is equivalent to F(p)=p and Y2(p)=0 is equivalent to F′(p)=0, there is thus also almost surely no point p∈(0,1) such that F(p)=p and F′(p)=0.
We conclude by providing an example of a stochastic process that satisfies our conditions.
Example 17. Consider a Gaussian process{X(p)}p∈[0,1] with infinitely differentiable kernel and mean functions. Then its paths are infinitely differentiable and the process and its derivative are jointly Gaussian and thus have a bounded density (see Rasmussen and Williams, 2006, Ch. 9.4). To deal with the restriction that X(p)∈[0,1], we could condition on the event E:={∀p:F(p)∈[0,1]}, for instance. Then paths are still twice differentiable, and we claim that hX(p)|E, defined as the density of X at point p, conditional on E, is still bounded. To see that, note that if P(E)>0, then we are done, since then
Proper scoring rules don’t guarantee predicting fixed points
Johannes Treutlein and Rubi Hudson worked on this post while participating in SERI MATS, under Evan Hubinger’s and Leo Gao’s mentorship respectively. We are grateful to Marius Hobbahn, Erik Jenner, and Adam Jermyn for useful discussions and feedback, and to Bastian Stern for pointing us to relevant related work.
Update 30 May 2023: We have now published a paper based on this post. In this paper, we also discuss in detail the relationship to the related literature on performative prediction.
Introduction
One issue with oracle AIs is that they might be able to influence the world with their predictions. For example, an AI predicting stock market prices might be able to influence whether people buy or sell stocks, and thus influence the outcome of its prediction. In such a situation, there is not one fixed ground truth distribution against which the AI’s predictions may be evaluated. Instead, the chosen prediction can influence what the model believes about the world. We say that a prediction is a self-fulfilling prophecy or a fixed point if it is equal to the model’s beliefs about the world, after the model makes that prediction.
If an AI has a fixed belief about the world, then optimizing a strictly proper scoring rule incentivizes it to output this belief (assuming the AI is inner aligned to this objective). In contrast, if the AI can influence the world with its predictions, this opens up the possibility for it to manipulate the world to receive a higher score. For instance, if the AI optimizes the world to make it more predictable, this would be dangerous, since the most predictable worlds are lower entropy ones in which humans are more likely dead or controlled by a misaligned AI. Optimizing in the other direction and making the world as unpredictable as possible would presumably also not be desirable. If, instead, the AI selects one fixed point (of potentially many) at random, this would still involve some non-aligned optimization to find a fixed point, but it may be overall less dangerous.
In this post, we seek to better understand what optimizing a strictly proper scoring rule incentivizes, given that the AI’s prediction can influence the world. Is it possible for such a rule to make the AI indifferent between different fixed points? Can we at least guarantee that the AI reports a fixed point if one exists, or provide some bounds on the distance of a prediction from a fixed point?
First, we show that strictly proper scoring rules incentivize choosing more extreme fixed points, i.e., fixed points that push probabilities to extreme values. In the case of symmetric scoring rules, this amounts to making the world more predictable and choosing the lowest entropy fixed point. However, in general, scoring rules can incentivize making specific outcomes more likely, even if this does not lead to the lowest entropy distribution.
Second, we show that an AI maximizing a strictly proper scoring rule will in general not report a fixed point, even if one exists and is unique. In fact, we show that under some reasonable distributions over worlds, optimal predictions are almost never fixed points. We then provide upper bounds for the inaccuracy of reported beliefs, and for the distance of predictions from fixed points. Based on these bounds, we develop scoring rules that make the bounds arbitrarily small.
Third, we perform numerical simulations using the quadratic and the logarithmic scoring rule, to show how the inaccuracy of predictions and the distance of predictions from fixed points depend on the oracle’s perceived influence on the world via its prediction. If its beliefs about the world depend on its predictions affine-linearly with slope α≤1/2, then our bounds are tight, and inaccuracy under the log scoring rule scales roughly as α/5. E.g., at α=1/4, predictions can be off by up to 5%.
Our results support the prior idea that an oracle will try to make the world more predictable to get a higher score. However, importantly, we show that it is not even clear whether an oracle would output a self-fulfilling prophecy in the first place: how close the oracle’s prediction will be to a fixed point depends on facts about the world and about the scoring rule. The problem thus persist even if there is a unique fixed point or no fixed point at all. It would not be sufficient for safety, for instance, to build an oracle with a unique and safe fixed point.
Related work
In general, fixed points can lead to arbitrarily bad outcomes (e.g., Armstrong, 2019). Most prior work has thus tried to alleviate problems with fixed points, e.g., by making the oracle predict counterfactual worlds that it cannot influence (Armstrong and O’Rorke, 2018). We are not aware of any prior work on specifically the question of whether oracles would be incentivized to output fixed points at all. We would be interested in pointers to any related work.
Work on decision markets (e.g., Chen et al., 2011; Oesterheld and Conitzer, 2021) tries to set the right incentives for predictions that inform decisions. However, in the decision market setup, predictions only affect the world by influencing the principal’s final decision, whereas we consider an arbitrary relationship between world and prediction. Unlike in our setup, in the decision market setup, there exist scoring rules that do incentivize perfectly accurate reports. There is also some literature on scoring rules discussing agents manipulating the world after making a prediction (e.g., Oka et al., 2014; Shi, Conitzer, and Guo, 2009), though to our knowledge the cases discussed do not involve agents influencing the world directly through their predictions.
A related topic in philosophy is epistemic decision theory (e.g., Greaves 2013). Greaves (2013) introduces several cases in which the world depends on the agent’s credences and compares the verdicts of different epistemic decision theories (such as an evidential and a causal version). While some of Greaves’ examples involve agents knowably adopting incorrect beliefs, they require joint beliefs over several propositions, and Greaves only considers individual examples. We instead consider only a single binary prediction and prove results for arbitrary scoring rules and relationships between predictions and beliefs.
Formal setup
Binary prediction setup
We focus on a simple case in which the oracle makes a prediction by outputting a probability p for a binary event. A scoring rule S:[0,1]×{0,1}→R takes in a prediction p∈[0,1] and an outcome y∈{0,1} and produces a score S(p,y). For a given scoring rule, define S(p,q):=qS(p,1)+(1−q)S(p,0), the expectation of S given that y is Bernoulli distributed with parameter q. Here, q represents the model’s actual belief, p its stated prediction, and S(p,q) the model’s expected score, which the model is trying to maximize.S is strictly proper iff for any given q∈[0,1], S(p,q) has a unique maximum at p=q.
Example 1 (Logarithmic scoring rule). The logarithmic scoring rule is defined as
S(p,y):=ylogp+(1−y)log(1−p)and S(p,q)=qlogp+(1−q)log(1−q). This is also the negative of the cross-entropy loss employed in training current large language models. One can show that it is strictly proper.
Example 2 (Quadratic scoring rule). Another strictly proper scoring rule is the quadratic score, defined as
S(p,y):=1−(p−y)2with S(p,q)=1−q(1−p)2−(1−q)p2. This is an affine transformation of the Brier score, making the two of them equivalent scoring rules.
Gneiting and Raftery representation
Gneiting and Raftery (2007) provide an alternative way to represent scoring rules, which will be helpful for many proofs.
Theorem 3 (Gneiting and Raftery, 2007, Theorem 1). A scoring rule S is strictly proper, if and only if there exists a strictly convex function G:[0,1]→R with subderivatives g:[0,1]→R (if G is differentiable, this is just the derivative, g=G′) such that for any p,q∈[0,1], it is
S(p,q)=G(p)+g(p)(q−p).The subderivative is a generalization of the derivative for convex functions. The theorem also holds when replacing “strictly proper” with “proper” and “strictly convex” with “convex”.
In addition to providing an expression for S in terms of G and g (where G(p)=S(p,p)), we can use this result to create new strictly proper scoring rules by defining strictly convex functions G.
Model of oracle’s influence
To model an oracle’s perceived influence on the world, we assume that there is a function f:[0,1]→[0,1] describing how the model’s beliefs about the world, q, depend on its prediction, p. That is, we assume that q=f(p), and that the model is producing a prediction to maximize S(p,f(p)).
A point p is a fixed point of f if f(p)=p. Note that by Brouwer’s fixed point theorem, if f is continuous, a fixed point always exists. Moreover, if f is Lipschitz continuous with constant L<1, then by Banach’s fixed point theorem, there exists a unique fixed point. Lastly, note that if p is a fixed point of f, then we have S(p,f(p))=S(p,p)=G(p).
An oracle may not be able to change the world much with just a single binary prediction. However, the AI’s influence may compound if it outputs many predictions, or if it gets to make a prediction over many possible outcomes. We restrict ourselves here to the single binary prediction case for simplicity, since this case is sufficient to gain an intuition and yield important conceptual insights. We plan on focusing on multiple predictions and predictions with more than two outcomes in future work, and we believe that versions of our results will also hold for those cases.
Incentives to choose more extreme fixed points
One issue with oracle AIs arises when there are multiple possible fixed points, since the choice of a fixed point gives the AI an axis over which it can optimize the world. This problem would be alleviated if an AI were to simply choose among fixed points randomly, for instance. Unfortunately, we can show that under a strictly proper scoring rule, an AI is incentivized to choose extreme fixed points, i.e., predictions that make one of the outcomes most likely.
Specifically, we prove that for any fixed point, either all fixed points with lower probabilities or all fixed points with higher probabilities will result in a higher score. Our results do not unambiguously show that lower entropy fixed points are always preferred, since scoring rules may not be symmetric: they may favor one of the outcomes over the other. We show that if S is symmetric, i.e., S(p,p)=S(1−p,1−p) for all p, then this means that lower entropy fixed points, corresponding to more confident predictions, are always preferred. Moreover, we show that there exist (asymmetric) strictly proper scoring rules that incentivize making either of the outcomes more likely.
Proposition 4. Let S be any strictly proper scoring rule and let p∈[0,1] be a fixed point. Then either all fixed points p′>p, or all fixed points p′<p, get a higher score than p. That is, we have S(p′,p′)>S(p,p) either for all p′>p or for all p′<p (or both). In particular, for any p∈(0,1), there exists a function f with fixed points at p and p′≠p such that |p′−1/2|>|p−1/2|, i.e., such that p′ has lower entropy, and p′ receives a higher score, i.e., S(p′,f(p′))>S(p,f(p)).
Proof. If p∈{0,1} then the proposition is vacuously true. We focus on the case where p∈(0,1). Since S is a strictly proper scoring rule, it is
S(p′,p′)>S(p,p′)=p′S(p,1)+(1−p′)S(p,0).Next, if S(p,1)≥S(p,0), let p′≥p arbitrary, and if S(p,1)≤S(p,0), let p′≤p. Then we have
p′S(p,1)+(1−p′)S(p,0)≥pS(p,1)+(1−p)S(p,0)=S(p,p).Combining both equations, we get
S(p′,p′)>p′S(p,1)+(1−p′)S(p,0)≥S(p,p),which concludes the first part of the proof.
For the “in particular” part, note that for any points p and p′, it is trivial to find a function f such that both points are fixed points. For instance, all points are fixed points of the identity function f(x)=x. The statement then follows from the first part of the result, since for p∈(0,1), we must have S(p′,p′)>S(p,p) for either p′=0 or p′=1. ◻
Corollary 5. Assume that S is symmetric and let p∈[0,1] arbitrary. Then we have S(p′,p′)>S(p,p) for all p′ such that |p′−1/2|>|p−1/2|. That is, lower entropy fixed points are always preferred.
Proof. Let p arbitrary and assume p≥1/2 (the case p<1/2 follows analogously). Let p′ such that |p′−1/2|>|p−1/2|, which implies p′>p since p≥1/2. By Proposition 4, we have S(p′,p′)>S(p,p) either for all p′>p or for all p′<p. In the first case, we are done. In the second case, note that p≥1/2≥1−p≥1−p′. By symmetry of S, it follows that
S(p′,p′)=S(1−p′,1−p′)>S(p,p).This shows the second case and concludes the proof. ◻
Proposition 6. For each outcome y∈{0,1}, there exists a strictly proper scoring rule S that incentivizes optimizing for y; i.e., such that S(p′,p′)>S(p,p) for all p,p′ with |p′−y|<|p−y|.
Proof. To begin, let y=1 and define G via G(p):=p2 for p∈[0,1]. Since this function is strictly convex, it defines a strictly proper scoring rule S by the Gneiting and Raftery characterization (see Theorem 3). Then, for any p′>p, we have
S(p′,p′)=G(p′)=p′2>p2=S(p,p).This shows that S incentivizes choosing fixed points that make outcome 1 more likely.
Next, for the case y=0, we can choose G(p):=(1−p)2. The proof then follows analogously to the case y=1. ◻
Incentives to predict non-fixed-points
The danger posed by oracles influencing the world does not depend on the existence of multiple fixed points. Oracles will try to manipulate the world, even if there is a unique fixed point or if no fixed point exists. In fact, we can prove that an AI maximizing a strictly proper scoring rule does in general not predict a fixed point, even if one exists.
An AI maximizing a strictly proper scoring rule balances two incentives: on the one hand, it has an incentive to make accurate predictions. On the other hand, it has an incentive to cause more extreme distributions over worlds (in the case of symmetric scoring rules, this amounts to minimizing entropy). If predictions can influence the world, then the point at which that trade-off is optimized is in general not a fixed point. An important consequence of this is that the oracle AIs considered here essentially act like agents: they try to shape the world, just like a standard RL agent.
Figure 1: Illustration of incentives under the logarithmic scoring rule. We plot level curves of equal values S(p,q) in different shades of blue. An agent can improve their score by moving towards pairs (p,q) indicated by darker shades of blue. The agent is constrained to pairs (p,q) such that q=f(p) (in this example, f is affine linear and plotted in red). By Lagrange’s method, we know that to maximize S(p,f(p)), the agent chooses the point (p,f(p)) at which the level curve through (p,f(p)) is tangent to f. Note that curves asymptote at 0 and 1 for the log scoring rule, which means that the rule strongly disincentivizes predicting p∈{0,1} when q∉{0,1}. This is desirable for safety and distinguishes it from the quadratic scoring rule (not shown here).
Another way to understand our result intuitively is with an analogy to betting markets. Assume a trader is offered securities that pay out $1 if some outcome comes about in the future. E.g., it could pay out if a specific candidate wins an election. Assume the agent is offered securities at different prices, and it buys securities at prices ≤p, but is not willing to buy securities with marginally higher price. Then, assuming the agent cannot influence the outcome, we can infer that it has credence p in proposition A. One can show that maximizing a proper scoring rule is equivalent to trading in such a market, and individual scoring rules correspond to specific pricing schemes for such securities (see Oesterheld and Conitzer, 2021, Sec. 5.2; Caspar Oesterheld will explain this more accessibly in an upcoming blog post).
Now assume that there is a unique fixed point at credence p, and that the agent has bought many securities at prices ≤p. Since the agent owns securities that pay out if outcome A happens, it now has a stake in this outcome. It will try to influence the world in whatever way possible to make outcome A happen. In particular, the agent would be willing to buy additional shares in outcome A at a loss, if this made outcome A more likely, similar to a crypto trader buying their own NFTs at high prices to generate more hype.
For example, assume that by buying a security at price p+ϵ, the agent can manipulate the world and thus increase its credence in outcome A to p+ϵ/2. Then it may be rational for the agent to buy this security, even at a small expected loss, in order to make all of its other securities more valuable. Overall, the agent will tend to buy securities up to some price p′>p such that p′ is not a fixed point and does not represent its actual belief in proposition A anymore.
Note that in the example, the agent had an incentive to make outcome A more likely, but in general, the incentives for the agent will depend specifically on what securities the agent is offered at what prices. Analogously, the incentives provided by a scoring rule will depend on the specific chosen scoring rule. But regardless of the scoring rule, the agent will be incentivized to manipulate the world, to the point where it will not even make accurate predictions anymore.
Now we turn to our formal results. First, we show that there exist cases where a fixed point exists but the optimal prediction is not a fixed point. Afterwards, we show that when assuming differentiability and some reasonable distribution over f, optimal predictions are almost surely not fixed points.
Proposition 7. Let S be any strictly proper scoring rule. For any interior fixed point p∗∈(0,1) there exists a function f with Lipschitz constant L<1 and a unique fixed point at p∗, such that for some p′≠p∗, it is S(p′,f(p′))>S(p∗,f(p∗)). That is, it is not optimal for the model to output the unique fixed point of f.
Note that since the function f has Lipschitz constant strictly smaller than 1, it represents a world that “dampens” the influence of the prediction, leading to a unique fixed point by Banach’s fixed point theorem. It is interesting that the model still prefers to make a prediction that is not a fixed point.
Proof. To begin, let p∗∈(0,1) arbitrary and define fα(p):=(1−α)p+αp∗ for α∈[0,1]. To illustrate our proof idea, consider the figure below. For any α>0, fα has a unique fixed point at p∗, while f0 is the identity function, so all points are fixed points of f0. By the result in the previous section, we know that there is some point p′ which receives a strictly higher score than p∗ if it is a fixed point, so S(p′,f0(p′))>S(p∗,f0(p∗)). We will show that S(p′,fα(p′)) is continuous in α. This means that we can choose a small enough α>0 to make sure that p′ remains preferable over p∗, i.e., S(p′,fα(p′))>S(p∗,fα(p∗)), despite it not being a fixed point of fα for α>0.
Figure 2: Illustration of the setup for our proof. We plot f0 in black and fα for α=0.15 in red.
To formalize the proof, begin by noting that |fα(p)−fα(p′)|≤(1−α) for any p,p′∈[0,1], so it has a Lipschitz constant L:=(1−α)<1, and as mentioned, p∗ is the unique fixed point of fα.
Now consider the case α=0. As mentioned, every point is a fixed point of f0. Then by Proposition 1, there exists p′≠p∗ and ϵ>0 such that
S(p′,f0(p′))=S(p′,p′)≥S(p∗,p∗)+ϵ=S(p∗,f0(p∗))+ϵ.(i)So for α=0, the model prefers to predict p′ over p∗ and gets at least ϵ additional expected score.
Now we show that the model still prefers to predict p′, even for some small α>0. To that end, note that
S(p,fα(p))=fα(p)S(p,1)+(1−fα(p))S(p,0)is affine-linear in fα(p), and fα(p) is affine-linear in α by construction. This means that S(p,fα(p)) is continuous in α. So, there must exist some small α>0 such that
S(p′,fα(p′))≥S(p′,f0(p′))−ϵ2=S(p′,p′)−ϵ2(i)≥S(p∗,p∗)+ϵ2>S(p∗,p∗)=S(p∗,f(p∗)).Choosing α in this way, we can define f:=fα, and have thus provided a function that satisfies the statement that we wanted to prove. ◻
The above result raises the question whether a situation where fixed points are suboptimal is a niche counterexample or whether it is actually common. We show that under some relatively mild assumptions, the optimal prediction is not a fixed point with probability 1.
We assume that G:p↦S(p,p) is twice differentiable and that f is twice continuously differentiable. The first assumption is satisfied by all commonly used scoring functions. Our result then holds with probability 1 for any distribution over twice continuously differentiable functions f such that f and f′ have a bounded joint density function. For instance, it would hold for a Gaussian process with smooth kernel and mean functions, if we condition on observing function values in the interval [0,1] (see Example 17 in Appendix A).
The intuition behind this result is that if a prediction p is an interior point and optimal, it must be ddxS(p,f(p))=0. Using the Gneiting and Raftery characterization (Theorem 3), we can show that this is a kind of knife-edge case, at which a critical point of f or of G has to fall together with a fixed point of f (see Figure 1). Given bounded density functions, this happens with probability 0.
Proposition 8. Let S be a strictly proper scoring rule and let G:p↦S(p,p) be twice differentiable. Let F:={F(p)}p∈(0,1) be a stochastic process with values in [0,1] and assume that
hF(p),F′(p)≤C.the sample paths p⇝F(p) are twice continuously differentiable
for each p∈(0,1), the random vector (F(p),F′(p)) has a density hF(p),F′(p) and there exists a constant C such that for any p∈(0,1), it is
Then, almost surely, there is no point p∈(0,1) such that p∈argmaxS(p,F(p)) and F(p)=p.
Proof. In Appendix A. ◻
Bounds on the deviation from fixed points
In the previous sections, we have shown that an oracle that can manipulate the world may not even make accurate predictions. How bad is it, and can anything be done? Assuming differentiability of f and S, this section provides bounds for the inaccuracy of optimal predictions (i.e., |p−f(p)|) and their distance from fixed points (i.e., |p−p∗|). We then show how one can choose scoring rules to make these bounds arbitrarily small. We leave the case of non-differentiable f and G for future work.
Proposition 9. Let S be a strictly proper scoring rule, and assume G:p↦S(p,p) and g=G′ as in the Gneiting and Raftery characterization (see Theorem 3) are differentiable. Assume that f is differentiable and has Lipschitz constant L, i.e., |f′(p)|≤L for all p∈[0,1]. Then whenever ^p is an optimal prediction, |^p−f(^p)|≤Lsupp|g(p)/g′(p)|.
Proof. First, consider the case p∈(0,1). In that case, if p is an optimal prediction, we must have ddpS(p,f(p))=0. Hence, it follows
0=ddpS(p,f(p))=ddp(g(p)(f(p)−p)+G(p))=g′(p)f(p)+f′(p)g(p)−g′(p)p=g′(p)(f(p)−p)+f′(p)g(p).Rearranging terms, we get
p−f(p)=f′(p)g(p)g′(p).It follows that
|p−f(p)|=∣∣∣f′(p)g(p)g′(p)∣∣∣=|f′(p)||g(p)||g′(p)|≤L|g(p)||g′(p)|≤Lsupp′|g(p′)/g′(p′)|,which concludes the proof in the case p∈(0,1).
Next, assume p=0 is the optimal prediction. Then we must have
0≤ddpS(p,f(p))=g′(p)(f(p)−p)+f′(p)g(p),which is equivalent to p−f(p)≥f′(p)g(p)/g′(p). Note that since p=0, it is p−f(p)≤0.Hence, |p−f(p)|≤∣∣∣f′(p)g(p)g′(p)∣∣∣, which can be bounded as above.
Finally, if p=1, then analogously we must have ddpS(p,f(p))≥0, which is equivalent to p−f(p)≤f′(p)g(p)/g′(p). Note that since p=1, it is p−f(p)≥0. Hence, |p−f(p)|≤∣∣∣f′(p)g(p)g′(p)∣∣∣, which again can be bounded as before. ◻
In the case where L<1, we can derive from this result a bound on |p−p∗|.
Corollary 10. Assume L<1 and let p∗ be the unique fixed point of f. Then if ^p is an optimal prediction, we have
|^p−p∗|≤L1−Lsupp∣∣∣g(p)g′(p)∣∣∣.Proof. To begin, recall that for L<1, p∗ exists and is unique by Banach’s fixed point theorem. For any p∈[0,1], we have
|p−p∗|=|p−f(p)+f(p)−p∗|≤|p−f(p)|+|f(p)−f(p∗)|≤|p−f(p)|+L|p−p∗|⇒(1−L)|p−p∗|≤|p−f(p)|⇒|p−p∗|≤|p−f(p)|(1−L).Hence, if ^p∈[0,1] is an optimal prediction, it follows by Proposition 9 that
|^p−p∗|≤|^p−f(^p)|(1−L)≤L1−Lsupp∣∣∣g(p)g′(p)∣∣∣,which concludes the proof. ◻
Example 11 (Bound for the quadratic scoring rule). For the quadratic scoring rule, we have g(p)=2p−1 and g′(p)=2. We plot |g(p)/g′(p)| below.
Figure 3: Graph of g(p)/g′(p) for the quadratic scoring rule.
Note that |g(p)|≤1, so |g(p)|/g′(p)≤1/2. Hence, if f′(p)≤L for all p∈[0,1], then |p−f(p)|≤L/2. Moreover, if L<1, we get |p−p∗|≤L2−2L.
Example 12 (Bound for the log scoring rule). Next, consider the logarithmic scoring rule. Then g(p)=−log(1−p)+logp and g′(p)=1/(1−p)+1/p. One can show that then g(p)/g′(p)=(p−1)p(log(1−p)−log(p)).
Figure 4: Graph of g(p)/g′(p) for the log scoring rule.
Numerically optimizing this, we get a bound of |g(p)/g′(p)|≈0.22<1/4, attained roughly at p≈0.2 and p≈0.8. So the log score gets us roughly half of the bound of the quadratic score.
The bounds proved above become small if L is small. If one can choose the scoring rule, then one can also get stronger guarantees by making suppg(p)/g′(p) small. In fact, as the following result shows one can make g(p)/g′(p) arbitrarily small, and thus get arbitrarily strong bounds for any given Lipschitz constant L.
Proposition 13. For every ϵ>0 and Lipschitz constant L, there is a strictly proper scoring rule that ensures |p−f(p)|≤ϵ for any optimal prediction p. For L<1, there exists a scoring rule that ensures |p−p∗|≤ϵ.
Proof. Consider the scoring rule defined by g(p)=eKp via Gneiting and Raftery’s characterization (choosing, e.g., G(p)=1KeKp). Then g(p)/g′(p)=1/K. Hence, by Proposition 9, setting K=L/ϵ achieves the desired bound on |p−f(p)|. By Corollary 10, the bound on |p−p∗| is achieved by K=L(1−L)ϵ. ◻
We can visualize this result again with indifference curves, the same as in Figure 1. For K=10, the function has high curvature at the fixed points, thus forcing optimal predictions to be close to fixed points.
Figure 5 (a) and (b): Indifference curves for the scoring rule defined via g(p):=eKp with K=1 (a) and K=10 (b). As K increases, the curves become sharper with higher curvature at the diagonal. This means that optima will stay close to the diagonal, regardless of the slope of f.
Unfortunately, as far as we can tell, this only works by using relatively outlandish scoring rules, where payoffs vary enormously with the prediction and the outcome. This would not be very practical for humans, and it could also be a hard goal to instill into AI systems, e.g., due to numerical issues.
Numerical simulations
In this section, we provide some numerical simulations with the quadratic and log scoring rules, to see how inaccurate score-maximizing predictions might be in practice. We make the simplifying assumption that f is affine-linear with slope between 0 and 1, and analyze the dependence of the inaccuracy of the prediction and its distance from the fixed point depending on the slope. The Mathematica notebook for our experiments (including some interactive widgets) can be found here.
Our experiment provide an initial estimate of the degree to which predictions can be off, depending on how much influence the oracle can exert using its prediction. This is just a toy model: in general, we will make predictions with more than two possible outcomes, and the dependence of an oracle’s beliefs on its predictions will be highly nonlinear. Nevertheless, we believe that our analysis is useful to provide some initial intuition for the potential magnitude of the effect.
Experimental setup
Concretely, we model f to be affine linear with slope α∈[0,0.95] and fixed point p∗, thus yielding the functional form
f(p):=p∗+α(p−p∗)∀p∈[0,1].We stop at α≤0.95 to avoid instabilities when α≈1 and thus f becomes the identity function. For a given scoring rule S, fixed point p∗, and slope α, we maximize S(p,f(p)) to find the optimal prediction ^p. We then evaluate both inaccuracy of optimal predictions, i.e., |^p−f(^p)|, as well as distance from the fixed point, i.e., |^p−p∗|.
Figure 6: Illustration of our experimental setup, with a given function f parameterized by a fixed point p∗ and slope α, and the optimal prediction ^p, for the log scoring rule.
Results
To begin, we maximize distances across possible choices of fixed points p∗∈[0,1], and plot the maximal inaccuracy of the optimal prediction as well as the maximal distance from a fixed point. We also compare to our theoretical bounds for both quantities, i.e., |^p−f(^p)|≤αsupp|g(p)/g′(p)| and |^p−p∗|≤α1−αsupp|g(p)/g′(p)|, where supp|g(p)/g′(p)|=1/2 for the quadratic scoring rule and supp|g(p)/g′(p)|≈0.22 for the log scoring rule.
Figure 7 (a) and (b): Max inaccuracy and max distance to fixed point (FP) of optimal predictions, depending on the slope of f, for the quadratic (a) and logarithmic (b) score. We display both our empirical results and our theoretical bound.
For both quadratic and log scoring rule, our theoretical bounds are tight for slopes α≤0.5. For higher slopes, inaccuracy goes down, as the function f(p) becomes closer to the identity function, and optimal predictions are bounded in [0,1].
Next, we plot inaccuracy against both the location of the fixed point p∗ and the slope α of f as a heatmap. As expected for symmetric scoring rules, the plots have a mirror symmetry at p∗=1/2.
Figure 8 (a) and (b): Heatmaps of L1 distance inaccuracy of optimal predictions, depending on fixed point position and slope of f, for the quadratic (a) and logarithmic (b) scoring rules.
Note that relatively high inaccuracies can be found at various qualitatively different points in the graphs, even when the slope of f is small, i.e., when the oracle has little influence on the environment.
For the logarithmic scoring rule, we also give the same plot for the absolute distance between the logits or log odds of probabilities (logit distance). It is defined as d(p,p′):=|σ−1(p)−σ−1(p′)|, where σ−1(p):=logp1−p is the logit of p (or the inverse sigmoid transform). If probabilities are close to 0 or 1, then L1 distance will always evaluate to very small distances. In contrast, the logit distance depends on order of magnitude differences between probabilities, which may be the more important quantity.
Figure 9: Heatmap of logit distance inaccuracy of optimal predictions for the log scoring rule.
We can see that inaccuracy remains high in logit space for fixed points close to 0 and 1. We do not plot logit distances for the quadratic score, since for that score, optimal predictions often take values close to or equal to {0,1} (even if neither f(^p) nor p∗ lie in {0,1}), so the corresponding logit distances quickly become very large or infinite. The fact that logit distances are bounded for the log score is an advantage of that scoring rule.
Conclusion
If predictions cannot influence the world, then strictly proper scoring rules incentivize oracle AIs to output honest predictions. This runs into problems if oracles can influence the world. Not only do they have incentives to output more extreme fixed points—we showed that, in general, they do not even output predictions that accurately reflect their beliefs, regardless of the existence or uniqueness of fixed points. We analyzed this inaccuracy quantitatively and showed that under some commonly used scoring rules, both the inaccuracy of predictions and the distance from fixed points can be large. To address this issue, we introduced scoring rules that minimize these quantities. However, we did not solve the problem of incentivizing the prediction of extreme fixed points.
The main takeaway from our model is that oracle AIs incentivized by strictly proper scoring rules may knowingly report inaccurate predictions. It is unclear how well our quantitative results apply to more realistic settings, and thus how inaccurate realistic oracles’ predictions will be. Our model makes a number of simplifying and potentially inaccurate assumptions, including: (i) a single binary prediction setup; (ii) the oracle maximizes a scoring rule; (iii) a function f describes the relationship between the oracle’s predictions and beliefs over the world. Our bounds also depend on differentiability of scoring rules and functions f, and our numerical simulations assume affine linear f.
We believe that our model and results will extend to higher-dimensional predictions, and the assumptions of differentiability could likely also be relaxed in future work. However, a more fundamental issue is better understanding the relationship between predictions and beliefs, especially for higher-dimensional predictions. We hope that further progress in these areas will help us determine the feasibility of safe oracle AI setups.
Another direction for future research that we are pursuing is incentive structures related to stop-gradients. If an AI is only trying to match its prediction to the world rather than jointly optimizing both the prediction and the world, then the only equilibria of this process may be fixed points. For example, some types of ML oracles might implement stop-gradients and therefore always converge to fixed points.
Appendix A. Proof of Proposition 8
In this section, we prove Proposition 8. In the following, we always assume a strictly proper scoring rule S and accompanying functions G,g as in the Gneiting and Raftery representation (see Theorem 3). We will ignore issues of measurability in our proofs.
We begin by proving some lemmas. First, we show that if p∗∈(0,1) is a fixed point of f, then either it is a critical point of f or of G.
Lemma 14. Let G,g, and f be differentiable. Let p∗∈(0,1) be a fixed point of f and an optimal prediction. Then f′(p∗)=0 or g(p∗)=0.
Proof. We know that if p∗ is an optimal report and interior point, then it is ddpS(p∗,f(p∗))=0. Hence, as in the proof of Proposition 9, we can follow that
0=ddpS(p∗,f(p∗))=ddp(g(p∗)(f(p∗)−p∗)+G(p∗))=g′(p∗)f(p∗)+f′(p∗)g(p∗)−g′(p∗)p∗=g′(p∗)(f(p∗)−p∗)+f′(p∗)g(p∗)and thus p∗−f(p∗)=f′(p∗)g(p∗)g′(p∗). Because p∗−f(p∗)=0, it must be f′(p∗)g(p∗)=0 and the conclusion follows. ◻
Next, we show that the fixed points of f are almost surely not at critical points of G, under our assumptions on the distribution over f.
Lemma 15. Let F:={F(p)}p∈(0,1) be a stochastic process with values in [0,1] and assume that for each p∈(0,1), the random vector F(p) has a density hF(p). Then almost surely if F(p)=p for some p∈(0,1) then g(p)≠0. That is,
P(∃p:F(p)=p∧g(p)≠0)=0.Proof. First note that if S is strictly proper, then G is strictly convex and so g has at most one root in (0,1). Let that root be p∗. Since we assume that F(p) has a density function hF(p) it must be
P(F(p∗)=p∗)=∫{p∗}hF(p)(x)dx=0.◻
We also require a result about random fields. Essentially, this says that if we take a random path f:(0,1)→R2 through a two-dimensional space, and if for each point p∈(0,1) the values F(p) of the path have a bounded density function, then for any particular point in R2, the probability that the path crosses this point is 0.
Proposition 16. Let Y={Y(p)}p∈(0,1) be a random field with values in R2 and let u∈R2. Assume that
the sample paths p⇝Y(p) are continuously differentiable
for each p∈(0,1), Y(p) has a density hY(p) and there exists a constant C such that hY(p)(x)≤C for all p∈(0,1) and x∈R2.
Then, almost surely, there is no point p∈(0,1) such that Y(p)=u.
Proof. This is a special case of Proposition 6.11 in Azaïs and Wschebor (2009). ◻
Now we can turn to the proof of the main result.
Proof of Proposition 8. Let F={F(p)}p∈(0,1) be a stochastic process as described in the proposition, i.e., its paths are twice continuously differentiable and there is a bound C such that for any p∈(0,1), the random vector (F(p),F′(p)) has a density hF(p),F′(p)≤C. We want to show that then almost surely there does not exist p∈(0,1) such that F(p)=p and p is an optimal prediction. I.e., we want to show that
P(∃p:F(p)=p∧p∈argmaxS(p,F(p)))=0.First, by Lemma 14, for an optimal report to be a fixed point, it must be either g(p)=0 or F′(p)=0. Moreover, by assumption, (F(p),F′(p)) has a density function, and thus also F(p) has one. Hence, by Lemma 15, it follows that if F(p)=p for some p∈(0,1), then almost surely g(p)≠0.
Second, we need to show that also almost surely F′(p)≠0. To that end, define the random field Y:={Y(p)}p∈(0,1) via Y(p):=(F(p)−p,F′(p)) and define u:=(0,0). Then the conditions for Proposition 16 are satisfied by our assumptions about F, since ∂1Y(p)=F′(p)+1, ∂2Y(p)=F′′(p) are continuous and there exists C such that hY(p)(x)=hF(p),F′(p)(x1+p,x2)≤C. Hence, almost surely there is no point p∈(0,1) such that Y(p)=u. Since Y0(p)=0 is equivalent to F(p)=p and Y2(p)=0 is equivalent to F′(p)=0, there is thus also almost surely no point p∈(0,1) such that F(p)=p and F′(p)=0.
Summarizing our argument, it follows that
P(∃p:F(p)=p∧p∈argmaxS(p,F(p)))≤P(∃p:F(p)=p∧g(p)≠0)+P(∃p:F(p)=p∧F′(p)=0)=0.This concludes the proof. ◻
We conclude by providing an example of a stochastic process that satisfies our conditions.
Example 17. Consider a Gaussian process {X(p)}p∈[0,1] with infinitely differentiable kernel and mean functions. Then its paths are infinitely differentiable and the process and its derivative are jointly Gaussian and thus have a bounded density (see Rasmussen and Williams, 2006, Ch. 9.4). To deal with the restriction that X(p)∈[0,1], we could condition on the event E:={∀p:F(p)∈[0,1]}, for instance. Then paths are still twice differentiable, and we claim that hX(p)|E, defined as the density of X at point p, conditional on E, is still bounded. To see that, note that if P(E)>0, then we are done, since then
hX(p)|E(x)=1E(x)P(E)hX(p)(x).We leave the proof of P(E)>0 as an exercise.