Vanessa Kosoy comments on Vanessa Kosoy’s Shortform

Vanessa Kosoy 9 Apr 2022 12:16 UTC
Infradistributions admit an information-theoretic quantity that doesn’t exist in classical theory. Namely, it’s a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:

Let $X$ be a finite set and $Θ$ a crisp infradistribution (credal set) on $X$, i.e. a closed convex subset of $Δ X$. Imagine someone trying to communicate a message by choosing a distribution out of $Θ$. Formally, let $Y$ be any other finite set (the space of messages), $θ \in Δ Y$ (a prior over messages) and $K : Y \to Θ$ (a communication protocol). Consider the distribution $η := θ ⋉ K \in Δ (Y \times X)$. The information capacity of the protocol is the mutual information between the projection on $Y$ and the projection on $X$ according to $η$, i.e. $I_{η} (\mathrm{pr}_{X}; \mathrm{pr}_{Y})$. The “Knightian entropy” of $Θ$ is then defined to be the maximum of $I_{η} (\mathrm{pr}_{X}; \mathrm{pr}_{Y})$ over all choices of $Y$, $θ$, $K$. For example, if $Θ$ is Bayesian (a single distribution) then it’s $0$, whereas if $Θ = ⊤_{X}$ (all of $Δ X$), it is $\ln | X |$.
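
To make the definition concrete, here is a minimal numerical sketch. It assumes (this is not in the text above) that $Θ$ is a polytope given by a finite list of extreme points, and it treats those extreme points as the input alphabet of a discrete channel into $X$. Restricting messages to extreme points should lose nothing, because a message sent to a mixture can be split into messages sent to the mixture's components without decreasing the mutual information. The maximization over $θ$ is then done with the standard Blahut-Arimoto iteration, which is a technique swapped in here for illustration, not something the definition prescribes. The function name `knightian_entropy` and its parameters are hypothetical.

```python
import math


def knightian_entropy(vertices, iters=5000, tol=1e-12):
    """Estimate, in nats, the maximal mutual information for a credal set.

    `vertices` is a list of probability vectors over X, assumed to be the
    extreme points of the credal set (an assumption of this sketch).
    """
    m = len(vertices)
    n = len(vertices[0])
    theta = [1.0 / m] * m  # start from the uniform prior over messages

    def divergences(prior):
        # q is the marginal on X induced by the prior over messages.
        q = [sum(prior[y] * vertices[y][x] for y in range(m)) for x in range(n)]
        # d[y] = KL(vertices[y] || q), with the convention 0 * log 0 = 0.
        # The floor on q only guards against numerical underflow.
        return [
            sum(p * math.log(p / max(q[x], 1e-300)) for x, p in enumerate(v) if p > 0)
            for v in vertices
        ]

    for _ in range(iters):
        d = divergences(theta)
        # Blahut-Arimoto update: reweight each message by exp of its divergence.
        weights = [t * math.exp(dy) for t, dy in zip(theta, d)]
        total = sum(weights)
        new_theta = [w / total for w in weights]
        converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break

    # Mutual information at the final prior: sum_y theta[y] * KL(K(y) || q).
    d = divergences(theta)
    return sum(t * dy for t, dy in zip(theta, d))


# Total uncertainty on a 4-element set: extreme points are the point masses.
top = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# A Bayesian (single-distribution) credal set.
bayesian = [[0.2, 0.3, 0.5]]

print(knightian_entropy(top), math.log(4))
print(knightian_entropy(bayesian))
```

The two calls at the end correspond to the two examples in the definition: total uncertainty on $X$ and a Bayesian $Θ$. The result is only as good as the iteration's convergence, so for credal sets with many extreme points the value should be read as a lower bound on the maximum.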

Here is one application^[1] of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion $ϵ$ of the prior $ζ$ consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.

Let $O$ be the set of observations and $A$ the set of actions (which might be “just” predictions) of our AI, and for any environment $τ$ and prior $ξ$, let $D_{τ}^{ξ} (n) \in Δ (A \times O)^{n}$ be the distribution over histories resulting from our algorithm starting with prior $ξ$ and interacting with environment $τ$ for $n$ time steps. We have $ζ = ϵ μ + (1 - ϵ) β$, where $μ$ is the malign part of the prior and $β$ the benign part. For any $μ'$, consider $D_{τ}^{ϵ μ' + (1 - ϵ) β} (n)$. The closure of the convex hull of these distributions over all choices of $μ'$ (“attacker policy”) is some $Θ_{τ}^{β} (n) \subseteq Δ (A \times O)^{n}$. The maximal Knightian entropy of $Θ_{τ}^{β} (n)$ over all admissible $τ$ and $β$ is called the malign capacity of the algorithm. Essentially, this is a bound on how much information the malign hypotheses can transmit into the world via the AI during a period of $n$ time steps. The goal then becomes finding algorithms with simultaneously good regret bounds and good (in particular, at most polylogarithmic in $n$) malign capacity bounds.
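
In symbols, the construction above can be summarized as follows. The names $H_K$ (for Knightian entropy) and $\mathrm{MC}$ (for malign capacity) are shorthand introduced here for illustration only; they are not notation from the text.

$$Θ_{τ}^{β}(n) := \overline{\mathrm{conv}} \left\{ D_{τ}^{ϵ μ' + (1 - ϵ) β}(n) \;:\; μ' \right\}, \qquad H_K(Θ) := \max_{Y,\, θ,\, K : Y \to Θ} I_{θ ⋉ K}(\mathrm{pr}_{X}; \mathrm{pr}_{Y}), \qquad \mathrm{MC}(n) := \max_{τ,\, β} H_K\!\left(Θ_{τ}^{β}(n)\right)$$

Here $X$ in the middle formula is instantiated as the space of histories $(A \times O)^{n}$, and the stated goal is $\mathrm{MC}(n)$ growing at most polylogarithmically in $n$.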
1. This is an idea I’m collaborating on with Johannes Treutlein.