Vanessa Kosoy comments on Vanessa Kosoy’s Shortform

Vanessa Kosoy Jan 16, 2021, 12:01 AM
LW: 3 AF: 1
Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I’m quite confident the overall idea is sound.

We will work with homogeneous ultracontributions (HUCs). $□ X$ will denote the space of HUCs over $X$ . Given $μ \in □ X$ , $S (μ) \subseteq Δ^{c} X$ will denote the corresponding convex set. Given $p \in Δ^{c} X$ and $μ \in □ X$ , $p : μ$ will mean $p \in S (μ)$ . Given $μ, ν \in □ X$ , $μ ⪯ ν$ will mean $S (μ) \subseteq S (ν)$ .

Syntax

Let $T^{ι}$ denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types $T$ by:
- $0 \in T$ (intended meaning: the uninhabited type)
- $1 \in T$ (intended meaning: the one element type)
- If $α \in T^{ι}$ then $α \in T$
- If $α, β \in T$ then $α + β \in T$ (intended meaning: disjoint union)
- If $α, β \in T$ then $α \times β \in T$ (intended meaning: Cartesian product)
- If $α \in T$ then $(α) \in T$ (intended meaning: predicates with argument of type $α$ )
For each $α, β \in T$ , there is a set $F_{α \to β}^{0}$ which we interpret as atomic terms of type $α \to β$ . We will denote $V_{α}^{0} := F_{1 \to α}^{0}$ . Among those we distinguish the logical atomic terms:
- ${p r}_{α β} \in F_{α \times β \to α}^{0}$
- $i_{α β} \in F_{α \to α + β}^{0}$
- Symbols we will not list explicitly, that correspond to the algebraic properties of $+$ and $\times$ (commutativity, associativity, distributivity and the neutrality of $0$ and $1$ ). For example, given $α, β \in T$ there is a “commutator” of type $α \times β \to β \times α$ .
- $=_{α} \in V_{(α \times α)}^{0}$
- ${d i a g}_{α} \in F_{α \to α \times α}^{0}$
- $()_{α} \in V_{((α) \times α)}^{0}$ (intended meaning: predicate evaluation)
- $⊥ \in V_{(1)}^{0}$
- $⊤ \in V_{(1)}^{0}$
- $\lor_{α} \in F_{(α) \times (α) \to (α)}^{0}$
- $\land_{α} \in F_{(α) \times (α) \to (α)}^{0}$ [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
- $\exists_{α β} \in F_{(α \times β) \to (β)}^{0}$
- $\forall_{α β} \in F_{(α \times β) \to (β)}^{0}$ [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
- Assume that for each $n \in N$ there is some $D_{n} \subseteq □ [n]$ : the set of “describable” ultracontributions [EDIT: it is probably sufficient to only have the fair coin distribution in $D_{2}$ in order for it to be possible to approximate all ultracontributions on finite sets]. If $μ \in D_{n}$ then $┌ μ ┐ \in V_{(\sum_{i = 1}^{n} 1)}$
We recursively define the set of all terms $F_{α \to β}$ . We denote $V_{α} := F_{1 \to α}$ .
- If $f \in F_{α \to β}^{0}$ then $f \in F_{α \to β}$
- If $f_{1} \in F_{α_{1} \to β_{1}}$ and $f_{2} \in F_{α_{2} \to β_{2}}$ then $f_{1} \times f_{2} \in F_{α_{1} \times α_{2} \to β_{1} \times β_{2}}$
- If $f_{1} \in F_{α_{1} \to β_{1}}$ and $f_{2} \in F_{α_{2} \to β_{2}}$ then $f_{1} + f_{2} \in F_{α_{1} + α_{2} \to β_{1} + β_{2}}$
- If $f \in F_{α \to β}$ then $f^{- 1} \in F_{(β) \to (α)}$
- If $f \in F_{α \to β}$ and $g \in F_{β \to γ}$ then $g \circ f \in F_{α \to γ}$
Elements of $V_{(α)}$ are called formulae. Elements of $V_{(1)}$ are called sentences. A subset of $V_{(1)}$ is called a theory.
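
As a rough illustration of the syntax above, here is a minimal Python sketch of the type grammar and the term-forming rules. All class and function names (`Zero`, `Indiv`, `Term`, `compose`, and so on) are hypothetical and chosen only for this example. One simplifying assumption: types are compared syntactically, so the "algebrator" symbols are not modelled.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Zero:            # the uninhabited type 0
    pass

@dataclass(frozen=True)
class One:             # the one element type 1
    pass

@dataclass(frozen=True)
class Indiv:           # a type of individuals (an element of T^iota)
    name: str

@dataclass(frozen=True)
class Sum:             # alpha + beta (disjoint union)
    left: "Type"
    right: "Type"

@dataclass(frozen=True)
class Prod:            # alpha x beta (Cartesian product)
    left: "Type"
    right: "Type"

@dataclass(frozen=True)
class Pred:            # (alpha): predicates with argument of type alpha
    arg: "Type"

Type = Union[Zero, One, Indiv, Sum, Prod, Pred]

@dataclass(frozen=True)
class Term:            # an element of F_{dom -> cod}
    name: str
    dom: Type
    cod: Type

def times(f1, f2):
    # f1 x f2 has type (dom1 x dom2) -> (cod1 x cod2)
    return Term(f"({f1.name} x {f2.name})", Prod(f1.dom, f2.dom), Prod(f1.cod, f2.cod))

def plus(f1, f2):
    # f1 + f2 has type (dom1 + dom2) -> (cod1 + cod2)
    return Term(f"({f1.name} + {f2.name})", Sum(f1.dom, f2.dom), Sum(f1.cod, f2.cod))

def inverse(f):
    # f^{-1} has type (cod) -> (dom): it pulls predicates back along f
    return Term(f"{f.name}^-1", Pred(f.cod), Pred(f.dom))

def compose(g, f):
    # g o f is only well-formed when the codomain of f is the domain of g
    if f.cod != g.dom:
        raise TypeError(f"cannot compose {g.name} after {f.name}")
    return Term(f"({g.name} o {f.name})", f.dom, g.cod)

def is_formula(t):
    # formulae are elements of V_(alpha) = F_{1 -> (alpha)}
    return t.dom == One() and isinstance(t.cod, Pred)

def is_sentence(t):
    # sentences are elements of V_(1)
    return t.dom == One() and t.cod == Pred(One())
```

For instance, with `pr = Term("pr", Prod(a, b), a)`, the term `inverse(pr)` has domain `Pred(a)` and codomain `Pred(Prod(a, b))`, matching the rule for $f^{- 1}$ .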

Semantics

Given $T \subseteq V_{(1)}$ , a model $M$ of $T$ is the following data. To each type $α \in T$ , there must correspond some compact Polish space $M (α)$ s.t.:
- $M (0) = \emptyset$
- $M (1) = p t$ (the one point space)
- $M (α + β) = M (α) ⊔ M (β)$
- $M (α \times β) = M (α) \times M (β)$
- $M ((α)) = □ M (α)$
To each $f \in F_{α \to β}$ , there must correspond a continuous mapping $M (f) : M (α) \to M (β)$ , under the following constraints:
- $p r$ , $i$ , $d i a g$ and the “algebrators” have to correspond to the obvious mappings.
- $M (=_{α}) = ⊤_{{d i a g}_{M (α)}}$ . Here, ${d i a g}_{X} \subseteq X \times X$ is the diagonal and $⊤_{C} \in □ X$ is the sharp ultradistribution corresponding to the closed set $C \subseteq X$ .
- Consider $α \in T$ and denote $X := M (α)$ . Then, $M (()_{α}) = ⊤_{□ X} ⋉ {i d}_{□ X}$ . Here, we use the observation that the identity mapping ${i d}_{□ X}$ can be regarded as an infrakernel from $□ X$ to $X$ .
- $M (⊥) = ⊥_{p t}$
- $M (⊤) = ⊤_{p t}$
- $S (M (\lor) (μ, ν))$ is the convex hull of $S (μ) \cup S (ν)$
- $S (M (\land) (μ, ν))$ is the intersection $S (μ) \cap S (ν)$
- Consider $α, β \in T$ and denote $X := M (α)$ , $Y := M (β)$ and $p r : X \times Y \to Y$ the projection mapping. Then, $M (\exists_{α β}) (μ) = {p r}_{*} μ$ .
- Consider $α, β \in T$ and denote $X := M (α)$ , $Y := M (β)$ and $p r : X \times Y \to Y$ the projection mapping. Then, $p : M (\forall_{α β}) (μ)$ iff for all $q \in Δ^{c} (X \times Y)$ , if ${p r}_{*} q = p$ then $q : μ$ .
- $M (f_{1} \times f_{2}) = M (f_{1}) \times M (f_{2})$
- $M (f_{1} + f_{2}) = M (f_{1}) ⊔ M (f_{2})$
- $M (f^{- 1}) (μ) = M (f)^{*} (μ)$ .
- $M (g \circ f) = M (g) \circ M (f)$
- $M (┌ μ ┐) = μ$
Finally, for each sentence $ϕ$ in the theory $T$ , we require $M (ϕ) = ⊤_{p t}$ .

Semantic Consequence

Given $ϕ \in V_{(1)}$ , we say $M ⊨ ϕ$ when $M (ϕ) = ⊤_{p t}$ . We say $T ⊨ ϕ$ when for any model $M$ of $T$ , $M ⊨ ϕ$ . It is now interesting to ask what is the computational complexity of deciding $T ⊨ ϕ$ . [EDIT: My current best guess is co-RE]

Applications

As usual, let $A$ be a finite set of actions and $O$ be a finite set of observations. Require that for each $o \in O$ there is $σ_{o} \in T^{ι}$ which we interpret as the type of states producing observation $o$ . Denote $σ_{*} := \sum_{o \in O} σ_{o}$ (the type of all states). Moreover, require that our language has the nonlogical symbols $s_{0} \in V_{(σ_{*})}^{0}$ (the initial state) and, for each $a \in A$ , $K_{a} \in F_{σ_{*} \to (σ_{*})}^{0}$ (the transition kernel). Then, every model defines a (pseudocausal) infra-POMDP. This way we can use symbolic expressions to define infra-Bayesian RL hypotheses. It is then tempting to study the control theoretic and learning theoretic properties of those hypotheses. Moreover, it is natural to introduce a prior which weights those hypotheses by length, analogous to the Solomonoff prior. This leads to some sort of bounded infra-Bayesian algorithmic information theory and a bounded infra-Bayesian analogue of AIXI.
- Vanessa Kosoy Jan 23, 2021, 5:05 PM
  LW: 5 AF: 2
  Let’s also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be segments of higher-order).
  
  0-th order
  
  Syntax
  
  Let $A$ be the set of propositional variables. We define the language $L$ :
  - Any $a \in A$ is also in $L$
  - $⊥ \in L$
  - $⊤ \in L$
  - Given $ϕ, ψ \in L$ , $ϕ \land ψ \in L$
  - Given $ϕ, ψ \in L$ , $ϕ \lor ψ \in L$
  Notice there’s no negation or implication. We define the set of judgements $J := L \times L$ . We write judgements as $ϕ ⊢ ψ$ (“ $ψ$ in the context of $ϕ$ ”). A theory is a subset of $J$ .
  
  Semantics
  
  Given $T \subseteq J$ , a model of $T$ consists of a compact Polish space $X$ and a mapping $M : L \to □ X$ . The latter is required to satisfy:
  - $M (⊥) = ⊥_{X}$
  - $M (⊤) = ⊤_{X}$
  - $M (ϕ \land ψ) = M (ϕ) \land M (ψ)$ . Here, we define $\land$ of infradistributions as intersection of the corresponding sets
  - $M (ϕ \lor ψ) = M (ϕ) \lor M (ψ)$ . Here, we define $\lor$ of infradistributions as convex hull of the corresponding sets
  - For any $ϕ ⊢ ψ \in T$ , $M (ϕ) ⪯ M (ψ)$
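
To make the 0-th order semantics concrete, here is a small Python sketch of a very restricted special case. The assumptions are mine and not part of the definition above: $X$ is finite and every propositional variable is sent to a sharp element $⊤_{C}$ for some subset $C \subseteq X$ . Under that restriction I believe $\land$ and $\lor$ reduce to intersection and union of the underlying subsets, and $⪯$ reduces to inclusion, so a model can be checked with plain set operations. The formula encoding and the names `evaluate` and `is_model` are hypothetical.

```python
def evaluate(phi, X, valuation):
    """Evaluate a formula to the subset C of X representing the sharp element T_C.

    Formulae are nested tuples: ("var", a), ("bot",), ("top",),
    ("and", phi, psi), ("or", phi, psi).
    """
    tag = phi[0]
    if tag == "var":
        return frozenset(valuation[phi[1]])
    if tag == "bot":
        return frozenset()          # bottom corresponds to the empty subset
    if tag == "top":
        return frozenset(X)         # top corresponds to all of X
    left = evaluate(phi[1], X, valuation)
    right = evaluate(phi[2], X, valuation)
    if tag == "and":
        return left & right         # intersection of the corresponding sets
    if tag == "or":
        return left | right         # convex hull collapses to union for sharp elements
    raise ValueError(f"unknown connective: {tag}")

def is_model(theory, X, valuation):
    """Check every judgement (phi, psi), read as phi |- psi, by subset inclusion."""
    return all(evaluate(phi, X, valuation) <= evaluate(psi, X, valuation)
               for phi, psi in theory)
```

For example, with `X = {0, 1, 2}`, `a` sent to `{0, 1}` and `b` sent to `{1, 2}`, the judgement $a \land b ⊢ a$ holds, while $⊤ ⊢ a$ does not. General (non-sharp) models are not covered by this sketch.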

  1-st order
  
  Syntax
  
  We define the language using the usual syntax of 1-st order logic, where the allowed operators are $\land$ , $\lor$ and the quantifiers $\forall$ and $\exists$ . Variables are labeled by types from some set $T$ . For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables $(v_{1} \dots v_{n})$ , we denote $L_{v}$ the set of formulae whose free variables are a subset of $v_{1} \dots v_{n}$ . We define the set of judgements $J := ⋃_{v} L_{v} \times L_{v}$ .
  
  Semantics
  
  Given $T \subseteq J$ , a model of $T$ consists of
  - For every $t \in T$ , a compact Polish space $M (t)$
  - For every $ϕ \in L_{v}$ where $v_{1} \dots v_{n}$ have types $t_{1} \dots t_{n}$ , an element $M_{v} (ϕ)$ of $□ X_{v}$ , where $X_{v} := (\prod_{i = 1}^{n} M (t_{i}))$
  It must satisfy the following:
  - $M_{v} (⊥) = ⊥_{X_{v}}$
  - $M_{v} (⊤) = ⊤_{X_{v}}$
  - $M_{v} (ϕ \land ψ) = M_{v} (ϕ) \land M_{v} (ψ)$
  - $M_{v} (ϕ \lor ψ) = M_{v} (ϕ) \lor M_{v} (ψ)$
  - Consider variables $u_{1} \dots u_{n}$ of types $t_{1} \dots t_{n}$ and variables $v_{1} \dots v_{m}$ of types $s_{1} \dots s_{m}$ . Consider also some $σ : \{1 \dots m\} \to \{1 \dots n\}$ s.t. $s_{i} = t_{σ (i)}$ . Given $ϕ \in L_{v}$ , we can form the substitution $ψ := ϕ [v_{i} = u_{σ (i)}] \in L_{u}$ . We also have a mapping $f_{σ} : X_{u} \to X_{v}$ given by $f_{σ} (x_{1} \dots x_{n}) = (x_{σ (1)} \dots x_{σ (m)})$ . We require $M_{u} (ψ) = f_{σ}^{*} (M_{v} (ϕ))$
  - Consider variables $v_{1} \dots v_{n}$ and $i \in \{1 \dots n\}$ . Denote $p r : X_{v} \to X_{v ∖ v_{i}}$ the projection mapping. We require $M_{v ∖ v_{i}} (\exists v_{i} : ϕ) = {p r}_{*} (M_{v} (ϕ))$
  - Consider variables $v_{1} \dots v_{n}$ and $i \in \{1 \dots n\}$ . Denote $p r : X_{v} \to X_{v ∖ v_{i}}$ the projection mapping. We require that $p : M_{v ∖ v_{i}} (\forall v_{i} : ϕ)$ if and only if, for all $q \in Δ X_{v}$ s.t. ${p r}_{*} q = p$ , $q : M_{v} (ϕ)$
  - For any $ϕ ⊢ ψ \in T$ , $M_{v} (ϕ) ⪯ M_{v} (ψ)$
  - Vanessa Kosoy Nov 1, 2021, 3:49 PM
    LW: 3 AF: 1
    There is a special type of crisp infradistributions that I call “affine infradistributions”: those that, represented as sets, are closed not only under convex linear combinations but also under affine linear combinations. In other words, they are intersections between the space of distributions and some closed affine subspace of the space of signed measures. Conjecture: in 0-th order logic of affine infradistributions, consistency is polynomial-time decidable (whereas for classical logic it is of course NP-hard).
    
    To produce some evidence for the conjecture, let’s consider a slightly different problem. Specifically, introduce a new semantics in which $□ X$ is replaced by the set of linear subspaces of some finite dimensional vector space $V$ . A model $M$ is required to satisfy:
    
    - $M (⊥) = 0$
    - $M (⊤) = V$
    - $M (ϕ \land ψ) = M (ϕ) \cap M (ψ)$
    - $M (ϕ \lor ψ) = M (ϕ) + M (ψ)$
    - For any $ϕ ⊢ ψ \in T$ , $M (ϕ) \subseteq M (ψ)$
    
    If you wish, this is “non-unitary quantum logic”. In this setting, I have a candidate polynomial-time algorithm for deciding consistency. First, we transform $T$ into an equivalent theory s.t. all judgments are of the following forms:
    
    - $a = ⊥$
    - $a = ⊤$
    - $a ⊢ b$
    - Pairs of the form $c = a \land b$ , $d = a \lor b$ .
    
    Here, $a, b, c, d \in A$ are propositional variables and “ $ϕ = ψ$ ” is a shorthand for the pair of judgments $ϕ ⊢ ψ$ and $ψ ⊢ ϕ$ .
    
    Second, we make sure that our $T$ also satisfies the following “closure” properties:
    
    - If $a ⊢ b$ and $b ⊢ c$ are in $T$ then so is $a ⊢ c$
    - If $c = a \land b$ is in $T$ then so are $c ⊢ a$ and $c ⊢ b$
    - If $c = a \lor b$ is in $T$ then so are $a ⊢ c$ and $b ⊢ c$
    - If $c = a \land b$ , $d ⊢ a$ and $d ⊢ b$ are in $T$ then so is $d ⊢ c$
    - If $c = a \lor b$ , $a ⊢ d$ and $b ⊢ d$ are in $T$ then so is $c ⊢ d$
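
The closure step above can be sketched as a simple fixed-point computation. This is only an illustration under my own encoding assumptions: judgements $a ⊢ b$ are pairs `(a, b)`, an equation $c = a \land b$ is a triple `(c, a, b)` in `meets`, and $c = a \lor b$ is a triple `(c, a, b)` in `joins`. The function name `close` is hypothetical. Since there are at most $|A|^{2}$ judgements between variables and every pass adds at least one, the loop runs polynomially many times.

```python
def close(entails, meets, joins):
    """Return the set of judgements (a, b), read as a |- b, closed under the five rules."""
    closed = set(entails)
    # c = a /\ b gives c |- a and c |- b
    for c, a, b in meets:
        closed |= {(c, a), (c, b)}
    # c = a \/ b gives a |- c and b |- c
    for c, a, b in joins:
        closed |= {(a, c), (b, c)}

    changed = True
    while changed:
        new = set()
        # transitivity: a |- b and b |- c give a |- c
        for a, b in closed:
            for b2, c in closed:
                if b == b2:
                    new.add((a, c))
        atoms = {x for pair in closed for x in pair}
        # c = a /\ b, d |- a and d |- b give d |- c
        for c, a, b in meets:
            for d in atoms:
                if (d, a) in closed and (d, b) in closed:
                    new.add((d, c))
        # c = a \/ b, a |- d and b |- d give c |- d
        for c, a, b in joins:
            for d in atoms:
                if (a, d) in closed and (b, d) in closed:
                    new.add((c, d))
        changed = not (new <= closed)
        closed |= new
    return closed
```

For instance, from $d ⊢ a$ , $d ⊢ b$ and $c = a \land b$ the sketch derives $d ⊢ c$ . It is a naive loop, not an optimized algorithm.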
    
    Third, we assign to each $a \in A$ a real-valued variable $x_{a}$ . Then we construct a linear program for these variables consisting of the following inequalities:
    
    - For any $a \in A$ : $0 \leq x_{a} \leq 1$
    - For any $a ⊢ b$ in $T$ : $x_{a} \leq x_{b}$
    - For any pair $c = a \land b$ and $d = a \lor b$ in $T$ : $x_{c} + x_{d} = x_{a} + x_{b}$
    - For any $a = ⊥$ : $x_{a} = 0$
    - For any $a = ⊤$ : $x_{a} = 1$
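
Here is a hedged sketch of how the linear program above could be assembled and tested for feasibility. To stay dependency-free it uses exact Fourier-Motzkin elimination over `fractions.Fraction`, which is my choice for illustration and not part of the proposed algorithm: it can blow up exponentially, so a polynomial-time LP solver would be needed for the complexity claim. The encoding (tuples `(c, d, a, b)` for a pair $c = a \land b$ , $d = a \lor b$ ) and the names `linear`, `build_constraints` and `feasible` are hypothetical.

```python
from fractions import Fraction

def linear(terms, bound):
    """Encode sum(coeff * x_var) <= bound as (coeff dict, bound)."""
    coeffs = {}
    for var, c in terms:
        coeffs[var] = coeffs.get(var, Fraction(0)) + c
    return ({v: c for v, c in coeffs.items() if c != 0}, Fraction(bound))

def build_constraints(atoms, entails=(), pairs=(), bottoms=(), tops=()):
    cons = []

    def equal(terms, value):
        # an equality is encoded as two opposite inequalities
        cons.append(linear(terms, value))
        cons.append(linear([(v, -c) for v, c in terms], -value))

    for a in atoms:
        cons.append(linear([(a, -1)], 0))           # 0 <= x_a
        cons.append(linear([(a, 1)], 1))            # x_a <= 1
    for a, b in entails:
        cons.append(linear([(a, 1), (b, -1)], 0))   # x_a <= x_b
    for c, d, a, b in pairs:
        equal([(c, 1), (d, 1), (a, -1), (b, -1)], 0)  # x_c + x_d = x_a + x_b
    for a in bottoms:
        equal([(a, 1)], 0)                          # x_a = 0
    for a in tops:
        equal([(a, 1)], 1)                          # x_a = 1
    return cons

def feasible(constraints):
    """Decide feasibility by eliminating the variables one at a time."""
    cons = list(constraints)
    variables = {v for coeffs, _ in cons for v in coeffs}
    for var in variables:
        upper, lower, rest = [], [], []
        for coeffs, bound in cons:
            c = coeffs.get(var, 0)
            if c > 0:
                upper.append((coeffs, bound))
            elif c < 0:
                lower.append((coeffs, bound))
            else:
                rest.append((coeffs, bound))
        for cu, bu in upper:
            for cl, bl in lower:
                su, sl = -cl[var], cu[var]          # positive multipliers cancelling var
                terms = [(v, su * c) for v, c in cu.items()]
                terms += [(v, sl * c) for v, c in cl.items()]
                rest.append(linear(terms, su * bu + sl * bl))
        cons = rest
    # only constraints of the form 0 <= bound remain
    return all(bound >= 0 for _, bound in cons)
```

As a sanity check, the theory $a = ⊤$ , $b = ⊥$ , $a ⊢ b$ yields an infeasible program (it would force $1 \leq 0$ ), whereas $c = a \land b$ , $d = a \lor b$ with $c = ⊥$ and $d = ⊤$ is feasible, for example with $x_{a} = x_{b} = 1/2$ .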
    
    Conjecture: the theory is consistent if and only if the linear program has a solution. To see why it might be so, notice that for any model $M$ we can construct a solution by setting
    
    $x_{a} := \frac{\dim M (a)}{\dim M (⊤)}$
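
    To spell out why this assignment solves the program (assuming $\dim M (⊤) > 0$ ): the bounds $0 \leq x_{a} \leq 1$ and the constraints coming from $a ⊢ b$ , $a = ⊥$ and $a = ⊤$ follow from monotonicity of dimension under inclusion together with $M (⊥) = 0$ and $M (⊤) = V$ . The equality constraint is the standard dimension formula for subspaces $U, W \subseteq V$ :

    $\dim (U \cap W) + \dim (U + W) = \dim U + \dim W$

    applied to $U = M (a)$ and $W = M (b)$ , and then divided by $\dim V$ .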
    
    I don’t have a full proof for the converse but here are some arguments. If a solution exists, then it can be chosen to be rational. We can then rescale it to get integers which are candidate dimensions of our subspaces. Consider the space of all ways to choose subspaces of these dimensions s.t. the constraints coming from judgments of the form $a ⊢ b$ are satisfied. This is a moduli space of poset representations. It is easy to see it’s non-empty (just let the subspaces be spans of vectors taken from a fixed basis). By Proposition A.2 in Futorny and Iusenko it is an irreducible algebraic variety. Therefore, to show that we can also satisfy the remaining constraints, it is enough to check that (i) the remaining constraints are open (ii) each of the remaining constraints (considered separately) holds at some point of the variety. The first is highly likely and the second is at least plausible.
    
    The algorithm also seems to have a natural extension to the original infra-Bayesian setting.
    What links here?
    What are the coolest topics in AI safety, to a hopelessly pure mathematician? by Jenny K E (EA Forum; May 7, 2022, 7:18 AM; 89 points)
    Non-Unitary Quantum Logic—SERI MATS Research Sprint by Yegreg (Feb 16, 2023, 7:31 PM; 27 points)
    Vanessa Kosoy's comment on [Closed] Job Offering: Help Communicate Infrabayesianism by abramdemski (Mar 26, 2022, 7:49 AM; 15 points)
- Vanessa Kosoy Jan 16, 2021, 12:03 PM
  LW: 2 AF: 1
  When using infra-Bayesian logic to define a simplicity prior, it is natural to use “axiom circuits” rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce “shorthand” symbols for repeating terms. This doesn’t affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.
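
The exponential gap mentioned above can be seen on a toy family of circuits. This is only an illustrative sketch with a hypothetical encoding: formulae are variable names or tuples `(op, left, right)`, and `definitions` maps each shorthand symbol to the formula it abbreviates. Each shorthand $t_{i}$ uses the previous one twice, so the circuit grows linearly in $k$ while the fully expanded formula grows like $2^{k}$ .

```python
def expand(formula, definitions):
    """Inline every shorthand symbol, producing a plain formula."""
    if isinstance(formula, str):
        if formula in definitions:
            return expand(definitions[formula], definitions)
        return formula
    op, left, right = formula
    return (op, expand(left, definitions), expand(right, definitions))

def size(formula):
    """Number of nodes (variables and connectives) in a formula."""
    if isinstance(formula, str):
        return 1
    _, left, right = formula
    return 1 + size(left) + size(right)

def doubling_circuit(k):
    """Build t_i = (t_{i-1} and b_i) or (t_{i-1} and c_i), with t_0 = a."""
    definitions = {}
    prev = "a"
    for i in range(1, k + 1):
        name = f"t{i}"
        definitions[name] = ("or", ("and", prev, f"b{i}"), ("and", prev, f"c{i}"))
        prev = name
    return definitions, prev

definitions, root = doubling_circuit(10)
circuit_size = sum(size(body) for body in definitions.values())
expanded_size = size(expand(root, definitions))
```

Here `circuit_size` is $7 k$ and `expanded_size` is $6 \cdot 2^{k} - 5$ , since each expansion step doubles the previous formula and adds five nodes.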
- Vanessa Kosoy Jan 16, 2021, 11:54 AM
  LW: 2 AF: 1
  Instead of introducing all the “algebrator” logical symbols, we can define $T$ as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:
  - For any $n \in N$ and $σ \in S_{n}$ (permutation), denote $n := \sum_{i = 1}^{n} 1$ and require $σ^{+} \in F_{n \to n}$
  - For any $n \in N$ and $σ \in S_{n}$ , $σ_{α}^{\times} \in F_{α^{n} \to α^{n}}$
  However, if we do this then it’s not clear whether deciding that an expression is a well-formed term can be done in polynomial time: to check that the types match, we need to test the identity of algebraic expressions, and opening all parentheses might result in something exponentially long.
  - Vanessa Kosoy Jan 24, 2021, 6:10 PM
    LW: 2 AF: 1
    Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over $Q$ , and start from testing the identity of the types appearing inside parentheses), so we can validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).
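
A rough sketch of how such a randomized type-identity test might look, under several assumptions of mine. Types are nested tuples (`("zero",)`, `("one",)`, `("atom", name)`, `("sum", s, t)`, `("prod", s, t)`, `("pred", t)`). Arithmetic is done modulo a fixed large prime rather than over $Q$ , which is a simplification that a careful implementation would have to revisit because coefficients can become large. A predicate type $(α)$ is treated as a fresh variable indexed by the fingerprint of $α$ , which is how I read "start from testing the identity of the types appearing inside parentheses". The names `fingerprint` and `probably_equal` are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime, used as the modulus for fingerprints

def fingerprint(ty, atom_values, pred_values):
    """Evaluate a type expression at a random point, bottom-up."""
    tag = ty[0]
    if tag == "zero":
        return 0
    if tag == "one":
        return 1
    if tag == "atom":
        # each type of individuals gets one random value per trial
        if ty[1] not in atom_values:
            atom_values[ty[1]] = random.randrange(PRIME)
        return atom_values[ty[1]]
    if tag == "pred":
        # (alpha) is a fresh variable determined by the fingerprint of alpha
        inner = fingerprint(ty[1], atom_values, pred_values)
        if inner not in pred_values:
            pred_values[inner] = random.randrange(PRIME)
        return pred_values[inner]
    left = fingerprint(ty[1], atom_values, pred_values)
    right = fingerprint(ty[2], atom_values, pred_values)
    if tag == "sum":
        return (left + right) % PRIME
    if tag == "prod":
        return (left * right) % PRIME
    raise ValueError(f"unknown type constructor: {tag}")

def probably_equal(t1, t2, trials=5):
    """One-sided test: equal types always pass, unequal types pass only with small probability."""
    for _ in range(trials):
        atom_values, pred_values = {}, {}
        if fingerprint(t1, atom_values, pred_values) != fingerprint(t2, atom_values, pred_values):
            return False
    return True
```

For example, $α \times (β + γ)$ and $α \times β + γ \times α$ get the same fingerprint in every trial, while $α \times β$ and $α + β$ are separated except with negligible probability. No expression is ever expanded, so each trial takes time linear in the size of the two types.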