In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.
Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies of them or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the “physical” baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting “selection bias”. We call this setting “anthropic RL” (ARL).
The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the “physical” perspective. For example, we can consider an “anthropic MDP” (AMDP). An AMDP has finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗). The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.
[EDIT: Actually, given that we don’t care about the order of robots, the signature of the transition kernel should be T:A×S→ΔNS]
The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment n∈N and guesses the robot policy π until time n. Then, a simulation of the dynamics of (π,T) is performed until time n, and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time n (we assume Nirvana reward 1 rather than ∞).
This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for n ≫ 1/(1−γ) ≫ 0, the population of robots at time n has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.
Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of “average” you use. So, your epistemology doesn’t have to fix a particular method of counting minds. Instead different counting methods are just different “frames of reference” through which to look, and you can be simultaneously rational in all of them.
Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.
I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?
Let’s consider a simple example. There are two states: s0 and s1. There is just one action so we can ignore it. s0 is the initial state. An s0 robot transitions into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What will our population look like?
0th step: all robots remember s0
1st step: all robots remember s0s1
2nd step: 1⁄2 of robots remember s0s1s0 and 1⁄2 of robots remember s0s1s1
3rd step: 1⁄3 of robots remember s0s1s0s1, 1⁄3 of robots remember s0s1s1s0 and 1⁄3 of robots remember s0s1s1s1
There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have Pr[s0s1s0]=1⁄2 and Pr[s0s1s1]=1⁄2. But, to be consistent with step 3 we must have Pr[s0s1s0]=1⁄3, Pr[s0s1s1]=2⁄3.
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.
Or, at least it’s not obvious there is such a hypothesis. In this example, Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.
[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]
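As a sanity check of the golden-ratio claim above, here is a minimal Python sketch (my own illustration, not part of the original argument) that tracks the exact number of robots remembering each history in this example:

```python
from collections import Counter

# Exact robot counts per remembered history for the example above:
# an s0 robot becomes one s1 robot; an s1 robot becomes an s0 robot and an s1 robot.
pop = Counter({(0,): 1})                    # step 0: every robot remembers s0
for _ in range(11):
    nxt = Counter()
    for hist, count in pop.items():
        if hist[-1] == 0:
            nxt[hist + (1,)] += count       # s0 -> one s1 robot
        else:
            nxt[hist + (0,)] += count       # s1 -> one s0 robot...
            nxt[hist + (1,)] += count       # ...and one s1 robot
    pop = nxt

total = sum(pop.values())
p110 = sum(c for h, c in pop.items() if h[:3] == (0, 1, 0)) / total
p111 = sum(c for h, c in pop.items() if h[:3] == (0, 1, 1)) / total
print(p111 / p110)                          # ~1.618, approaching the golden ratio
```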
Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].
(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution P(s′∗|s,π) (P being the distribution over sets of states s’* given starting state s and policy π), we can define the memory transition distribution P(s′∗m|sm,π) given policy π and starting “memory state” sm∈S∗ (Note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of sm into the transition distribution as the current state. Then for each s′∗ in the domain, for each element in s′∗ we concatenate that element onto the end of sm and collect these s′m into a set s′∗m, which is assigned the same probability P(s′∗).)
So now at time t=2, if you sample a robot, the probability that its state begins with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.
I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
I’m not quite sure what you are trying to say here; probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory, when you’re proving a regret bound or whatever, your probability space consists of histories.
I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.
Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some T:S→ΔNS. We can take its expected value to get ET:S→RS+. Assuming the chain is communicating, ET is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector η∈RS+. We then get the subjective transition kernel:
ST(t∣s) := ET(t∣s)ηt / ∑t′∈S ET(t′∣s)ηt′
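For the two-state example from earlier in this thread (s0 → (s1), s1 → (s0, s1)), here is a short numerical sketch (my own illustration) of this construction: compute the Perron-Frobenius eigenvector of ET and the resulting subjective kernel ST.

```python
import numpy as np

# ET(t|s) = expected number of next-step robots in state t, per robot in state s,
# for the chain s0 -> (s1), s1 -> (s0, s1); rows are s, columns are t.
ET = np.array([[0.0, 1.0],
               [1.0, 1.0]])

eigvals, eigvecs = np.linalg.eig(ET)
eta = np.abs(eigvecs[:, np.argmax(eigvals.real)].real)   # Perron eigenvector, (1, phi) up to scale

ST = ET * eta[None, :]                  # reweight each target state t by eta_t
ST /= ST.sum(axis=1, keepdims=True)     # normalize rows to get the subjective kernel
print(ST)                               # row s1 is ~[0.382, 0.618] = [1/phi^2, 1/phi]
print(ST[1, 1] / ST[1, 0])              # ~1.618, consistent with the golden-ratio limit above
```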
Now, consider the following example of an AMDP. There are three actions A:={a,b,c} and two states S:={s0,s1}. When we apply a to an s0 robot, it creates two s0 robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in one robot whose state is s0 with probability 1⁄2 and s1 with probability 1⁄2.
Consider the following two policies. πa takes the sequence of actions cacaca… and πb takes the sequence of actions cbcbcb…. A population that follows πa would experience the subjective probability ST(s0∣s0,c)=2⁄3, whereas a population that follows πb would experience the subjective probability ST(s0∣s0,c)=1⁄3. Hence, subjective probabilities depend on future actions. So, effectively anthropics produces an acausal (Newcomb-like) environment. And, we already know such environments are learnable by infra-Bayesian RL agents and (most probably) not learnable by Bayesian RL agents.
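A minimal sketch (my own illustration) of this example: it propagates the exact expected population measure over histories and checks the subjective frequency of the first c-transition from s0, recovering 2⁄3 under πa and 1⁄3 under πb.

```python
from fractions import Fraction

# States: 0 = s0, 1 = s1. Expected population measure over histories (exact, via Fractions).
def step(pop, action):
    new = {}
    for hist, mass in pop.items():
        s = hist[-1]
        if action == 'a':
            kids = [(s, Fraction(2 if s == 0 else 1))]    # a doubles s0 robots
        elif action == 'b':
            kids = [(s, Fraction(2 if s == 1 else 1))]    # b doubles s1 robots
        else:                                             # c: one robot, s0 or s1 with prob 1/2
            kids = [(0, Fraction(1, 2)), (1, Fraction(1, 2))]
        for t, m in kids:
            new[hist + (t,)] = new.get(hist + (t,), Fraction(0)) + mass * m
    return new

def subjective_first_c(action_seq):
    pop = {(0,): Fraction(1)}                             # everyone starts in s0
    for act in action_seq:
        pop = step(pop, act)
    total = sum(pop.values())
    went_to_s0 = sum(m for h, m in pop.items() if h[1] == 0)
    return went_to_s0 / total      # fraction of late robots remembering s0 right after the first c

print(subjective_first_c('ca' * 8))   # 2/3 under pi_a
print(subjective_first_c('cb' * 8))   # 1/3 under pi_b
```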
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to “worldly” and just sum the reward calculated at individual timesteps, or analogous to “selfish” and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions—so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
...I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are “selfish”: there is a reward function r:(S×A)∗→R which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the “unrolling” you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain, but (i) instead of A×O labels use A×M labels, where M is the set of possible memory states (a policy is then described by π:M→A), to allow for agents that don’t fully trust their memory (ii) consider another chain with a bigger state space S′ plus a mapping p:S′→NS s.t. the transition kernels are compatible. Here, the semantics of p(s) is: the multiset of ontological states resulting from interpreting the physical state s by taking the viewpoints of the different agents that s contains.
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities.
I didn’t understand “no actual agent in the information-state that corresponds to having those probabilities”. What does it mean to have an agent in the information-state?
Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.
In the following, all infradistributions are crisp.
Fix finite action set A and finite observation set O. For any k∈N and γ∈(0,1), let
Mkγ:(A×O)ω→Δ(A×O)k
be defined by
M^k_γ(h∣d) := (1−γ) ∑_{n=0}^∞ γ^n [[h = d_{n:n+k}]]
In other words, this kernel samples a time step n out of the geometric distribution with parameter γ, and then produces the sequence of length k that appears in the destiny starting at n.
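A tiny sketch (my own illustration) of sampling from this kernel, given a sufficiently long finite prefix of a destiny:

```python
import numpy as np

def sample_Mk(destiny, k, gamma, rng):
    # P(n) = (1 - gamma) * gamma^n for n = 0, 1, 2, ...
    n = int(rng.geometric(1 - gamma)) - 1
    return destiny[n:n + k]          # assumes the given prefix of the destiny is long enough

rng = np.random.default_rng(0)
destiny = [('a', 'o1'), ('b', 'o2')] * 1000   # a long prefix of a periodic destiny
print(sample_Mk(destiny, k=3, gamma=0.9, rng=rng))
```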
For any continuous[1] function D:□(A×O)k→R, we get a decision rule. Namely, this rule says that, given infra-Bayesian law Λ and discount parameter γ, the optimal policy is
π^∗_{DΛ} := argmax_{π:O∗→A} D(M^k_{γ∗}Λ(π))
The usual maximin is recovered when we have some reward function r:(A×O)k→R and corresponding to it is
Dr(Θ):=minθ∈ΘEθ[r]
Given a set H of laws, it is said to be learnable w.r.t.D when there is a family of policies {πγ}γ∈(0,1) such that for any Λ∈H
lim_{γ→1} (max_π D(M^k_{γ∗}Λ(π)) − D(M^k_{γ∗}Λ(πγ))) = 0
For Dr we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any t∈[0,1] we have the learnable decision rule D^t_r(Θ) := t·min_{θ∈Θ}E_θ[r] + (1−t)·max_{θ∈Θ}E_θ[r] (maximin for t=1, maximax for t=0).
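For intuition, here is a minimal sketch (my own illustration) of evaluating this family of rules (t·min + (1−t)·max of the expected reward) on a crisp infradistribution given by finitely many extreme points; since Eθ[r] is linear in θ, the min and max over the credal set are attained at extreme points:

```python
import numpy as np

def D_t(vertices, r, t):
    """vertices: extreme points of the credal set Theta (distributions over (A x O)^k),
    r: reward vector, t in [0, 1]; returns t * min + (1 - t) * max of E_theta[r]."""
    exp_r = [float(np.dot(v, r)) for v in vertices]
    return t * min(exp_r) + (1 - t) * max(exp_r)

Theta = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]   # a hypothetical credal set on 2 outcomes
r = np.array([1.0, 0.0])
print(D_t(Theta, r, 1.0))   # 0.5: the usual maximin D_r
print(D_t(Theta, r, 0.0))   # 0.9: maximax
print(D_t(Theta, r, 0.5))   # 0.7: halfway in between
```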
Also, any monotonically increasing D seems to be learnable, i.e. any D s.t. for Θ1⊆Θ2 we have D(Θ1)≤D(Θ2). For such decision rules, you can essentially assume that “nature” (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.
On the other hand, decision rules of the form Dr1+Dr2 are not learnable in general, and neither are decision rules of the form Dr+D′ for D′ monotonically increasing.
Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?
A positive answer to the above would provide interesting generalizations of infra-Bayesianism. A negative answer to the above would provide an interesting novel justification of the maximin. Indeed, learnability is not a criterion that was ever used in axiomatic constructions of decision theory[3], AFAIK.
We can try considering discontinuous functions as well, but it seems natural to start with continuous. If we want the optimal policy to exist, we usually need D to be at least upper semicontinuous.
There are weaker conditions than “communicating” that are sufficient, e.g. “resettable” (meaning that the agent can always force returning to the initial state), and some even weaker conditions that I will not spell out here.
Consider a one-shot decision theory setting. There is a set of unobservable states S, a set of actions A and a reward function r:A×S→[0,1]. An IBDT agent has some belief β∈□S[1], and it chooses the action a∗:=argmaxa∈AEβ[λs.r(a,s)].
We can construct an equivalent scenario, by augmenting this one with a perfect predictor of the agent (Omega). To do so, define S′:=A×S, where the semantics of (p,s) is “the unobservable state is s and Omega predicts the agent will take action p”. We then define r′:A×S′→[0,1] by r′(a,p,s) := 1_{a=p}·r(a,s) + 1_{a≠p} and β′∈□S′ by Eβ′[f]:=minp∈AEβ[λs.f(p,s)] (β′ is what we call the pullback of β to S′, i.e. we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction.
The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.
For any p∈A, define Ωp∈□S′ by EΩp[f]:=mins∈Sf(p,s). That is, Ωp is an infra-Bayesian representation of the belief “Omega will make prediction p”. For any u∈[0,1], define Ru∈□S′ by E_{Ru}[f] := min_{μ∈ΔS′: Eμ[r(p,s)]≥u} Eμ[f(p,s)]. Ru can be interpreted as the belief “assuming Omega is accurate, the expected reward will be at least u”.
We will also need to use the order ⪯ on □X defined by: ϕ⪯ψ when ∀f∈[0,1]X:Eϕ[f]≥Eψ[f]. The reversal is needed to make the analogy to logic intuitive. Indeed, ϕ⪯ψ can be interpreted as ”ϕ implies ψ“[2], the meet operator ∧ can be interpreted as logical conjunction and the join operator ∨ can be interpreted as logical disjunction.
Claim:
a∗=argmaxa∈Amax{u∈[0,1]∣β′∧Ωa⪯Ru}
(Actually I only checked it when we restrict to crisp infradistributions, in which case ∧ is intersection of sets and ⪯ is set containment, but it’s probably true in general.)
Now, β′∧Ωa⪯Ru can be interpreted as “the conjunction of the belief β′ and Ωa implies Ru”. Roughly speaking, “according to β′, if the predicted action is a then the expected reward is at least u”. So, our decision rule says: choose the action that maximizes the value for which this logical implication holds (but “holds” is better thought of as “is provable”, since we’re talking about the agent’s belief). Which is exactly the decision rule of MUDT!
Technically it’s better to think of it as ”ψ is true in the context of ϕ”, since it’s not another infradistribution so it’s not a genuine implication operator.
I believe that all or most of the claims here are true, but I haven’t written all the proofs in detail, so take it with a grain of salt.
An ambidistribution is a mathematical object that simultaneously generalizes infradistributions and ultradistributions. It is useful for representing how much power an agent has over a particular system: which degrees of freedom it can control, which degrees of freedom obey a known probability distribution and which are completely unpredictable.
Definition 1: Let X be a compact Polish space. A (crisp) ambidistribution on X is a function Q:C(X)→R s.t.
(Monotonicity) For any f,g∈C(X), if f≤g then Q(f)≤Q(g).
(Homogeneity) For any f∈C(X) and λ≥0, Q(λf)=λQ(f).
(Constant-additivity) For any f∈C(X) and c∈R, Q(f+c)=Q(f)+c.
Conditions 1+3 imply that Q is 1-Lipschitz. We could introduce non-crisp ambidistributions by dropping conditions 2 and/or 3 (and e.g. requiring 1-Lipschitz instead), but we will stick to crisp ambidistributions in this post.
The space of all ambidistributions on X will be denoted ♡X.[1] Obviously, □X⊆♡X (where □X stands for (crisp) infradistributions), and likewise for ultradistributions.
Examples
Example 1: Consider compact Polish spaces X,Y,Z and a continuous mapping F:X×Y→Z. We can then define F♡∈♡Z by
F♡(u):=maxθ∈ΔXminη∈ΔYEθ×η[u∘F]
That is, F♡(u) is the value of the zero-sum two-player game with strategy spaces X and Y and utility function u∘F.
Notice that F in Example 1 can be regarded as a Cartesian frame: this seems like a natural connection to explore further.
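A minimal numerical sketch (my own illustration, using scipy) of Example 1 for finite X and Y: the ambidistribution evaluated at u is just the maximin value of the matrix game with payoffs u(F(x,y)), computable by linear programming.

```python
import numpy as np
from scipy.optimize import linprog

def ambi_value(payoff):
    """payoff[i, j] = u(F(x_i, y_j)); returns max_theta min_eta E_(theta x eta)[u o F]."""
    n, m = payoff.shape
    # Variables (v, theta_1..theta_n): maximize v subject to
    # sum_i theta_i * payoff[i, j] >= v for every column j, theta a probability vector.
    c = np.zeros(n + 1)
    c[0] = -1.0
    A_ub = np.hstack([np.ones((m, 1)), -payoff.T])
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.zeros((1, 1)), np.ones((1, n))])
    b_eq = np.array([1.0])
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=[(None, None)] + [(0, None)] * n)
    return -res.fun

u_of_F = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])    # matching pennies
print(ambi_value(u_of_F))           # 0.0
```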
Example 2: Let A and O be finite sets representing actions and observations respectively, and Λ:{O∗→A}→□(A×O)∗ be an infra-Bayesian law. Then, we can define Λ♡∈♡(A×O)∗ by
Λ♡(u):=maxπ:O∗→AEΛ(π)[u]
In fact, this is a faithful representation: Λ can be recovered from Λ♡.
Example 3: Consider an infra-MDP with finite state set S, initial state s0∈S and transition infrakernel T:S×A→□S. We can then define the “ambikernel” T♡:S→♡S by
T♡(s;u):=maxa∈AET(s,a)[u]
Thus, every infra-MDP induces an “ambichain”. Moreover:
Claim 1: ♡ is a monad. In particular, ambikernels can be composed.
This allows us to define
ϕ(γ) := (1−γ) ∑_{n=0}^∞ γ^n (T♡)^n(s0)
This object is the infra-Bayesian analogue of the convex polytope of accessible state occupancy measures in an MDP.
Claim 2: The following limit always exists:
ϕ∗:=limγ→1ϕ(γ)
Legendre-Fenchel Duality
Definition 3: Let D be a convex space and A1,A2…An,B⊆D. We say that B occludes (A1…An) when for any (a1…an)∈A1×…×An, we have
CH(a1…an)∩B≠∅
Here, CH stands for convex hull.
We denote this relation A1…An⊢B. The reason we call this “occlusion” is apparent for the n=2 case.
Here are some properties of occlusion:
For any 1≤i≤n, A1…An⊢Ai.
More generally, if c∈Δ{1…n} then A1…An⊢∑iciAi.
If Φ⊢A and Φ⊆Ψ then Ψ⊢A.
If Φ⊢A and A⊆B then Φ⊢B.
If A1…An⊢B and A′i⊆Ai for all 1≤i≤n, then A′1…A′n⊢B.
If Φ⊢Ai for all 1≤i≤n, and also A1…An⊢B, then Φ⊢B.
Notice that occlusion has similar algebraic properties to logical entailment, if we think of A⊆B as ”B is a weaker proposition than A”.
Definition 4: Let X be a compact Polish space. A cramble set[2] over X is Φ⊆□X s.t.
Φ is non-empty.
Φ is topologically closed.
For any finite Φ0⊆Φ and Θ∈□X, if Φ0⊢Θ then Θ∈Φ. (Here, we interpret elements of □X as credal sets.)
Question: If instead of condition 3, we only consider binary occlusion (i.e. require |Φ0|≤2), do we get the same concept?
Given a cramble set Φ, its Legendre-Fenchel dual ambidistribution is
^Φ(f):=maxΘ∈ΦEΘ[f]
Claim 3: Legendre-Fenchel duality is a bijection between cramble sets and ambidistributions.
Lattice Structure
Functionals
The space ♡X is equipped with the obvious partial order: Q≤P when for all f∈C(X),Q(f)≤P(f). This makes ♡X into a distributive lattice, with
(P∧Q)(f)=min(P(f),Q(f))
(P∨Q)(f)=max(P(f),Q(f))
This is in contrast to □X which is a non-distributive lattice.
The bottom and top elements are given by
⊥(f)=minx∈Xf(x)
⊤(f)=maxx∈Xf(x)
Ambidistributions are closed under pointwise suprema and infima, and hence ♡X is complete and satisfies both infinite distributive laws, making it a complete Heyting and co-Heyting algebra.
♡X is also a De Morgan algebra with the involution
¯Q(f):=−Q(−f)
For X≠∅, ♡X is not a Boolean algebra: ΔX⊆♡X and for any θ∈ΔX we have ¯θ=θ.
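A small sketch (my own illustration) of this structure over a finite set, representing ambidistributions directly as functionals on reward vectors:

```python
import numpy as np

def delta(i):                 # a point distribution, regarded as an ambidistribution
    return lambda f: f[i]
def bottom(f): return float(np.min(f))
def top(f): return float(np.max(f))
def meet(P, Q): return lambda f: min(P(f), Q(f))
def join(P, Q): return lambda f: max(P(f), Q(f))
def invol(Q): return lambda f: -Q(-np.asarray(f))    # the De Morgan involution

f = np.array([0.2, 0.9, 0.4])
print(invol(bottom)(f) == top(f))          # True: the involution swaps bottom and top
print(invol(delta(1))(f) == delta(1)(f))   # True: distributions are fixed points, so no Boolean complement
```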
One application of this partial order is formalizing the “no traps” condition for infra-MDP:
Definition 2: A finite infra-MDP is quasicommunicating when for any s∈S
Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well.
First, let our notion of “hypothesis” be Θ∈□c(Γ×2Γ). The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition
prΓ×2ΓBr(Θ)=Θ
(See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition
Br(Θ)⊇(idΓ×diag2Γ)∗Θ
This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program Q will run, program P has to run as well (ii) whenever programs P and Q are logically equivalent, program P runs iff program Q runs.
The new condition seems to be well-justified, and is also invariant under (i) mixing hypotheses (ii) taking joins/meets of hypotheses. The latter was not the case for the old condition. Moreover, it doesn’t imply that Θ is downward closed, and hence there is no longer a monotonicity principle[1].
The next question is, how do we construct hypotheses satisfying this condition? In the old framework, we could construct hypotheses of the form Ξ∈□c(Γ×Φ) and then apply the bridge transform. In particular, this allows a relatively straightforward translation of physics theories into IBP language (for example our treatment of quantum theory). Luckily, there is an analogous construction in the new framework as well.
First notice that our new condition on Θ can be reformulated as requiring that
suppΘ⊆elΓ
For any s:Γ→Γ define τs:ΔcelΓ→ΔcelΓ by τsθ := χelΓ(s×id2Γ)∗θ. Then, we require τsΘ⊆Θ.
For any Φ, we also define τΦs:Δc(elΓ×Φ)→Δc(elΓ×Φ) by
τΦsθ := χelΓ×Φ(s×id2Γ×Φ)∗θ
Now, for any Ξ∈□c(Γ×Φ), we define the “conservative bridge transform[2]” CBr(Ξ)∈□c(Γ×2Γ×Φ) as the closure of all τΦsθ where θ is a maximal element of Br(Ξ). It is then possible to see that Θ∈□c(Γ×2Γ) is a valid hypothesis if and only if it is of the form prΓ×2ΓCBr(Ξ) for some Φ and Ξ∈□c(Γ×Φ).
I still think the monotonicity principle is saying something about the learning theory of IBP which is still true in the new framework. Namely, it is possible to learn that a program is running but not possible to (confidently) learn that a program is not running, and this limits the sort of frequentist guarantees we can expect.
Intuitively, it can be interpreted as a version of the bridge transform where we postulate that a program doesn’t run unless Ξ contains a reason why it must run.
Quines are non-unique (there can be multiple fixed points). This means that, viewed as a prescriptive theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai’s 3-player Prisoner’s Dilemma and the anti-Newcomb problem[1]. In these cases, a particular UDT/IBP (corresponding to a particular quine) loses to CDT. But, a different UDT/IBP (corresponding to a different quine) might do as well as CDT.
What to do about agents that don’t know their own source-code? (Arguably humans are such.) Upon reflection, this is not really an issue! If we use IBP prescriptively, then we can always assume quining: IBP is just telling you to follow a procedure that uses quining to access its own (i.e. the procedure’s) source code. Effectively, you are instantiating an IBP agent inside yourself with your own prior and utility function. On the other hand, if we use IBP descriptively, then we don’t need quining: Any agent can be assigned “physicalist intelligence” (Definition 1.6 in the original post, can also be extended to not require a known utility function and prior, along the lines of ADAM) as long as the procedure doing the assigning knows its source code. The agent doesn’t need to know its own source code in any sense.
Physicalist agents see themselves as inhabiting an unprivileged position within the universe. However, it’s unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, historically human civilization struggled a lot with accepting the Copernican principle (and is still confused about issues such as free will, anthropics and quantum physics which physicalist agents shouldn’t be confused about). This presents a problem for superimitation.
What if humans are actually cartesian agents? Then, it makes sense to consider a variant of physicalist superimitation where instead of just seeing itself as unprivileged, the AI sees the user as a privileged agent. We call such agents “transcartesian”. Here is how this can be formalized as a modification of IBP.
In IBP, a hypothesis is specified by choosing the state space Φ and the belief Θ∈□(Γ×Φ). In the transcartesian framework, we require that a hypothesis is augmented by a mapping τ:Φ→(A0×O0)≤ω, where A0 is the action set of the reference agent (user) and O0 is the observation set of the reference agent. Given G0 the source code of the reference agent, we require that Θ is supported on the set
{(y,x)∈Γ×Φ ∣ ha⊑τ(x)⟹a=Gy0(h)}
That is, the actions of the reference agent are indeed computed by the source code of the reference agent.
Now, instead of using a loss function of the form L:elΓ→R, we can use a loss function of the form L:(A0×O0)≤ω→R which doesn’t have to satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form L:(A0×O0)≤ω×elΓ→R monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards.
As opposed to physicalist agents, transcartesian agents do suffer from penalties associated with the description complexity of bridge rules (for the reference agent). Such an agent can (for example) come to believe in a simulation hypothesis that is unlikely from a physicalist perspective. However, since such a simulation hypothesis would be compelling for the reference agent as well, this is not an alignment problem (epistemic alignment is maintained).
Up to light editing, the following was written by me during the “Finding the Right Abstractions for healthy systems” research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.
In order to allow R (the set of programs) to be infinite in IBP, we need to define the bridge transform for infinite Γ. At first, it might seem Γ can be allowed to be any compact Polish space, and the bridge transform should only depend on the topology on Γ, but that runs into problems. Instead, the right structure on Γ for defining the bridge transform seems to be that of a “profinite field space”: a category I came up with that I haven’t seen in the literature so far.
The category PFS of profinite field spaces is defined as follows. An object F of PFS is a set ind(F) and a family of finite sets {Fα}α∈ind(F). We denote Tot(F):=∏αFα. Given F and G objects of PFS, a morphism from F to G is a mapping f:Tot(F)→Tot(G) such that there exists R⊆ind(F)×ind(G) with the following properties:
For any α∈ind(F), the set R(α):={β∈ind(G)∣(α,β)∈R} is finite.
For any β∈ind(G), the set R−1(β):={α∈ind(F)∣(α,β)∈R} is finite.
For any β∈ind(G), there exists a mapping fβ:∏α∈R−1(β)Fα→Gβ s.t. for any x∈Tot(F), f(x)β:=fβ(prRβ(x)) where prRβ:Tot(F)→∏α∈R−1(β)Fα is the projection mapping.
The composition of PFS morphisms is just the composition of mappings.
It is easy to see that every PFS morphism is a continuous mapping in the product topology, but the converse is false. However, the converse is true for objects with finite ind (i.e. for such objects any mapping is a morphism). Hence, an object F in PFS can be thought of as Tot(F) equipped with additional structure that is stronger than the topology but weaker than the factorization into Fα.
The name “field space” is inspired by the following observation. Given F an object of PFS, there is a natural condition we can impose on a Borel probability distribution on Tot(F) which makes it a “Markov random field” (MRF). Specifically, μ∈ΔTot(F) is called an MRF if there is an undirected graph G whose vertices are ind(F) and in which every vertex is of finite degree, s.t.μ is an MRF on G in the obvious sense. The property of being an MRF is preserved under pushforwards w.r.t.PFS morphisms.
Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is. (a.k.a. “corrigibility is anti-natural to consequentialist reasoning”). Specifically, alignment protocols that don’t rely on value learning become vastly less safe when combined with IBP:
Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, “steep time discount” just means predominantly caring about your source code running with particular short inputs. Such a goal strongly incentivizes the usual convergent instrumental goals: first take over the world, then run your source code with whatever inputs you want. IBP agents just don’t have time discount in the usual sense: a program running late in physical time is just as good as one running early in physical time.
Example 2: Debate. This protocol relies on a zero-sum game between two AIs. But, the monotonicity principle rules out the possibility of zero-sum! (If L and −L are both monotonic loss functions then L is a constant.) So, in a “debate” between IBP agents, they cooperate to take over the world and then run the source code of each debater with the input “I won the debate”.
Example 3: Forecasting/imitation (an IDA in particular). For an IBP agent, the incentivized strategy is: take over the world, then run yourself with inputs showing you making perfect forecasts.
The conclusion seems to be, it is counterproductive to use IBP to solve the acausal attack problem for most protocols. Instead, you need to do PreDCA or something similar. And, if acausal attack is a serious problem, then approaches that don’t do value learning might be doomed.
Infradistributions admit an information-theoretic quantity that doesn’t exist in classical theory. Namely, it’s a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:
Let X be a finite set and Θ a crisp infradistribution (credal set) on X, i.e. a closed convex subset of ΔX. Then, imagine someone trying to communicate a message by choosing a distribution out of Θ. Formally, let Y be any other finite set (space of messages), θ∈ΔY (prior over messages) and K:Y→Θ (communication protocol). Consider the distribution η:=θ⋉K∈Δ(Y×X). Then, the information capacity of the protocol is the mutual information between the projection on Y and the projection on X according to η, i.e. Iη(prX;prY). The “Knightian entropy” of Θ is now defined to be the maximum of Iη(prX;prY) over all choices of Y, θ, K. For example, if Θ is Bayesian then it’s 0, whereas if Θ=⊤X, it is ln|X|.
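Here is a minimal computational sketch (my own illustration) of this quantity, under the assumption that Θ is a polytope given by its extreme points: since mutual information is convex in the channel for a fixed prior over messages, the maximum is attained by mapping each message to an extreme point, so the Knightian entropy reduces to the capacity of the channel whose rows are those extreme points (computed below by Blahut-Arimoto iteration).

```python
import numpy as np

def kl_rows(K, q):
    # KL(K(y) || q) for each message y, with the convention 0 * log(0) = 0
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(K > 0, K * np.log(K / q), 0.0)
    return terms.sum(axis=1)

def knightian_entropy(extreme_points, iters=1000):
    K = np.array(extreme_points, dtype=float)       # K[y, x] = probability of outcome x given message y
    theta = np.full(K.shape[0], 1.0 / K.shape[0])   # prior over messages
    for _ in range(iters):                          # Blahut-Arimoto updates
        D = kl_rows(K, theta @ K)
        theta = theta * np.exp(D)
        theta /= theta.sum()
    return float(theta @ kl_rows(K, theta @ K))     # mutual information at the optimum, in nats

print(knightian_entropy([[1.0, 0.0], [0.0, 1.0]]))  # ~0.693 = ln 2: total Knightian uncertainty on 2 outcomes
print(knightian_entropy([[0.5, 0.5]]))              # 0.0: a Bayesian (single-point) credal set
```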
Here is one application[1] of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion ϵ of the prior ζ consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.
Let O be the set of observations and A the set of actions (which might be “just” predictions) of our AI, and for any environment τ and prior ξ, let Dξτ(n)∈Δ(A×O)n be the distribution over histories resulting from our algorithm starting with prior ξ and interacting with environment τ for n time steps. We have ζ=ϵμ+(1−ϵ)β, where μ is the malign part of the prior and β the benign part. For any μ′, consider Dϵμ′+(1−ϵ)βτ(n). The closure of the convex hull of these distributions for all choices of μ′ (“attacker policy”) is some Θβτ(n)∈□(A×O)n. The maximal Knightian entropy of Θβτ(n) over all admissible τ and β is called the malign capacity of the algorithm. Essentially, this is a bound on how much information the malign hypotheses can transmit into the world via the AI during a period of n time steps. The goal then becomes finding algorithms with simultaneously good regret bounds and good (in particular, at most polylogarithmic in n) malign capacity bounds.
Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I’m quite confident the overall idea is sound.
We will work with homogenous ultracontributions (HUCs). □X will denote the space of HUCs over X. Given μ∈□X, S(μ)⊆ΔcX will denote the corresponding convex set. Given p∈ΔX and μ∈□X, p:μ will mean p∈S(μ). Given μ,ν∈□X, μ⪯ν will mean S(μ)⊆S(ν).
Syntax
Let Tι denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types T by:
0∈T (intended meaning: the uninhabited type)
1∈T (intended meaning: the one element type)
If α∈Tι then α∈T
If α,β∈T then α+β∈T (intended meaning: disjoint union)
If α,β∈T then α×β∈T (intended meaning: Cartesian product)
If α∈T then (α)∈T (intended meaning: predicates with argument of type α)
For each α,β∈T, there is a set F0α→β which we interpret as atomic terms of type α→β. We will denote V0α:=F01→α. Among those we distinguish the logical atomic terms:
prαβ∈F0α×β→α
iαβ∈F0α→α+β
Symbols we will not list explicitly, that correspond to the algebraic properties of + and × (commutativity, associativity, distributivity and the neutrality of 0 and 1). For example, given α,β∈T there is a “commutator” of type α×β→β×α.
∧α∈F0(α)×(α)→(α) [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
∃αβ∈F0(α×β)→(β)
∀αβ∈F0(α×β)→(β) [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
Assume that for each n∈N there is some Dn⊆□[n]: the set of “describable” ultracontributions [EDIT: it is probably sufficient to only have the fair coin distribution in D2 in order for it to be possible to approximate all ultracontributions on finite sets]. If μ∈Dn then ┌μ┐∈V(∑ni=11)
We recursively define the set of all terms Fα→β. We denote Vα:=F1→α.
If f∈F0α→β then f∈Fα→β
If f1∈Fα1→β1 and f2∈Fα2→β2 then f1×f2∈Fα1×α2→β1×β2
If f1∈Fα1→β1 and f2∈Fα2→β2 then f1+f2∈Fα1+α2→β1+β2
If f∈Fα→β then f−1∈F(β)→(α)
If f∈Fα→β and g∈Fβ→γ then g∘f∈Fα→γ
Elements of V(α) are called formulae. Elements of V(1) are called sentences. A subset of V(1) is called a theory.
Semantics
Given T⊆V(1), a model M of T is the following data. To each α∈T, there must correspond some compact Polish space M(α) s.t.:
M(0)=∅
M(1)=pt (the one point space)
M(α+β)=M(α)⊔M(β)
M(α×β)=M(α)×M(β)
M((α))=□M(α)
To each f∈Fα→β, there must correspond a continuous mapping M(f):M(α)→M(β), under the following constraints:
pr, i, diag and the “algebrators” have to correspond to the obvious mappings.
M(=α)=⊤diagM(α). Here, diagX⊆X×X is the diagonal and ⊤C∈□X is the sharp ultradistribution corresponding to the closed set C⊆X.
Consider α∈T and denote X:=M(α). Then, M(()α)=⊤□X⋉id□X. Here, we use the observation that the identity mapping id□X can be regarded as an infrakernel from □X to X.
M(⊥)=⊥pt
M(⊤)=⊤pt
S(M(∨)(μ,ν)) is the convex hull of S(μ)∪S(ν)
S(M(∧)(μ,ν)) is the intersection of S(μ) and S(ν)
Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, M(∃αβ)(μ)=pr∗μ.
Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, p:M(∀αβ)(μ) iff for all q∈Δc(X×Y), if pr∗q=p then q:μ.
M(f1×f2)=M(f1)×M(f2)
M(f1+f2)=M(f1)⊔M(f2)
M(f−1)(μ)=M(f)∗(μ).
M(g∘f)=M(g)∘M(f)
M(┌μ┐)=μ
Finally, for each ϕ∈T, we require M(ϕ)=⊤pt.
Semantic Consequence
Given ϕ∈V(1), we say M⊨ϕ when M(ϕ)=⊤pt. We say T⊨ϕ when for any model M of T, M⊨ϕ. It is now interesting to ask what is the computational complexity of deciding T⊨ϕ. [EDIT: My current best guess is co-RE]
Applications
As usual, let A be a finite set of actions and O be a finite set of observations. Require that for each o∈O there is σo∈Tι which we interpret as the type of states producing observation o. Denote σ∗:=∑o∈Oσo (the type of all states). Moreover, require that our language has the nonlogical symbols s0∈V0(σ∗) (the initial state) and, for each a∈A, Ka∈F0σ∗→(σ∗) (the transition kernel). Then, every model defines a (pseudocausal) infra-POMDP. This way we can use symbolic expressions to define infra-Bayesian RL hypotheses. It is then tempting to study the control theoretic and learning theoretic properties of those hypotheses. Moreover, it is natural to introduce a prior which weights those hypotheses by length, analogously to the Solomonoff prior. This leads to some sort of bounded infra-Bayesian algorithmic information theory and a bounded infra-Bayesian analogue of AIXI.
Let’s also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be segments of higher-order).
0-th order
Syntax
Let A be the set of propositional variables. We define the language L:
Any a∈A is also in L
⊥∈L
⊤∈L
Given ϕ,ψ∈L, ϕ∧ψ∈L
Given ϕ,ψ∈L, ϕ∨ψ∈L
Notice there’s no negation or implication. We define the set of judgements J:=L×L. We write judgements as ϕ⊢ψ (”ψ in the context of ϕ”). A theory is a subset of J.
Semantics
Given T⊆J, a model of T consists of a compact Polish space X and a mapping M:L→□X. The latter is required to satisfy:
M(⊥)=⊥X
M(⊤)=⊤X
M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of the corresponding sets
M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of the corresponding sets
For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)
1-st order
Syntax
We define the language using the usual syntax of 1-st order logic, where the allowed operators are ∧, ∨ and the quantifiers ∀ and ∃. Variables are labeled by types from some set T. For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables (v1…vn), we denote Lv the set of formulae whose free variables are a subset of v1…vn. We define the set of judgements J:=⋃vLv×Lv.
Semantics
Given T⊆J, a model of T consists of
For every t∈T, a compact Polish space M(t)
For every ϕ∈Lv where v1…vn have types t1…tn, an element Mv(ϕ) of □Xv, where Xv:=(∏ni=1M(ti))
It must satisfy the following:
Mv(⊥)=⊥Xv
Mv(⊤)=⊤Xv
Mv(ϕ∧ψ)=Mv(ϕ)∧Mv(ψ)
Mv(ϕ∨ψ)=Mv(ϕ)∨Mv(ψ)
Consider variables u1…un of types t1…tn and variables v1…vm of types s1…sm. Consider also some σ:{1…m}→{1…n} s.t. si=tσi. Given ϕ∈Lv, we can form the substitution ψ:=ϕ[vi=uσ(i)]∈Lu. We also have a mapping fσ:Xu→Xv given by fσ(x1…xn)=(xσ(1)…xσ(m)). We require Mu(ψ)=f∗(Mv(ϕ))
Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection mapping. We require Mv∖vi(∃vi:ϕ)=pr∗(Mv(ϕ))
Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection mapping. We require that p:Mv∖vi(∀vi:ϕ) if and only if, for all q∈ΔXv s.t. pr∗q=p, q:Mv(ϕ)
There is a special type of crisp infradistributions that I call “affine infradistributions”: those that, represented as sets, are closed not only under convex linear combinations but also under affine linear combinations. In other words, they are intersections between the space of distributions and some closed affine subspace of the space of signed measures. Conjecture: in 0-th order logic of affine infradistributions, consistency is polynomial-time decidable (whereas for classical logic it is ofc NP-hard).
To produce some evidence for the conjecture, let’s consider a slightly different problem. Specifically, introduce a new semantics in which □X is replaced by the set of linear subspaces of some finite dimensional vector space V. A model M is required to satisfy:
M(⊥)=0
M(⊤)=V
M(ϕ∧ψ)=M(ϕ)∩M(ψ)
M(ϕ∨ψ)=M(ϕ)+M(ψ)
For any ϕ⊢ψ∈T, M(ϕ)⊆M(ψ)
If you wish, this is “non-unitary quantum logic”. In this setting, I have a candidate polynomial-time algorithm for deciding consistency. First, we transform T into an equivalent theory s.t. all judgments are of the following forms:
a=⊥
a=⊤
a⊢b
Pairs of the form c=a∧b, d=a∨b.
Here, a,b,c,d∈A are propositional variables and “ϕ=ψ” is a shorthand for the pair of judgments ϕ⊢ψ and ψ⊢ϕ.
Second, we make sure that our T also satisfies the following “closure” properties:
If a⊢b and b⊢c are in T then so is a⊢c
If c=a∧b is in T then so are c⊢a and c⊢b
If c=a∨b is in T then so are a⊢c and b⊢c
If c=a∧b, d⊢a and d⊢b are in T then so is d⊢c
If c=a∨b, a⊢d and b⊢d are in T then so is c⊢d
Third, we assign to each a∈A a real-valued variable xa. Then we construct a linear program for these variables consisting of the following inequalities:
For any a∈A: 0≤xa≤1
For any a⊢b in T: xa≤xb
For any pair c=a∧b and d=a∨b in T: xc+xd=xa+xb
For any a=⊥: xa=0
For any a=⊤: xa=1
Conjecture: the theory is consistent if and only if the linear program has a solution. To see why it might be so, notice that for any model M we can construct a solution by setting
xa := dimM(a) / dimM(⊤)
I don’t have a full proof for the converse but here are some arguments. If a solution exists, then it can be chosen to be rational. We can then rescale it to get integers which are candidate dimensions of our subspaces. Consider the space of all ways to choose subspaces of these dimensions s.t. the constraints coming from judgments of the form a⊢b are satisfied. This is a moduli space of poset representations. It is easy to see it’s non-empty (just let the subspaces be spans of vectors taken from a fixed basis). By Proposition A.2 in Futorny and Iusenko it is an irreducible algebraic variety. Therefore, to show that we can also satisfy the remaining constraints, it is enough to check that (i) the remaining constraints are open (ii) each of the remaining constraints (considered separately) holds at some point of the variety. The first is highly likely and the second is at least plausible.
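A minimal sketch (my own illustration) of the consistency test described above: build the linear program from a normalized theory and check feasibility with scipy. The helper name and the input format (lists of judgement tuples) are mine.

```python
import numpy as np
from scipy.optimize import linprog

def lp_consistent(variables, bottoms, tops, entails, meets_joins):
    """variables: propositional variables; bottoms/tops: variables equal to bot/top;
    entails: pairs (a, b) for a |- b; meets_joins: tuples (c, d, a, b) for the paired
    judgements c = a AND b, d = a OR b. Returns whether the linear program is feasible."""
    idx = {v: i for i, v in enumerate(variables)}
    n = len(variables)
    A_ub, b_ub, A_eq, b_eq = [], [], [], []
    for a, b in entails:                  # x_a <= x_b
        row = np.zeros(n); row[idx[a]] = 1; row[idx[b]] = -1
        A_ub.append(row); b_ub.append(0.0)
    for c, d, a, b in meets_joins:        # x_c + x_d = x_a + x_b
        row = np.zeros(n)
        row[idx[c]] += 1; row[idx[d]] += 1; row[idx[a]] -= 1; row[idx[b]] -= 1
        A_eq.append(row); b_eq.append(0.0)
    for v in bottoms:                     # x_v = 0
        row = np.zeros(n); row[idx[v]] = 1
        A_eq.append(row); b_eq.append(0.0)
    for v in tops:                        # x_v = 1
        row = np.zeros(n); row[idx[v]] = 1
        A_eq.append(row); b_eq.append(1.0)
    res = linprog(np.zeros(n), A_ub or None, b_ub or None, A_eq or None, b_eq or None,
                  bounds=[(0, 1)] * n)    # 0 <= x_v <= 1
    return res.success

# Toy theory: c = a AND b, d = a OR b, a = top, d = bottom (inconsistent).
print(lp_consistent(['a', 'b', 'c', 'd'], bottoms=['d'], tops=['a'],
                    entails=[('c', 'a'), ('c', 'b'), ('a', 'd'), ('b', 'd')],
                    meets_joins=[('c', 'd', 'a', 'b')]))   # False
```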
The algorithm also seems to have a natural extension to the original infra-Bayesian setting.
When using infra-Bayesian logic to define a simplicity prior, it is natural to use “axiom circuits” rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce “shorthand” symbols for repeating terms. This doesn’t affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.
Instead of introducing all the “algebrator” logical symbols, we can define T as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:
For any n∈N and σ∈Sn (permutation), denote n:=∑_{i=1}^n 1 and require σ+∈Fn→n
For any n∈N and σ∈Sn, σ×α∈Fαn→αn
However, if we do this then it’s not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.
Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over Q, and start from testing the identity of the types appearing inside parentheses), so we can validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).
Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to “plain” IBRL regret bounds when we consider the core and the envelope as the “inside” of the agent.
Assume that the action and observation sets factor as A=A0×A1 and O=O0×O1, where (A0,O0) is the interface with the external environment and (A1,O1) is the interface with the envelope.
Let Λ:Π→□(Γ×(A×O)ω) be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:
Marginalizing over Γ. That is, let pr−Γ:Γ×(A×O)ω→(A×O)ω and pr0:(A×O)ω→(A0×O0)ω be the projections. Then, we have the law Λ?:=(pr0pr−Γ)∗∘Λ.
Assuming “logical omniscience”. That is, let τ∗∈Γ be the ground truth. Then, we have the law Λ!:=pr0∗(Λ∣τ∗). Here, we use the conditional defined by Θ∣A := {θ∣A ∣ θ∈argmax_{θ∈Θ}Pr_θ[A]}. It’s easy to see this indeed defines a law.
However, requiring low regret w.r.t. either of these is not equivalent to requiring low regret w.r.t. Λ:
Learning Λ? is typically no less feasible than learning Λ, however it is a much weaker condition. This is because the metacognitive agents can use policies that query the envelope to get higher guaranteed expected utility.
Learning Λ! is a much stronger condition than learning Λ, however it is typically infeasible. Requiring it leads to AIXI-like agents.
Therefore, metacognitive regret bounds hit a “sweet spot” of strength vs. feasibility which produces genuinely more powerful agents than IBRL[1].
More precisely, more powerful than IBRL with the usual sort of hypothesis classes (e.g. nicely structured crisp infra-RDP). In principle, we can reduce metacognitive regret bounds to IBRL regret bounds using non-crisp laws, since there’s a very general theorem for representing desiderata as laws. But, these laws would have a very peculiar form that seems impossible to guess without starting with metacognitive agents.
Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Gödel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.
Here is my proposal for how to formulate a theorem that would make this idea rigorous.
(Wrong) First Attempt
Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis Θ represents an infradistribution over Γ: the “space of counterpossible computational universes”. We can say that Θ is a “true hypothesis” when there is some θ in the credal set Θ (a distribution over Γ) s.t. the ground truth Υ∗∈Γ “looks” as if it’s sampled from θ. The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness.
We can now try to say that Υ∗ is “rich” if for any true hypothesis Θ, there is a refinement Ξ⊆Θ which is also a true hypothesis and “knows” at least one bit of information that Θ doesn’t, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes Υ∗. But, it’s also completely boring: the required Ξ can be constructed by “hardcoding” an additional fact into Θ. This doesn’t look like “discovering interesting structure”, but rather just like brute-force memorization.
(Wrong) Second Attempt
What if instead we require that Ξ knows infinitely many bits of information that Θ doesn’t? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the n-th bit in the sequence depends only on the largest k s.t. 2^k divides n, but the dependence on k is already uncomputable (or at least inexpressible by a finite automaton).
However, for our original application, this is entirely insufficient. This is because the formal language we use to define Γ (e.g. combinator calculus) has some “easy” equivalence relations. For example, consider the family of programs of the form “if 2+2=4 then output 0, otherwise...”. All of those programs would output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many “new” bits of information. Once again, we are left with brute-force memorization.
(Less Wrong) Third Attempt
Here’s the improved condition: For any true hypothesis Θ, there is a true refinement Ξ⊆Θ s.t. conditioning Θ on any finite set of observations cannot produce a refinement of Ξ.
There is a technicality here, because we’re talking about infradistributions, so what is “conditioning” exactly? For credal sets, I think it is sufficient to allow two types of “conditioning”:
For any given observation A and p∈(0,1], we can form {θ∈Θ∣θ(A)≥p}.
For any given observation A s.t. minθ∈Θθ(A)>0, we can form {(θ∣A)∣θ∈Θ}.
This rules-out the counterexample from before: the easy equivalence relation can be represented inside Θ, and then the entire sequence of “novel” bits can be generated by a conditioning.
Alright, so does Υ∗ actually satisfy this condition? I think it’s very probable, but I haven’t proved it yet.
Here is the sketch of a simplified model for how a metacognitive agent deals with traps.
Consider some (unlearnable) prior ζ over environments, s.t. we can efficiently compute the distribution ζ(h) over observations given any history h. For example, any prior over a small set of MDP hypotheses would qualify. Now, for each h, we regard ζ(h) as a “program” that the agent can execute and form beliefs about. In particular, we have a “metaprior” ξ consisting of metahypotheses: hypotheses-about-programs.
For example, if we let every metahypothesis be a small infra-RDP satisfying appropriate assumptions, we probably have an efficient “metalearning” algorithm. More generally, we can allow a metahypothesis to be a learnable mixture of infra-RDPs: for instance, there is a finite state machine for specifying “safe” actions, and the infra-RDPs in the mixture guarantee no long-term loss upon taking safe actions.
In this setting, there are two levels of learning algorithms:
The metalearning algorithm, which learns the correct infra-RDP mixture. The flavor of this algorithm is RL in a setting where we have a simulator of the environment (since we can evaluate ζ(h) for any h). In particular, here we don’t worry about exploitation/exploration tradeoffs.
The “metacontrol” algorithm, which given an infra-RDP mixture, approximates the optimal policy. The flavor of this algorithm is “standard” RL with exploitation/exploration tradeoffs.
In the simplest toy model, we can imagine that metalearning happens entirely in advance of actual interaction with the environment. More realistically, the two need to happen in parallel. It is then natural to apply metalearning to the current environmental posterior rather than the prior (i.e. the histories starting from the history that already occurred). Such an agent satisfies “opportunistic” guarantees: if at any point of time, the posterior admits a useful metahypothesis, the agent can exploit this metahypothesis. Thus, we address both parts of the problem of traps:
The complexity-theoretic part (subproblem 1.2) is addressed by approximating the intractable Bayes-optimality problem by the metacontrol problem of the (coarser) metahypothesis.
The statistical part (subproblem 2.1) is addressed by opportunism: if at some point, we can easily learn something about the physical environment, then we do.
Jobst Heitzig asked me whether infra-Bayesianism has something to say about the absent-minded driver (AMD) problem. Good question! Here is what I wrote in response:
Philosophically, I believe that it is only meaningful to talk about a decision problem when there is also some mechanism for learning the rules of the decision problem. In ordinary Newcombian problems, you can achieve this by e.g. making the problem iterated. In AMD, iteration doesn’t really help because the driver doesn’t remember anything that happened before. We can consider a version of iterated AMD where the driver has a probability 0<ϵ≪1 to remember every intersection, but they always remember whether they arrived at the right destination. Then, it is equivalent to the following Newcombian problem:
With probability 1−2ϵ, counterfactual A happens, in which Omega decides about both intersections via simulating the driver in counterfactuals B and C.
With probability ϵ, counterfactual B happens, in which the driver decides about the first intersection, and Omega decides about the second intersection via simulating the driver in counterfactual C.
With probability ϵ, counterfactual C happens, in which the driver decides about the second intersection, and Omega decides about the first intersection via simulating the driver in counterfactual B.
For this, an IB agent indeed learns the updateless optimal policy (although the learning rate carries an ϵ−1 penalty).
The following was written by me during the “Finding the Right Abstractions for healthy systems” research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.
Here’s an elegant diagrammatic notation for constructing new infrakernels out of given infrakernels. There is probably some natural category-theoretic way to think about it, but at present I don’t know what it is.
By “infrakernel” we will mean a continuous mapping of the form X→□Y, where X and Y are compact Polish spaces and □Y is the space of credal sets (i.e. closed convex sets of probability distributions) over Y.
Syntax
The diagram consists of child vertices, parent vertices, squiggly lines, arrows, dashed arrows and slashes.
There can be solid arrows incoming into the diagram. Each such arrow a is labeled by a compact Polish space D(a) and ends on a parent vertex t(a). And, s(a)=⊥ (i.e. the arrow has no source vertex).
There can be dashed and solid arrows between vertices. Each such arrow a starts from a child vertex s(a) and ends on a parent vertex t(a). We require that P(s(a))≠t(a) (i.e. they should not be also connected by a squiggly line).
There are two types of vertices: parent vertices (denoted by a letter) and child vertices (denoted by a letter or number in a circle).
Each child vertex v is labeled by a compact Polish space D(v) and connected (by a squiggly line) to a unique parent vertex P(v). It may or may not be crossed-out by a slash.
Each parent vertex p is labeled by an infrakernel Kp with source S1×…×Sk and target T1×…×Tl where each Si corresponds to a solid arrow a with t(a)=p and each Tj is D(v) for some child vertex v with P(v)=p. We can also add squares with numbers where solid arrows end to keep track of the correspondence between the arguments of Kp and the arrows.
If s(a)=⊥ then the corresponding Si is D(a).
If s(a)=v≠⊥ then the corresponding Si is D(v).
Semantics
Every diagram D represents an infrakernel KD.
The source space of KD is a product X1×…×Xn, where each Xi is D(a) for some solid arrow a with s(a)=⊥.
The target space of KD is a product Y1×…×Ym, where each Yj is D(v) for some non-crossed-out child vertex.
The value of KD at a given point x is defined as follows. Let ~Y:=∏vD(v) (a product that includes the crossed-out vertices). Then, KD(x) is the set of all marginal distributions (onto the non-crossed-out coordinates) of distributions μ∈Δ~Y satisfying the following condition. Consider any parent vertex p. Let a1,a2…ak be the (dashed or solid) arrows s.t. s(ai)≠⊥ and t(ai)=p. For each such i, choose any yi∈D(s(ai)). We require that Kp(x,y) contains the marginal distribution of μ∣y. Here, the notation Kp(x,y) means we are using the components of x and y corresponding to solid arrows a with t(a)=p.
Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.
Model 1: Let A be the space of arms, O the space of outcomes, r:A×O→R the reward function, X and Y vector spaces, H⊆X the hypothesis space and F:A×O×H→Y a function s.t. for any fixed a∈A and o∈O, F(a,o):H→Y extends to some linear operator Ta,o:X→Y. The semantics of hypothesis h∈H is defined by the equation F(a,o,h)=0 (i.e. an outcome o of action a is consistent with hypothesis h iff this equation holds).
For any h∈H denote by V(h) the reward promised by h:
V(h) := max_{a∈A} min_{o∈O: F(a,o,h)=0} r(a,o)
Then, there is an algorithm with mistake bound dimX, as follows. On round n∈N, let Gn⊆H be the set of unfalsified hypotheses. Choose hn∈Gn optimistically, i.e.
hn:=argmaxh∈GnV(h)
Choose the arm an recommended by hypothesis hn. Let on∈O be the outcome we observed, rn:=r(an,on) the reward we received and h∗∈H the (unknown) true hypothesis.
If rn≥V(hn) then also rn≥V(h∗) (since h∗∈Gn and hence V(h∗)≤V(hn)) and therefore an wasn’t a mistake.
If rn<V(hn) then F(an,on,hn)≠0 (if we had F(an,on,hn)=0 then the minimization in the definition of V(hn) would include r(an,on)). Hence, hn∉Gn+1=Gn∩kerTan,on. This implies dimspan(Gn+1)<dimspan(Gn). Obviously this can happen at most dimX times.
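Here is a minimal concrete instantiation of Model 1 in code, with my own illustrative choices: a finite list of candidate hypotheses standing in for H, deterministic linear outcomes o=⟨a,h⟩ (so F(a,o,h)=⟨a,h⟩−o), and a toy reward function. Each falsification imposes the linear constraint ⟨a_n,h⟩=o_n, which is the G_{n+1}=G_n∩ker T_{a_n,o_n} step above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
arms = [rng.normal(size=d) for _ in range(8)]           # finite arm set A ⊂ R^d
hypotheses = [rng.normal(size=d) for _ in range(50)]    # finite stand-in for H ⊆ X = R^d
h_star = hypotheses[17]                                 # unknown true hypothesis

def outcome(a, h):
    # semantics F(a, o, h) = <a, h> - o = 0: the unique outcome of arm a under h
    return float(a @ h)

def reward(a, o):
    return -abs(o - 1.0)        # toy reward: we want an arm whose outcome is close to 1

def V(h):
    # reward promised by h; outcomes consistent with h are deterministic here,
    # so the inner minimization over consistent outcomes is trivial
    return max(reward(a, outcome(a, h)) for a in arms)

unfalsified = list(hypotheses)
mistakes = 0
for n in range(100):
    h_n = max(unfalsified, key=V)                                   # optimism
    a_n = max(arms, key=lambda a: reward(a, outcome(a, h_n)))       # recommended arm
    o_n = outcome(a_n, h_star)
    if reward(a_n, o_n) < V(h_n):
        mistakes += 1
    # keep only hypotheses consistent with the observation (the kernel constraint)
    unfalsified = [h for h in unfalsified if abs(outcome(a_n, h) - o_n) < 1e-9]
print("mistakes:", mistakes, "(bound: dim X =", d, ")")
```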
Model 2: Let the spaces of arms and hypotheses be
A:=H:=Sd:={x∈Rd+1∣∥x∥=1}
Let the reward r∈R be the only observable outcome, and the semantics of hypothesis h∈Sd be r≥h⋅a. Then, the sample complexity cannot be bound by a polynomial of degree that doesn’t depend on d. This is because Murphy can choose the strategy of producing reward 1−ϵ whenever h⋅a≤1−ϵ. In this case, whatever arm you sample, in each round you can only exclude ball of radius ≈√2ϵ around the sampled arm. The number of such balls that fit into the unit sphere is Ω(ϵ−12d). So, normalized regret below ϵ cannot be guaranteed in less than that many rounds.
For t=1 we get the usual maximin (“pessimism”), for t=0 we get maximax (“optimism”) and for other values of t we get something in the middle (we can call “t-mism”).
It turns out that, in some sense, this new decision rule is actually reducible to ordinary maximin! Indeed, set
μ∗t:=argmaxμEμ[U(a∗t)]
Θt:=tΘ+(1−t)μ∗t
Then we get
a∗(Θt)=a∗t(Θ)
More precisely, any pessimistically optimal action for Θt is t-mistically optimal for Θ (the converse need not be true in general, thanks to the arbitrary choice involved in μ∗t).
To first approximation it means we don’t need to consider t-mistic agents since they are just special cases of “pessimistic” agents. To second approximation, we need to look at what the transformation of Θ to Θt does to the prior. If we start with a simplicity prior then the result is still a simplicity prior. If U has low description complexity and t is not too small then essentially we get full equivalence between “pessimism” and t-mism. If t is small then we get a strictly “narrower” prior (for t=0 we are back at ordinary Bayesianism). However, if U has high description complexity then we get a rather biased simplicity prior. Maybe the latter sort of prior is worth considering.
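A quick numerical check of the reduction (my own sketch; the credal set Θ is represented by finitely many extreme points, which suffices because E_μ[U(a)] is linear in μ, and the utility matrix is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, t = 5, 4, 0.7
U = rng.uniform(size=(n_actions, n_states))             # utility U(a, s)
extreme_pts = rng.dirichlet(np.ones(n_states), size=6)  # Θ = convex hull of these points

def t_mism_value(a, pts):
    ev = pts @ U[a]                                     # E_μ[U(a)] for each extreme point μ
    return t * ev.min() + (1 - t) * ev.max()            # t=1: maximin, t=0: maximax

a_star = max(range(n_actions), key=lambda a: t_mism_value(a, extreme_pts))

# Build Θ_t = t·Θ + (1−t)·μ*_t and check that a_star is maximin-optimal for it
mu_star = extreme_pts[np.argmax(extreme_pts @ U[a_star])]
theta_t = t * extreme_pts + (1 - t) * mu_star
maximin = lambda a: (theta_t @ U[a]).min()
assert max(range(n_actions), key=maximin) == a_star
print("t-mistically optimal action", a_star, "is also maximin-optimal for Θ_t")
```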
Master post for ideas about infra-Bayesianism.
In the anthropic trilemma, Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution.
Consider a population of robots, each of which in a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the “physical” baseline. Indeed, suppose that a particular observation c has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have c much more often in their history than the physical frequency with which c is encountered, due to the resulting “selection bias”. We call this setting “anthropic RL” (ARL).
The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the “physical” perspective. For example, we can consider an “anthropic MDP” (AMDP). An AMDP has finite sets of actions (A) and states (S), and a transition kernel T:A×S→Δ(S∗). The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent.
[EDIT: Actually, given that we don’t care about the order of robots, the signature of the transition kernel should be T:A×S→ΔNS]
The consistency that is violated is exactly the causality property of environments. Luckily, we know how to deal with acausality: using the IB causal-acausal correspondence! The result can be described as follows: Murphy chooses a time moment n∈N and guesses the robot policy π until time n. Then, a simulation of the dynamics of (π,T) is performed until time n, and a single history is sampled from the resulting population. Finally, the observations of the chosen history unfold in reality. If the agent chooses an action different from what is prescribed, Nirvana results. Nirvana also happens after time n (we assume Nirvana reward 1 rather than ∞).
This IB hypothesis is consistent with what the average robot sees at any given moment of time. Therefore, the average robot will learn this hypothesis (assuming learnability). This means that for n≫11−γ≫0, the population of robots at time n has expected average utility with a lower bound close to the optimum for this hypothesis. I think that for an AMDP this should equal the optimum expected average utility you can possibly get, but it would be interesting to verify.
Curiously, the same conclusions should hold if we do a weighted average over the population, with any fixed method of weighting. Therefore, the posterior of the average robot behaves adaptively depending on which sense of “average” you use. So, your epistemology doesn’t have to fix a particular method of counting minds. Instead different counting methods are just different “frames of reference” through which to look, and you can be simultaneously rational in all of them.
Could you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a “memory MDP” that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.
I’m not sure what you mean by that “unrolling”. Can you write a mathematical definition?
Let’s consider a simple example. There are two states: s0 and s1. There is just one action so we can ignore it. s0 is the initial state. An s0 robot transitions into an s1 robot. An s1 robot transitions into an s0 robot and an s1 robot. What will our population look like?
0th step: all robots remember s0
1st step: all robots remember s0s1
2nd step: 1⁄2 of robots remember s0s1s0 and 1⁄2 of robots remember s0s1s1
3rd step: 1⁄3 of robots remembers s0s1s0s1, 1⁄3 of robots remember s0s1s1s0 and 1⁄3 of robots remember s0s1s1s1
There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have Pr[s0s1s0]=1/2 and Pr[s0s1s1]=1/2. But, to be consistent with step 3 we must have Pr[s0s1s0]=1/3, Pr[s0s1s1]=2/3.
In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already.
Or, at least it’s not obvious there is such a hypothesis. In this example, Pr[s0s1s1]/Pr[s0s1s0] will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don’t know, maybe for finite state spaces it can work. Would definitely be interesting to check.
[EDIT: actually, in this example there is such a hypothesis but in general there isn’t, see below]
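For concreteness, here is a small exact-population computation of this example (illustrative code, not from the original discussion). It reproduces the step-2 and step-3 fractions above and shows the ratio Pr[s0s1s1]/Pr[s0s1s0] approaching the golden ratio:

```python
from collections import Counter

def step(pop):
    """pop maps (3-state history prefix, current state) to robot counts."""
    new = Counter()
    for (prefix, state), count in pop.items():
        children = ['s1'] if state == 's0' else ['s0', 's1']   # s0 -> s1 ; s1 -> s0, s1
        for child in children:
            new_prefix = prefix + (child,) if len(prefix) < 3 else prefix
            new[(new_prefix, child)] += count
    return new

pop = Counter({(('s0',), 's0'): 1})                            # one robot remembering s0
for n in range(1, 25):
    pop = step(pop)
    total = sum(pop.values())
    frac = lambda p: sum(c for (pre, _), c in pop.items() if pre == p) / total
    if n in (2, 3, 24):
        a, b = frac(('s0', 's1', 's0')), frac(('s0', 's1', 's1'))
        print(f"step {n}: Pr[s0s1s0]={a:.3f}  Pr[s0s1s1]={b:.3f}  ratio={b/a:.4f}")
```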
Great example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states s0, s1 with the memory states [s0], [s1], [s0,s0], [s0,s1], etc. The action takes a robot in [s0] to memory state [s0,s1], and a robot in [s0,s1] to one robot in [s0,s1,s0] and another in [s0,s1,s1].
(Skip this paragraph unless the specifics of what’s going on aren’t obvious: given a transition distribution P(s′∗|s,π) (P being the distribution over sets of states s’* given starting state s and policy π), we can define the memory transition distribution P(s′∗m|sm,π) given policy π and starting “memory state” sm∈S∗ (Note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of sm into the transition distribution as the current state. Then for each s′∗ in the domain, for each element in s′∗ we concatenate that element onto the end of sm and collect these s′m into a set s′∗m, which is assigned the same probability P(s′∗).)
So now at time t=2, if you sample a robot, the probability that its state begins with [s0,s1,s1] is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it’s just that we’ve turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in.
I’m still confused about what you mean by “Bayesian hypothesis” though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
I’m not quite sure what you are trying to say here; probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It’s like in classical RL theory, when you’re proving a regret bound or whatever, your probability space consists of histories.
Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I’m talking about hypotheses which conform to the classical “cybernetic agent model”. If you wish, we can call it “Bayesian cybernetic hypothesis”.
Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite.
Indeed, consider some T:S→ΔNS. We can take its expected value to get ET:S→R^S_+. Assuming the chain is communicating, ET is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector η∈R^S_+. We then get the subjective transition kernel:
ST(t∣s) = ET(t∣s)η_t / ∑_{t′∈S} ET(t′∣s)η_{t′}
Now, consider the following example of an AMDP. There are three actions A:={a,b,c} and two states S:={s0,s1}. When we apply a to an s0 robot, it creates two s0 robots, whereas when we apply a to an s1 robot, it leaves one s1 robot. When we apply b to an s1 robot, it creates two s1 robots, whereas when we apply b to an s0 robot, it leaves one s0 robot. When we apply c to any robot, it results in one robot whose state is s0 with probability 1/2 and s1 with probability 1/2.
Consider the following two policies. πa takes the sequence of actions cacaca… and πb takes the sequence of actions cbcbcb…. A population that follows πa would experience the subjective probability ST(s0∣s0,c)=2/3, whereas a population that follows πb would experience the subjective probability ST(s0∣s0,c)=1/3. Hence, subjective probabilities depend on future actions. So, effectively anthropics produces an acausal (Newcomb-like) environment. And, we already know such environments are learnable by infra-Bayesian RL agents and (most probably) not learnable by Bayesian RL agents.
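Here is a quick sanity check of these two numbers by direct computation of expected population counts (my own illustrative code). It conditions on the time-k state being s0, where the time-k action is c under both policies, and looks at the state one step later among robots sampled at a later time T:

```python
from collections import defaultdict
from itertools import cycle, islice

def step(pop, action):
    """Exact expected counts: pop maps a history (tuple of states) to the
    expected number of robots remembering it."""
    new = defaultdict(float)
    for hist, w in pop.items():
        s = hist[-1]
        if action == 'c':          # one successor, state s0 or s1 with prob 1/2 each
            new[hist + ('s0',)] += w / 2
            new[hist + ('s1',)] += w / 2
        elif action == 'a':        # a: an s0 robot becomes two s0 robots, s1 stays one s1
            new[hist + (s,)] += 2 * w if s == 's0' else w
        else:                      # b: an s1 robot becomes two s1 robots, s0 stays one s0
            new[hist + (s,)] += 2 * w if s == 's1' else w
    return new

def subjective(policy, T=14, k=4):
    """Among robots alive at time T whose time-k state was s0 (the time-k action
    is 'c' for both policies), the fraction whose time-(k+1) state is s0."""
    pop = {('s0',): 1.0}
    for action in islice(cycle(policy), T):
        pop = step(pop, action)
    num = sum(w for h, w in pop.items() if h[k] == 's0' and h[k + 1] == 's0')
    den = sum(w for h, w in pop.items() if h[k] == 's0')
    return num / den

print("pi_a = cacaca... :", subjective('ca'))   # ≈ 2/3
print("pi_b = cbcbcb... :", subjective('cb'))   # ≈ 1/3
```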
Ah, okay, I see what you mean. Like how preferences are divisible into “selfish” and “worldly” components, where the selfish component is what’s impacted by a future simulation of you that is about to have good things happen to it.
(edit: The reward function in AMDPs can either be analogous to “worldly” and just sum the reward calculated at individual timesteps, or analogous to “selfish” and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.)
I brought up the histories->states thing because I didn’t understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping?
In fact, to me the example is still somewhat murky, because you’re talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions—so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
AMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are “selfish”: there is a reward function r:(S×A)∗→R which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the “unrolling” you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.)
For a more complete framework, we can use an ontology chain, but (i) instead of A×O labels use A×M labels, where M is the set of possible memory states (a policy is then described by π:M→A), to allow for agents that don’t fully trust their memory (ii) consider another chain with a bigger state space S′ plus a mapping p:S′→NS s.t. the transition kernels are compatible. Here, the semantics of p(s) is: the multiset of ontological states resulting from interpreting the physical state s by taking the viewpoints of the different agents that s contains.
I didn’t understand “no actual agent in the information-state that corresponds to having those probabilities”. What does it mean to have an agent in the information-state?
Nevermind, I think I was just looking at it with the wrong class of reward function in mind.
Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.
In the following, all infradistributions are crisp.
Fix finite action set A and finite observation set O. For any k∈N and γ∈(0,1), let
M^k_γ:(A×O)^ω→Δ((A×O)^k) be defined by
M^k_γ(h|d) := (1−γ)∑_{n=0}^∞ γ^n [[h = d_{n:n+k}]]
In other words, this kernel samples a time step n out of the geometric distribution with parameter γ, and then produces the sequence of length k that appears in the destiny starting at n.
For any continuous[1] function D:□(A×O)k→R, we get a decision rule. Namely, this rule says that, given infra-Bayesian law Λ and discount parameter γ, the optimal policy is
π^∗_{DΛ} := argmax_{π:O^∗→A} D((M^k_γ)_∗Λ(π))
The usual maximin is recovered when we have some reward function r:(A×O)^k→R and corresponding to it is
D_r(Θ) := min_{θ∈Θ} E_θ[r]
Given a set H of laws, it is said to be learnable w.r.t. D when there is a family of policies {π_γ}_{γ∈(0,1)} such that for any Λ∈H
lim_{γ→1}(max_π D((M^k_γ)_∗Λ(π)) − D((M^k_γ)_∗Λ(π_γ))) = 0
For D_r we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any t∈[0,1] we have the learnable decision rule
D^t_r := t·max_{θ∈Θ} E_θ[r] + (1−t)·min_{θ∈Θ} E_θ[r]
This is the “mesomism” I talked about before.
Also, any monotonically increasing D seems to be learnable, i.e. any D s.t. for Θ1⊆Θ2 we have D(Θ1)≤D(Θ2). For such decision rules, you can essentially assume that “nature” (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.
On the other hand, decision rules of the form Dr1+Dr2 are not learnable in general, and so are decision rules of the form Dr+D′ for D′ monotonically increasing.
Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?
A positive answer to the above would provide interesting generalizations of infra-Bayesianism. A negative answer to the above would provide an interesting novel justification of the maximin. Indeed, learnability is not a criterion that was ever used in axiomatic constructions of decision theory[3], AFAIK.
We can try considering discontinuous functions as well, but it seems natural to start with continuous. If we want the optimal policy to exist, we usually need D to be at least upper semicontinuous.
There are weaker conditions than “communicating” that are sufficient, e.g. “resettable” (meaning that the agent can always force returning to the initial state), and some even weaker conditions that I will not spell out here.
I mean theorems like VNM, Savage etc.
There is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory (MUDT).
Consider a one-shot decision theory setting. There is a set of unobservable states S, a set of actions A and a reward function r:A×S→[0,1]. An IBDT agent has some belief β∈□S[1], and it chooses the action a∗:=argmaxa∈AEβ[λs.r(a,s)].
We can construct an equivalent scenario, by augmenting this one with a perfect predictor of the agent (Omega). To do so, define S′:=A×S, where the semantics of (p,s) is “the unobservable state is s and Omega predicts the agent will take action p”. We then define r′:A×S′→[0,1] by r′(a,p,s):=1_{a=p}·r(a,s)+1_{a≠p} and β′∈□S′ by Eβ′[f]:=minp∈AEβ[λs.f(p,s)] (β′ is what we call the pullback of β to S′, i.e. we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction.
The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.
For any p∈A, define Ωp∈□S′ by EΩp[f]:=mins∈Sf(p,s). That is, Ωp is an infra-Bayesian representation of the belief “Omega will make prediction p”. For any u∈[0,1], define Ru∈□S′ by ERu[f]:=minμ∈ΔS′:Eμ[r(p,s)]≥uEμ[f(p,s)]. Ru can be interpreted as the belief “assuming Omega is accurate, the expected reward will be at least u”.
We will also need to use the order ⪯ on □X defined by: ϕ⪯ψ when ∀f∈[0,1]X:Eϕ[f]≥Eψ[f]. The reversal is needed to make the analogy to logic intuitive. Indeed, ϕ⪯ψ can be interpreted as ”ϕ implies ψ“[2], the meet operator ∧ can be interpreted as logical conjunction and the join operator ∨ can be interpreted as logical disjunction.
Claim:
a∗=argmaxa∈Amax{u∈[0,1]∣β′∧Ωa⪯Ru}
(Actually I only checked it when we restrict to crisp infradistributions, in which case ∧ is intersection of sets and ⪯ is set containment, but it’s probably true in general.)
Now, β′∧Ωa⪯Ru can be interpreted as “the conjunction of the belief β′ and Ωa implies Ru”. Roughly speaking, “according to β′, if the predicted action is a then the expected reward is at least u”. So, our decision rule says: choose the action that maximizes the value for which this logical implication holds (but “holds” is better thought of as “is provable”, since we’re talking about the agent’s belief). Which is exactly the decision rule of MUDT!
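To spell out the crisp case of the claim (my own verification, so take it with a grain of salt): β′ is the convex hull of {δp×ν ∣ p∈A, ν∈β}, so β′∧Ωa consists exactly of the distributions of the form δa×ν with ν∈β. Containment in Ru then says E_ν[r(a,s)]≥u for every ν∈β, hence max{u∈[0,1] ∣ β′∧Ωa⪯Ru} = min_{ν∈β}E_ν[λs.r(a,s)] = Eβ[λs.r(a,s)], and maximizing this over a recovers exactly the original a∗.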
Apologies for the potential confusion between □ as “space of infradistributions” and the □ of modal logic (not used in this post).
Technically it’s better to think of it as ”ψ is true in the context of ϕ”, since it’s not another infradistribution so it’s not a genuine implication operator.
Ambidistributions
I believe that all or most of the claims here are true, but I haven’t written all the proofs in detail, so take it with a grain of salt.
Ambidistributions are a mathematical object that simultaneously generalizes infradistributions and ultradistributions. It is useful to represent how much power an agent has over a particular system: which degrees of freedom it can control, which degrees of freedom obey a known probability distribution and which are completely unpredictable.
Definition 1: Let X be a compact Polish space. A (crisp) ambidistribution on X is a function Q:C(X)→R s.t.
(Monotonicity) For any f,g∈C(X), if f≤g then Q(f)≤Q(g).
(Homogeneity) For any f∈C(X) and λ≥0, Q(λf)=λQ(f).
(Constant-additivity) For any f∈C(X) and c∈R, Q(f+c)=Q(f)+c.
Conditions 1+3 imply that Q is 1-Lipschitz. We could introduce non-crisp ambidistributions by dropping conditions 2 and/or 3 (and e.g. requiring 1-Lipschitz instead), but we will stick to crisp ambidistributions in this post.
The space of all ambidistributions on X will be denoted ♡X.[1] Obviously, □X⊆♡X (where □X stands for (crisp) infradistributions), and likewise for ultradistributions.
Examples
Example 1: Consider compact Polish spaces X,Y,Z and a continuous mapping F:X×Y→Z. We can then define F♡∈♡Z by
F♡(u) := max_{θ∈ΔX} min_{η∈ΔY} E_{θ×η}[u∘F]
That is, F♡(u) is the value of the zero-sum two-player game with strategy spaces X and Y and utility function u∘F.
Notice that F in Example 1 can be regarded as a Cartesian frame: this seems like a natural connection to explore further.
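A small numerical illustration of Example 1 (my own code, using a random toy F and scipy's LP solver): F♡(u) is computed as the value of the matrix game with payoffs u(F(x,y)), and the three defining conditions of an ambidistribution are then checked on random utility functions.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
nX, nY, nZ = 3, 4, 5
F = rng.integers(0, nZ, size=(nX, nY))        # a toy F : X x Y -> Z

def F_heart(u):
    """max over theta in Delta(X) of min over eta in Delta(Y) of E[u(F(x,y))],
    i.e. the value of the zero-sum game with payoff matrix u(F(x,y))."""
    M = u[F]
    c = np.concatenate([np.zeros(nX), [-1.0]])           # variables (theta, v); maximize v
    A_ub = np.hstack([-M.T, np.ones((nY, 1))])           # v <= sum_x theta_x M[x,y] for all y
    A_eq = np.concatenate([np.ones(nX), [0.0]])[None]    # sum_x theta_x = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(nY), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * nX + [(None, None)])
    return -res.fun

u, g = rng.normal(size=nZ), rng.normal(size=nZ)
assert abs(F_heart(u + 3.0) - (F_heart(u) + 3.0)) < 1e-7   # constant-additivity
assert abs(F_heart(2.5 * u) - 2.5 * F_heart(u)) < 1e-7     # homogeneity
assert F_heart(np.maximum(u, g)) >= F_heart(u) - 1e-9      # monotonicity
print("F_heart(u) =", F_heart(u))
```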
Example 2: Let A and O be finite sets representing actions and observations respectively, and Λ:{O∗→A}→□(A×O)∗ be an infra-Bayesian law. Then, we can define Λ♡∈♡(A×O)∗ by
Λ♡(u) := max_{π:O^∗→A} E_{Λ(π)}[u]
In fact, this is a faithful representation: Λ can be recovered from Λ♡.
Example 3: Consider an infra-MDP with finite state set S, initial state s0∈S and transition infrakernel T:S×A→□S. We can then define the “ambikernel” T♡:S→♡S by
T♡(s;u) := max_{a∈A} E_{T(s,a)}[u]
Thus, every infra-MDP induces an “ambichain”. Moreover:
Claim 1: ♡ is a monad. In particular, ambikernels can be composed.
This allows us to define
ϕ(γ) := (1−γ)∑_{n=0}^∞ γ^n (T♡)^n(s_0)
This object is the infra-Bayesian analogue of the convex polytope of accessible state occupancy measures in an MDP.
Claim 2: The following limit always exists:
ϕ^∗ := lim_{γ→1} ϕ(γ)
Legendre-Fenchel Duality
Definition 3: Let D be a convex space and A1,A2…An,B⊆D. We say that B occludes (A1…An) when for any (a1…an)∈A1×…×An, we have
CH(a_1…a_n) ∩ B ≠ ∅
Here, CH stands for convex hull.
We denote this relation A1…An⊢B. The reason we call this “occlusion” is apparent for the n=2 case.
Here are some properties of occlusion:
For any 1≤i≤n, A1…An⊢Ai.
More generally, if c∈Δ{1…n} then A1…An⊢∑iciAi.
If Φ⊢A and Φ⊆Ψ then Ψ⊢A.
If Φ⊢A and A⊆B then Φ⊢B.
If A1…An⊢B and A′i⊆Ai for all 1≤i≤n, then A′1…A′n⊢B.
If Φ⊢Ai for all 1≤i≤n, and also A1…An⊢B, then Φ⊢B.
Notice that occlusion has similar algebraic properties to logical entailment, if we think of A⊆B as ”B is a weaker proposition than A”.
Definition 4: Let X be a compact Polish space. A cramble set[2] over X is Φ⊆□X s.t.
Φ is non-empty.
Φ is topologically closed.
For any finite Φ0⊆Φ and Θ∈□X, if Φ0⊢Θ then Θ∈Φ. (Here, we interpret elements of □X as credal sets.)
Question: If instead of condition 3, we only consider binary occlusion (i.e. require |Φ0|≤2), do we get the same concept?
Given a cramble set Φ, its Legendre-Fenchel dual ambidistribution is
^Φ(f) := max_{Θ∈Φ} E_Θ[f]
Claim 3: Legendre-Fenchel duality is a bijection between cramble sets and ambidistributions.
Lattice Structure
Functionals
The space ♡X is equipped with the obvious partial order: Q≤P when for all f∈C(X), Q(f)≤P(f). This makes ♡X into a distributive lattice, with
(P∧Q)(f) = min(P(f),Q(f))
(P∨Q)(f) = max(P(f),Q(f))
This is in contrast to □X which is a non-distributive lattice.
The bottom and top elements are given by
⊥(f) = min_{x∈X} f(x)
⊤(f) = max_{x∈X} f(x)
Ambidistributions are closed under pointwise suprema and infima, and hence ♡X is complete and satisfies both infinite distributive laws, making it a complete Heyting and co-Heyting algebra.
♡X is also a De Morgan algebra with the involution
¯Q(f) := −Q(−f)
For X≠∅, ♡X is not a Boolean algebra: ΔX⊆♡X and for any θ∈ΔX we have ¯θ=θ.
One application of this partial order is formalizing the “no traps” condition for infra-MDP:
Definition 2: A finite infra-MDP is quasicommunicating when for any s∈S
lim_{γ→1}(1−γ)∑_{n=0}^∞ γ^n (T♡)^n(s_0) ≤ lim_{γ→1}(1−γ)∑_{n=0}^∞ γ^n (T♡)^n(s)
Claim 4: The set of quasicommunicating finite infra-MDPs (or even infra-RDPs) is learnable.
Cramble Sets
Going to the cramble set representation, ^Φ≤^Ψ iff Φ⊆Ψ.
Φ∧Ψ is just Φ∩Ψ, whereas Φ∨Ψ is the “occlusion hall” of Φ and Ψ.
The bottom and the top cramble sets are
⊥ = {⊤_□}
⊤ = □X
Here, ⊤_□ is the top element of □X (corresponding to the credal set ΔX).
The De Morgan involution is
¯Φ = {Θ∈□X ∣ ∀Ξ∈Φ: Θ∩Ξ≠∅}
Operations
Definition 5: Given X,Y compact Polish spaces and a continuous mapping h:X→Y, we define the pushforward h∗:♡X→♡Y by
h_∗(Q;f) := Q(f∘h)
When h is surjective, there are both a left adjoint and a right adjoint to h_∗, yielding two pullback operators h^∗_min, h^∗_max: ♡Y→♡X:
h^∗_min(Q;f) := min_{g∈C(Y): g∘h≥f} Q(g)
h^∗_max(Q;f) := max_{g∈C(Y): g∘h≤f} Q(g)
Given Q∈♡X and P∈♡Y we can define the semidirect product Q⋉P∈♡(X×Y) by
(Q⋉P)(f) := Q(λx.P(λy.f(x,y)))
There are probably more natural products, but I’ll stop here for now.
Polytopic Ambidistributions
Definition 6: The polytopic ambidistributions ♡polX are the (incomplete) sublattice of ♡X generated by ΔX.
Some conjectures about this:
For finite X, an ambidistribution Q is polytopic iff there is a finite polytope complex C on R^X s.t. for any cell A of C, Q|A is affine.
For finite X, a cramble set Φ is polytopic iff it is the occlusion hall of a finite set of polytopes in ΔX.
ϕ(γ) and ϕ∗ from Example 3 are polytopic.
The non-convex shape ♡ reminds us that ambidistributions need not be convex or concave.
The expression “cramble set” is meant to suggest a combination of “credal set” with “ambi”.
Master post for ideas about infra-Bayesian physicalism.
Other relevant posts:
Incorrigibility in IBP
PreDCA alignment protocol
Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well.
First, let our notion of “hypothesis” be Θ∈□c(Γ×2Γ). The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition
pr_{Γ×2^Γ} Br(Θ) = Θ
(See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition
Br(Θ) ⊇ (id_Γ × diag_{2^Γ})_∗ Θ
This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program Q will run, program P has to run as well (ii) whenever programs P and Q are logically equivalent, program P runs iff program Q runs.
The new condition seems to be well-justified, and is also invariant under (i) mixing hypotheses (ii) taking joins/meets of hypotheses. The latter was not the case for the old condition. Moreover, it doesn’t imply that Θ is downward closed, and hence there is no longer a monotonicity principle[1].
The next question is, how do we construct hypotheses satisfying this condition? In the old framework, we could construct hypotheses of the form Ξ∈□c(Γ×Φ) and then apply the bridge transform. In particular, this allows a relatively straightforward translation of physics theories into IBP language (for example our treatment of quantum theory). Luckily, there is an analogous construction in the new framework as well.
First notice that our new condition on Θ can be reformulated as requiring that
suppΘ⊆elΓ
For any s:Γ→Γ define τ_s: Δ_c elΓ → Δ_c elΓ by τ_s θ := χ_{elΓ}(s×id_{2^Γ})_∗θ. Then, we require τ_s Θ ⊆ Θ.
For any Φ, we also define τΦs:Δc(elΓ×Φ)→Δc(elΓ×Φ) by
τ^Φ_s θ := χ_{elΓ×Φ}(s×id_{2^Γ×Φ})_∗θ
Now, for any Ξ∈□c(Γ×Φ), we define the “conservative bridge transform[2]” CBr(Ξ)∈□c(Γ×2^Γ×Φ) as the closure of all τ^Φ_s θ where θ is a maximal element of Br(Ξ). It is then possible to see that Θ∈□c(Γ×2^Γ) is a valid hypothesis if and only if it is of the form pr_{Γ×2^Γ} CBr(Ξ) for some Φ and Ξ∈□c(Γ×Φ).
I still think the monotonicity principle is saying something about the learning theory of IBP which is still true in the new framework. Namely, it is possible to learn that a program is running but not possible to (confidently) learn that a program is not running, and this limits the sort of frequentist guarantees we can expect.
Intuitively, it can be interpreted as a version of the bridge transform where we postulate that a program doesn’t run unless Ξ contains a reason why it must run.
Two thoughts about the role of quining in IBP:
Quines are non-unique (there can be multiple fixed points). This means that, viewed as a prescriptive theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai’s 3-player Prisoner’s Dilemma and the anti-Newcomb problem[1]. In these cases, a particular UDT/IBP (corresponding to a particular quine) loses to CDT. But, a different UDT/IBP (corresponding to a different quine) might do as well as CDT.
What to do about agents that don’t know their own source-code? (Arguably humans are such.) Upon reflection, this is not really an issue! If we use IBP prescriptively, then we can always assume quining: IBP is just telling you to follow a procedure that uses quining to access its own (i.e. the procedure’s) source code. Effectively, you are instantiating an IBP agent inside yourself with your own prior and utility function. On the other hand, if we use IBP descriptively, then we don’t need quining: Any agent can be assigned “physicalist intelligence” (Definition 1.6 in the original post, can also be extended to not require a known utility function and prior, along the lines of ADAM) as long as the procedure doing the assigning knows its source code. The agent doesn’t need to know its own source code in any sense.
@Squark is my own old LessWrong account.
Physicalist agents see themselves as inhabiting an unprivileged position within the universe. However, it’s unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, historically human civilization struggled a lot with accepting the Copernican principle (and is still confused about issues such as free will, anthropics and quantum physics which physicalist agents shouldn’t be confused about). This presents a problem for superimitation.
What if humans are actually cartesian agents? Then, it makes sense to consider a variant of physicalist superimitation where instead of just seeing itself as unprivileged, the AI sees the user as a privileged agent. We call such agents “transcartesian”. Here is how this can be formalized as a modification of IBP.
In IBP, a hypothesis is specified by choosing the state space Φ and the belief Θ∈□(Γ×Φ). In the transcartesian framework, we require that a hypothesis is augmented by a mapping τ:Φ→(A0×O0)≤ω, where A0 is the action set of the reference agent (user) and O0 is the observation set of the reference agent. Given G0 the source code of the reference agent, we require that Θ is supported on the set
{(y,x)∈Γ×Φ ∣ ha⊑τ(x) ⟹ a=G^y_0(h)}
That is, the actions of the reference agent are indeed computed by the source code of the reference agent.
Now, instead of using a loss function of the form L:elΓ→R, we can use a loss function of the form L:(A0×O0)≤ω→R which doesn’t have to satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form L:(A0×O0)≤ω×elΓ→R monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards.
As opposed to physicalist agents, transcartesian agents do suffer from penalties associated with the description complexity of bridge rules (for the reference agent). Such an agent can (for example) come to believe in a simulation hypothesis that is unlikely from a physicalist perspective. However, since such a simulation hypothesis would be compelling for the reference agent as well, this is not an alignment problem (epistemic alignment is maintained).
Up to light editing, the following was written by me during the “Finding the Right Abstractions for healthy systems” research workshop, hosted by Topos Institute in January 2023. However, I invented the idea before.
In order to allow R (the set of programs) to be infinite in IBP, we need to define the bridge transform for infinite Γ. At first, it might seem Γ can be allowed to be any compact Polish space, and the bridge transform should only depend on the topology on Γ, but that runs into problems. Instead, the right structure on Γ for defining the bridge transform seems to be that of a “profinite field space”: a category I came up with that I haven’t seen in the literature so far.
The category PFS of profinite field spaces is defined as follows. An object F of PFS is a set ind(F) and a family of finite sets {F_α}_{α∈ind(F)}. We denote Tot(F):=∏αFα. Given F and G objects of PFS, a morphism from F to G is a mapping f:Tot(F)→Tot(G) such that there exists R⊆ind(F)×ind(G) with the following properties:
For any α∈ind(F), the set R(α) := {β∈ind(G) ∣ (α,β)∈R} is finite.
For any β∈ind(G), the set R^{−1}(β) := {α∈ind(F) ∣ (α,β)∈R} is finite.
For any β∈ind(G), there exists a mapping fβ:∏α∈R−1(β)Fα→Gβ s.t. for any x∈Tot(F), f(x)β:=fβ(prRβ(x)) where prRβ:Tot(F)→∏α∈R−1(β)Fα is the projection mapping.
The composition of PFS morphisms is just the composition of mappings.
It is easy to see that every PFS morphism is a continuous mapping in the product topology, but the converse is false. However, the converse is true for objects with finite ind (i.e. for such objects any mapping is a morphism). Hence, an object F in PFS can be thought of as Tot(F) equipped with additional structure that is stronger than the topology but weaker than the factorization into Fα.
The name “field space” is inspired by the following observation. Given F an object of PFS, there is a natural condition we can impose on a Borel probability distribution on Tot(F) which makes it a “Markov random field” (MRF). Specifically, μ∈ΔTot(F) is called an MRF if there is an undirected graph G whose vertices are ind(F) and in which every vertex is of finite degree, s.t.μ is an MRF on G in the obvious sense. The property of being an MRF is preserved under pushforwards w.r.t.PFS morphisms.
Infra-Bayesian physicalism is an interesting example in favor of the thesis that the more qualitatively capable an agent is, the less corrigible it is. (a.k.a. “corrigibility is anti-natural to consequentialist reasoning”). Specifically, alignment protocols that don’t rely on value learning become vastly less safe when combined with IBP:
Example 1: Using steep time discount to disincentivize dangerous long-term plans. For IBP, “steep time discount” just means predominantly caring about your source code running with particular short inputs. Such a goal strongly incentivizes the usual convergent instrumental goals: first take over the world, then run your source code with whatever inputs you want. IBP agents just don’t have time discount in the usual sense: a program running late in physical time is just as good as one running early in physical time.
Example 2: Debate. This protocol relies on a zero-sum game between two AIs. But, the monotonicity principle rules out the possibility of zero-sum! (If L and −L are both monotonic loss functions then L is a constant). So, in a “debate” between IBP agents, they cooperate to take over the world and then run the source code of each debater with the input “I won the debate”.
Example 3: Forecasting/imitation (an IDA in particular). For an IBP agent, the incentivized strategy is: take over the world, then run yourself with inputs showing you making perfect forecasts.
The conclusion seems to be, it is counterproductive to use IBP to solve the acausal attack problem for most protocols. Instead, you need to do PreDCA or something similar. And, if acausal attack is a serious problem, then approaches that don’t do value learning might be doomed.
Infradistributions admit an information-theoretic quantity that doesn’t exist in classical theory. Namely, it’s a quantity that measures how many bits of Knightian uncertainty an infradistribution has. We define it as follows:
Let X be a finite set and Θ a crisp infradistribution (credal set) on X, i.e. a closed convex subset of ΔX. Then, imagine someone trying to communicate a message by choosing a distribution out of Θ. Formally, let Y be any other finite set (space of messages), θ∈ΔY (prior over messages) and K:Y→Θ (communication protocol). Consider the distribution η:=θ⋉K∈Δ(Y×X). Then, the information capacity of the protocol is the mutual information between the projection on Y and the projection on X according to η, i.e. Iη(prX;prY). The “Knightian entropy” of Θ is now defined to be the maximum of Iη(prX;prY) over all choices of Y, θ, K. For example, if Θ is Bayesian then it’s 0, whereas if Θ=⊤X, it is ln|X|.
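Here is a sketch of how one might compute this quantity for a finitely-generated credal set (my own code and my own reduction, so treat it as tentative): since mutual information is convex in the channel for a fixed prior over messages, the maximum should be attained with each K(y) an extreme point of Θ, in which case the Knightian entropy is the Shannon capacity of the channel whose rows are the generators, computable by Blahut–Arimoto.

```python
import numpy as np

def knightian_entropy(generators, iters=3000):
    """Knightian entropy (in nats) of the credal set generated by the given
    distributions, assuming the maximum in the definition can be taken with
    each K(y) an extreme point; then it is the capacity of the channel whose
    rows are the generators, computed here by Blahut-Arimoto."""
    W = np.asarray(generators, dtype=float)          # shape (|Y|, |X|)
    theta = np.full(W.shape[0], 1.0 / W.shape[0])    # prior over messages y
    for _ in range(iters):
        q = theta @ W                                # induced marginal on X
        logratio = np.where(W > 0, np.log(np.maximum(W, 1e-300) / np.maximum(q, 1e-300)), 0.0)
        D = (W * logratio).sum(axis=1)               # KL(W_y || q) for each y
        theta *= np.exp(D)
        theta /= theta.sum()
    q = theta @ W
    logratio = np.where(W > 0, np.log(np.maximum(W, 1e-300) / np.maximum(q, 1e-300)), 0.0)
    return float(theta @ (W * logratio).sum(axis=1))

n = 3
print(knightian_entropy(np.eye(n)), "=?", np.log(n))              # top element: ln|X|
print(knightian_entropy([[0.2, 0.5, 0.3]]))                       # Bayesian case: 0
print(knightian_entropy([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]))      # strictly in between
```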
Here is one application[1] of this concept, orthogonal to infra-Bayesianism itself. Suppose we model inner alignment by assuming that some portion ϵ of the prior ζ consists of malign hypotheses. And we want to design e.g. a prediction algorithm that will converge to good predictions without allowing the malign hypotheses to attack, using methods like confidence thresholds. Then we can analyze the following metric for how unsafe the algorithm is.
Let O be the set of observations and A the set of actions (which might be “just” predictions) of our AI, and for any environment τ and prior ξ, let Dξτ(n)∈Δ(A×O)n be the distribution over histories resulting from our algorithm starting with prior ξ and interacting with environment τ for n time steps. We have ζ=ϵμ+(1−ϵ)β, where μ is the malign part of the prior and β the benign part. For any μ′, consider Dϵμ′+(1−ϵ)βτ(n). The closure of the convex hull of these distributions for all choices of μ′ (“attacker policy”) is some Θβτ(n)∈Δ(A×O)n. The maximal Knightian entropy of Θβτ(n) over all admissible τ and β is called the malign capacity of the algorithm. Essentially, this is a bound on how much information the malign hypotheses can transmit into the world via the AI during a period of n. The goal then becomes finding algorithms with simultaneously good regret bounds and good (in particular, at most polylogarithmic in n) malign capacity bounds.
This is an idea I’m collaborating on with Johannes Treutlein.
Infra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I’m quite confident the overall idea is sound.
We will work with homogenous ultracontributions (HUCs). □X will denote the space of HUCs over X. Given μ∈□X, S(μ)⊆ΔcX will denote the corresponding convex set. Given p∈ΔX and μ∈□X, p:μ will mean p∈S(μ). Given μ,ν∈□X, μ⪯ν will mean S(μ)⊆S(ν).
Syntax
Let Tι denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types T by:
0∈T (intended meaning: the uninhabited type)
1∈T (intended meaning: the one element type)
If α∈Tι then α∈T
If α,β∈T then α+β∈T (intended meaning: disjoint union)
If α,β∈T then α×β∈T (intended meaning: Cartesian product)
If α∈T then (α)∈T (intended meaning: predicates with argument of type α)
For each α,β∈T, there is a set F0α→β which we interpret as atomic terms of type α→β. We will denote V0α:=F01→α. Among those we distinguish the logical atomic terms:
prαβ∈F0α×β→α
iαβ∈F0α→α+β
Symbols we will not list explicitly, that correspond to the algebraic properties of + and × (commutativity, associativity, distributivity and the neutrality of 0 and 1). For example, given α,β∈T there is a “commutator” of type α×β→β×α.
=α∈V0(α×α)
diagα∈F0α→α×α
()α∈V0((α)×α) (intended meaning: predicate evaluation)
⊥∈V0(1)
⊤∈V0(1)
∨α∈F0(α)×(α)→(α)
∧α∈F0(α)×(α)→(α) [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
∃αβ∈F0(α×β)→(β)
∀αβ∈F0(α×β)→(β) [EDIT: Actually this doesn’t work because, except for finite sets, the resulting mapping (see semantics section) is discontinuous. There are probably ways to fix this.]
Assume that for each n∈N there is some Dn⊆□[n]: the set of “describable” ultracontributions [EDIT: it is probably sufficient to only have the fair coin distribution in D2 in order for it to be possible to approximate all ultracontributions on finite sets]. If μ∈Dn then ┌μ┐∈V(∑ni=11)
We recursively define the set of all terms Fα→β. We denote Vα:=F1→α.
If f∈F0α→β then f∈Fα→β
If f1∈Fα1→β1 and f2∈Fα2→β2 then f1×f2∈Fα1×α2→β1×β2
If f1∈Fα1→β1 and f2∈Fα2→β2 then f1+f2∈Fα1+α2→β1+β2
If f∈Fα→β then f−1∈F(β)→(α)
If f∈Fα→β and g∈Fβ→γ then g∘f∈Fα→γ
Elements of V(α) are called formulae. Elements of V(1) are called sentences. A subset of V(1) is called a theory.
Semantics
Given T⊆V(1), a model M of T is the following data. To each α∈T, there must correspond some compact Polish space M(α) s.t.:
M(0)=∅
M(1)=pt (the one point space)
M(α+β)=M(α)⊔M(β)
M(α×β)=M(α)×M(β)
M((α))=□M(α)
To each f∈Fα→β, there must correspond a continuous mapping M(f):M(α)→M(β), under the following constraints:
pr, i, diag and the “algebrators” have to correspond to the obvious mappings.
M(=α)=⊤diagM(α). Here, diagX⊆X×X is the diagonal and ⊤C∈□X is the sharp ultradistribution corresponding to the closed set C⊆X.
Consider α∈T and denote X:=M(α). Then, M(()α)=⊤□X⋉id□X. Here, we use the observation that the identity mapping id□X can be regarded as an infrakernel from □X to X.
M(⊥)=⊥pt
M(⊤)=⊤pt
S(M(∨)(μ,ν)) is the convex hull of S(μ)∪S(ν)
S(M(∧)(μ,ν)) is the intersection of S(μ) and S(ν)
Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, M(∃αβ)(μ)=pr∗μ.
Consider α,β∈T and denote X:=M(α), Y:=M(β) and pr:X×Y→Y the projection mapping. Then, p:M(∀αβ)(μ) iff for all q∈Δc(X×Y), if pr∗q=p then q:μ.
M(f1×f2)=M(f1)×M(f2)
M(f1+f2)=M(f1)⊔M(f2)
M(f−1)(μ)=M(f)∗(μ).
M(g∘f)=M(g)∘M(f)
M(┌μ┐)=μ
Finally, for each ϕ∈T, we require M(ϕ)=⊤pt.
Semantic Consequence
Given ϕ∈V(1), we say M⊨ϕ when M(ϕ)=⊤pt. We say T⊨ϕ when for any model M of T, M⊨ϕ. It is now interesting to ask what is the computational complexity of deciding T⊨ϕ. [EDIT: My current best guess is co-RE]
Applications
As usual, let A be a finite set of actions and O be a finite set of observations. Require that for each o∈O there is σo∈Tι which we interpret as the type of states producing observation o. Denote σ∗:=∑o∈Oσo (the type of all states). Moreover, require that our language has the nonlogical symbols s0∈V0(σ∗) (the initial state) and, for each a∈A, Ka∈F0σ∗→(σ∗) (the transition kernel). Then, every model defines a (pseudocausal) infra-POMDP. This way we can use symbolic expressions to define infra-Bayesian RL hypotheses. It is then tempting to study the control theoretic and learning theoretic properties of those hypotheses. Moreover, it is natural to introduce a prior which weights those hypotheses by length, analogous to the Solomonoff prior. This leads to some sort of bounded infra-Bayesian algorithmic information theory and a bounded infra-Bayesian analogue of AIXI.
Let’s also explicitly describe 0th order and 1st order infra-Bayesian logic (although they should be segments of the higher-order version).
0-th order
Syntax
Let A be the set of propositional variables. We define the language L:
Any a∈A is also in L
⊥∈L
⊤∈L
Given ϕ,ψ∈L, ϕ∧ψ∈L
Given ϕ,ψ∈L, ϕ∨ψ∈L
Notice there’s no negation or implication. We define the set of judgements J:=L×L. We write judgements as ϕ⊢ψ (”ψ in the context of ϕ”). A theory is a subset of J.
Semantics
Given T⊆J, a model of T consists of a compact Polish space X and a mapping M:L→□X. The latter is required to satisfy:
M(⊥)=⊥X
M(⊤)=⊤X
M(ϕ∧ψ)=M(ϕ)∧M(ψ). Here, we define ∧ of infradistributions as intersection of the corresponding sets
M(ϕ∨ψ)=M(ϕ)∨M(ψ). Here, we define ∨ of infradistributions as convex hull of the corresponding sets
For any ϕ⊢ψ∈T, M(ϕ)⪯M(ψ)
1-st order
Syntax
We define the language using the usual syntax of 1-st order logic, where the allowed operators are ∧, ∨ and the quantifiers ∀ and ∃. Variables are labeled by types from some set T. For simplicity, we assume no constants, but it is easy to introduce them. For any sequence of variables (v1…vn), we denote Lv the set of formulae whose free variables are a subset of v1…vn. We define the set of judgements J:=⋃vLv×Lv.
Semantics
Given T⊆J, a model of T consists of
For every t∈T, a compact Polish space M(t)
For every ϕ∈Lv where v1…vn have types t1…tn, an element Mv(ϕ) of □Xv, where Xv:=(∏ni=1M(ti))
It must satisfy the following:
Mv(⊥)=⊥Xv
Mv(⊤)=⊤Xv
Mv(ϕ∧ψ)=Mv(ϕ)∧Mv(ψ)
Mv(ϕ∨ψ)=Mv(ϕ)∨Mv(ψ)
Consider variables u1…un of types t1…tn and variables v1…vm of types s1…sm. Consider also some σ:{1…m}→{1…n} s.t. si=tσi. Given ϕ∈Lv, we can form the substitution ψ:=ϕ[vi=uσ(i)]∈Lu. We also have a mapping fσ:Xu→Xv given by fσ(x1…xn)=(xσ(1)…xσ(m)). We require Mu(ψ)=f∗(Mv(ϕ))
Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection mapping. We require Mv∖vi(∃vi:ϕ)=pr∗(Mv(ϕ))
Consider variables v1…vn and i∈{1…n}. Denote pr:Xv→Xv∖vi the projection mapping. We require that p:Mv∖vi(∀vi:ϕ) if and only if, for all q∈ΔXv s.t. pr∗q=p, q:Mv(ϕ)
For any ϕ⊢ψ∈T, Mv(ϕ)⪯Mv(ψ)
There is a special type of crisp infradistributions that I call “affine infradistributions”: those that, represented as sets, are closed not only under convex linear combinations but also under affine linear combinations. In other words, they are intersections between the space of distributions and some closed affine subspace of the space of signed measures. Conjecture: in 0-th order logic of affine infradistributions, consistency is polynomial-time decidable (whereas for classical logic it is ofc NP-hard).
To produce some evidence for the conjecture, let’s consider a slightly different problem. Specifically, introduce a new semantics in which □X is replaced by the set of linear subspaces of some finite dimensional vector space V. A model M is required to satisfy:
M(⊥)=0
M(⊤)=V
M(ϕ∧ψ)=M(ϕ)∩M(ψ)
M(ϕ∨ψ)=M(ϕ)+M(ψ)
For any ϕ⊢ψ∈T, M(ϕ)⊆M(ψ)
If you wish, this is “non-unitary quantum logic”. In this setting, I have a candidate polynomial-time algorithm for deciding consistency. First, we transform T into an equivalent theory s.t. all judgments are of the following forms:
a=⊥
a=⊤
a⊢b
Pairs of the form c=a∧b, d=a∨b.
Here, a,b,c,d∈A are propositional variables and “ϕ=ψ” is a shorthand for the pair of judgments ϕ⊢ψ and ψ⊢ϕ.
Second, we make sure that our T also satisfies the following “closure” properties:
If a⊢b and b⊢c are in T then so is a⊢c
If c=a∧b is in T then so are c⊢a and c⊢b
If c=a∨b is in T then so are a⊢c and b⊢c
If c=a∧b, d⊢a and d⊢b are in T then so is d⊢c
If c=a∨b, a⊢d and b⊢d are in T then so is c⊢d
Third, we assign to each a∈A a real-valued variable xa. Then we construct a linear program for these variables consisting of the following inequalities:
For any a∈A: 0≤xa≤1
For any a⊢b in T: xa≤xb
For any pair c=a∧b and d=a∨b in T: xc+xd=xa+xb
For any a=⊥: xa=0
For any a=⊤: xa=1
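Here is a small sketch of the resulting linear program (my own illustrative code; it takes judgments already in the normal form above as given and skips the closure step):

```python
import numpy as np
from scipy.optimize import linprog

def lp_feasible(variables, entails, meets_joins, bottoms, tops):
    """Feasibility of the linear program described above.
    entails: list of (a, b) for judgments a |- b
    meets_joins: list of (c, d, a, b) for paired judgments c = a AND b, d = a OR b
    bottoms / tops: variables judged equal to bottom / top."""
    idx = {v: i for i, v in enumerate(variables)}
    n = len(variables)
    A_ub, b_ub, A_eq, b_eq = [], [], [], []
    for a, b in entails:                       # x_a <= x_b
        row = np.zeros(n); row[idx[a]], row[idx[b]] = 1, -1
        A_ub.append(row); b_ub.append(0.0)
    for c, d, a, b in meets_joins:             # x_c + x_d = x_a + x_b
        row = np.zeros(n)
        row[idx[c]] += 1; row[idx[d]] += 1; row[idx[a]] -= 1; row[idx[b]] -= 1
        A_eq.append(row); b_eq.append(0.0)
    for v in bottoms:                          # x_v = 0
        row = np.zeros(n); row[idx[v]] = 1
        A_eq.append(row); b_eq.append(0.0)
    for v in tops:                             # x_v = 1
        row = np.zeros(n); row[idx[v]] = 1
        A_eq.append(row); b_eq.append(1.0)
    res = linprog(np.zeros(n), A_ub=np.array(A_ub) if A_ub else None,
                  b_ub=b_ub or None, A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=[(0, 1)] * n)
    return res.status == 0                     # 0 means a feasible point was found

base = dict(variables=list("abcd"),
            entails=[("c", "a"), ("c", "b"), ("a", "d"), ("b", "d")],
            meets_joins=[("c", "d", "a", "b")])
print(lp_feasible(**base, bottoms=[], tops=[]))            # True: LP is feasible
print(lp_feasible(**base, bottoms=["c"], tops=["a", "b"])) # False: LP is infeasible
```

In the second theory the constraints force xc+xd=2 with xc=0 and xd≤1, so the LP is infeasible; under the conjecture this would mean the theory is inconsistent.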
Conjecture: the theory is consistent if and only if the linear program has a solution. To see why it might be so, notice that for any model M we can construct a solution by setting
xa := dim M(a) / dim M(⊤)
I don’t have a full proof for the converse but here are some arguments. If a solution exists, then it can be chosen to be rational. We can then rescale it to get integers which are candidate dimensions of our subspaces. Consider the space of all ways to choose subspaces of these dimensions s.t. the constraints coming from judgments of the form a⊢b are satisfied. This is a moduli space of poset representations. It is easy to see it’s non-empty (just let the subspaces be spans of vectors taken from a fixed basis). By Proposition A.2 in Futorny and Iusenko it is an irreducible algebraic variety. Therefore, to show that we can also satisfy the remaining constraints, it is enough to check that (i) the remaining constraints are open (ii) each of the remaining constraints (considered separately) holds at some point of the variety. The first is highly likely and the second is at least plausible.
The algorithm also seems to have a natural extension to the original infra-Bayesian setting.
When using infra-Bayesian logic to define a simplicity prior, it is natural to use “axiom circuits” rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce “shorthand” symbols for repeating terms. This doesn’t affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.
Instead of introducing all the “algebrator” logical symbols, we can define T as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms:
For any n∈N and σ∈Sn (permutation), denote n:=∑ni=11 and require σ+∈Fn→n
For any n∈N and σ∈Sn, σ×α∈Fαn→αn
However, if we do this then it’s not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.
Actually the Schwartz–Zippel algorithm can easily be adapted to this case (just imagine that types are variables over Q, and start from testing the identity of the types appearing inside parentheses), so we can validate expressions in randomized polynomial time (and, given standard conjectures, in deterministic polynomial time as well).
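For illustration, here is a minimal sketch of that randomized identity test (my own code; in the actual validator one would run it on the type expressions appearing inside parentheses, as described). Type expressions built from +, × and base types are evaluated as polynomials at random points modulo a large prime; by Schwartz–Zippel, distinct polynomials rarely agree at a random point.

```python
import random

# Type expressions: 0, 1, a base-type name (str), ("+", s, t) or ("*", s, t).
def evaluate(ty, assignment, p):
    if ty == 0 or ty == 1:
        return ty % p
    if isinstance(ty, str):
        return assignment[ty]
    op, s, t = ty
    a, b = evaluate(s, assignment, p), evaluate(t, assignment, p)
    return (a + b) % p if op == "+" else (a * b) % p

def probably_equal(ty1, ty2, names, trials=20, p=(1 << 61) - 1):
    """Schwartz-Zippel style identity test: expressions equal modulo the
    algebraic laws evaluate identically as polynomials, while distinct
    polynomials agree at a random point only with small probability."""
    for _ in range(trials):
        assignment = {x: random.randrange(p) for x in names}
        if evaluate(ty1, assignment, p) != evaluate(ty2, assignment, p):
            return False
    return True

# alpha * (beta + 1)  vs  (1 * beta) * alpha + alpha  -- equal by the algebraic laws
t1 = ("*", "alpha", ("+", "beta", 1))
t2 = ("+", ("*", ("*", 1, "beta"), "alpha"), "alpha")
print(probably_equal(t1, t2, ["alpha", "beta"]))                  # True
print(probably_equal(t1, ("+", t2, "beta"), ["alpha", "beta"]))   # False (w.h.p.)
```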
Master post for ideas about metacognitive agents.
Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to “plain” IBRL regret bounds when we consider the core and the envelope as the “inside” of the agent.
Assume that the action and observation sets factor as A=A0×A1 and O=O0×O1, where (A0,O0) is the interface with the external environment and (A1,O1) is the interface with the envelope.
Let Λ:Π→□(Γ×(A×O)ω) be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:
Marginalizing over Γ. That is, let pr−Γ:Γ×(A×O)ω→(A×O)ω and pr0:(A×O)ω→(A0×O0)ω be the projections. Then, we have the law Λ?:=(pr0pr−Γ)∗∘Λ.
Assuming “logical omniscience”. That is, let τ∗∈Γ be the ground truth. Then, we have the law Λ!:=pr0∗(Λ∣τ∗). Here, we use the conditional defined by Θ∣A:={θ∣A∣θ∈argmaxΘPr[A]}. It’s easy to see this indeed defines a law.
However, requiring low regret w.r.t. either of these is not equivalent to requiring low regret w.r.t. Λ:
Learning Λ? is typically no less feasible than learning Λ, however it is a much weaker condition. This is because the metacognitive agents can use policies that query the envelope to get higher guaranteed expected utility.
Learning Λ! is a much stronger condition than learning Λ, however it is typically infeasible. Requiring it leads to AIXI-like agents.
Therefore, metacognitive regret bounds hit a “sweet spot” of strength vs. feasibility which produces genuinely more powerful agents than IBRL[1].
More precisely, more powerful than IBRL with the usual sort of hypothesis classes (e.g. nicely structured crisp infra-RDPs). In principle, we can reduce metacognitive regret bounds to IBRL regret bounds using non-crisp laws, since there’s a very general theorem for representing desiderata as laws. But, these laws would have a very peculiar form that seems impossible to guess without starting with metacognitive agents.
Formalizing the richness of mathematics
Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Godel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.
Here is my proposal for how to formulate a theorem that would make this idea rigorous.
(Wrong) First Attempt
Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis Θ represents an infradistribution over Γ: the “space of counterpossible computational universes”. We can say that Θ is a “true hypothesis” when there is some θ in the credal set Θ (a distribution over Γ) s.t. the ground truth Υ∗∈Γ “looks” as if it’s sampled from θ. The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness.
We can now try to say that Υ∗ is “rich” if for any true hypothesis Θ, there is a refinement Ξ⊆Θ which is also a true hypothesis and “knows” at least one bit of information that Θ doesn’t, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes Υ∗. But, it’s also completely boring: the required Ξ can be constructed by “hardcoding” an additional fact into Θ. This doesn’t look like “discovering interesting structure”, but rather just like brute-force memorization.
(Wrong) Second Attempt
What if instead we require that Ξ knows infinitely many bits of information that Θ doesn’t? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the n-th bit in the sequence depends only on the largest k s.t. 2^k divides n, but the dependence on k is already uncomputable (or at least inexpressible by a finite automaton).
However, for our original application, this is entirely insufficient. This is because the formal language we use to define Γ (e.g. combinator calculus) has some “easy” equivalence relations. For example, consider the family of programs of the form “if 2+2=4 then output 0, otherwise...”. All of those programs would output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many “new” bits of information. Once again, we are left with brute-force memorization.
(Less Wrong) Third Attempt
Here’s the improved condition: For any true hypothesis Θ, there is a true refinement Ξ⊆Θ s.t. conditioning Θ on any finite set of observations cannot produce a refinement of Ξ.
There is a technicality here, because we’re talking about infradistributions, so what is “conditioning” exactly? For credal sets, I think it is sufficient to allow two types of “conditioning”:
For any given observation A and p∈(0,1], we can form {θ∈Θ∣θ(A)≥p}.
For any given observation A s.t. minθ∈Θθ(A)>0, we can form {(θ∣A)∣θ∈Θ}.
This rules-out the counterexample from before: the easy equivalence relation can be represented inside Θ, and then the entire sequence of “novel” bits can be generated by a conditioning.
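To make the two operations concrete, here is a toy finite-support sketch (my own representation: a credal set is stood in for by a finite list of distributions over a small outcome space; a genuine credal set would be closed and convex, which is glossed over here):

```python
import numpy as np

outcomes = ["00", "01", "10", "11"]
A = {"10", "11"}                      # the observation/event we condition on
Theta = [np.array([0.4, 0.1, 0.3, 0.2]),
         np.array([0.1, 0.2, 0.4, 0.3]),
         np.array([0.25, 0.25, 0.25, 0.25])]

def prob(theta, event):
    return sum(theta[i] for i, o in enumerate(outcomes) if o in event)

# First operation: keep only the distributions assigning probability at least p to A.
def restrict(Theta, event, p):
    return [theta for theta in Theta if prob(theta, event) >= p]

# Second operation: Bayes-condition every member on A (needs min_theta theta(A) > 0).
def condition(Theta, event):
    assert min(prob(theta, event) for theta in Theta) > 0
    mask = np.array([float(o in event) for o in outcomes])
    return [theta * mask / prob(theta, event) for theta in Theta]

print(restrict(Theta, A, 0.6))   # only the middle distribution survives
print(condition(Theta, A))       # each member updated on A
```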
Alright, so does Υ∗ actually satisfy this condition? I think it’s very probable, but I haven’t proved it yet.
Recording of a talk I gave in VAISU 2023.
Here is the sketch of a simplified model for how a metacognitive agent deals with traps.
Consider some (unlearnable) prior ζ over environments, s.t. we can efficiently compute the distribution ζ(h) over observations given any history h. For example, any prior over a small set of MDP hypotheses would qualify. Now, for each h, we regard ζ(h) as a “program” that the agent can execute and form beliefs about. In particular, we have a “metaprior” ξ consisting of metahypotheses: hypotheses-about-programs.
For example, if we let every metahypothesis be a small infra-RDP satisfying appropriate assumptions, we probably have an efficient “metalearning” algorithm. More generally, we can allow a metahypothesis to be a learnable mixture of infra-RDPs: for instance, there is a finite state machine for specifying “safe” actions, and the infra-RDPs in the mixture guarantee no long-term loss upon taking safe actions.
In this setting, there are two levels of learning algorithms:
The metalearning algorithm, which learns the correct infra-RDP mixture. The flavor of this algorithm is RL in a setting where we have a simulator of the environment (since we can evaluate ζ(h) for any h). In particular, here we don’t worry about exploitation/exploration tradeoffs.
The “metacontrol” algorithm, which given an infra-RDP mixture, approximates the optimal policy. The flavor of this algorithm is “standard” RL with exploitation/exploration tradeoffs.
In the simplest toy model, we can imagine that metalearning happens entirely in advance of actual interaction with the environment. More realistically, the two need to happen in parallel. It is then natural to apply metalearning to the current environmental posterior rather than the prior (i.e. to the histories starting from the history that has already occurred). Such an agent satisfies “opportunistic” guarantees: if at any point in time the posterior admits a useful metahypothesis, the agent can exploit this metahypothesis. Thus, we address both parts of the problem of traps:
The complexity-theoretic part (subproblem 1.2) is addressed by approximating the intractable Bayes-optimality problem by the metacontrol problem of the (coarser) metahypothesis.
The statistical part (subproblem 2.1) is addressed by opportunism: if at some point, we can easily learn something about the physical environment, then we do.
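Here is a small runnable caricature of this two-level structure (entirely my own toy construction: the “environments” are two biased coins, the metahypotheses are just boolean predicates about the program ζ rather than infra-RDP mixtures, and all names are made up for illustration):

```python
import numpy as np
rng = np.random.default_rng(0)

# zeta is a uniform prior over two biased coins; zeta_predict(h) is the posterior
# predictive probability of the next bit being 1, computable for any history h.
# This map h |-> zeta_predict(h) is the "program" the metalevel forms beliefs about.
biases = np.array([0.9, 0.1])

def zeta_predict(h):
    ones = sum(h)
    lik = biases**ones * (1 - biases)**(len(h) - ones)
    return float((lik / lik.sum()) @ biases)

def simulate_history(length):
    """Roll out the 'simulator': sample an environment from the prior, then observations."""
    b = rng.choice(biases)
    return [int(rng.random() < b) for _ in range(length)]

# Two toy metahypotheses about the program: after 20 observations its output is either
# close to 0.9 or 0.1 ("sharp"), or close to 0.5 ("flat").
def meta_sharp(h):
    return min(abs(zeta_predict(h) - 0.9), abs(zeta_predict(h) - 0.1)) < 0.05
def meta_flat(h):
    return abs(zeta_predict(h) - 0.5) < 0.05

# Level 1, metalearning: score metahypotheses on simulated rollouts. No exploration
# problem here, since we can query the simulator freely.
scores = {"sharp": 0, "flat": 0}
for _ in range(200):
    h = simulate_history(20)
    scores["sharp"] += meta_sharp(h)
    scores["flat"] += meta_flat(h)
print("learned metahypothesis:", max(scores, key=scores.get), scores)

# Level 2, metacontrol: exploit the learned metahypothesis with an ordinary policy,
# e.g. "observe 20 bits, then always bet on the side zeta_predict favors".
def policy_reward():
    b = rng.choice(biases)                    # now the actual environment
    h = [int(rng.random() < b) for _ in range(20)]
    guess = int(zeta_predict(h) > 0.5)
    return sum(int((rng.random() < b) == guess) for _ in range(100)) / 100

print("average reward of the metacontrol policy:",
      round(np.mean([policy_reward() for _ in range(200)]), 2))
```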
Jobst Heitzig asked me whether infra-Bayesianism has something to say about the absent-minded driver (AMD) problem. Good question! Here is what I wrote in response:
The following was written by me during the “Finding the Right Abstractions for healthy systems” research workshop, hosted by Topos Institute in January 2023. However, I came up with the idea earlier.
Here’s an elegant diagrammatic notation for constructing new infrakernels out of given infrakernels. There is probably some natural category-theoretic way to think about it, but at present I don’t know what it is.
By “infrakernel” we will mean a continuous mapping of the form X→□Y, where X and Y are compact Polish spaces and □Y is the space of credal sets (i.e. closed convex sets of probability distributions) over Y.
Syntax
The diagram consists of child vertices, parent vertices, squiggly lines, solid arrows, dashed arrows and slashes.
There can be solid arrows incoming into the diagram. Each such arrow a is labeled by a compact Polish space D(a) and ends on a parent vertex t(a). And, s(a)=⊥ (i.e. the arrow has no source vertex).
There can be dashed and solid arrows between vertices. Each such arrow a starts from a child vertex s(a) and ends on a parent vertex t(a). We require that P(s(a))≠t(a) (i.e. they should not also be connected by a squiggly line).
There are two types of vertices: parent vertices (denoted by a letter) and child vertices (denoted by a letter or number in a circle).
Each child vertex v is labeled by a compact Polish space D(v) and connected (by a squiggly line) to a unique parent vertex P(v). It may or may not be crossed-out by a slash.
Each parent vertex p is labeled by an infrakernel Kp with source S1×…×Sk and target T1×…×Tl, where each Si corresponds to a solid arrow a with t(a)=p and each Tj is D(v) for some child vertex v with P(v)=p. We can also add squares with numbers where solid arrows end, to keep track of the correspondence between the arguments of Kp and the arrows.
If s(a)=⊥ then the corresponding Si is D(a).
If s(a)=v≠⊥ then the corresponding Si is D(v).
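One possible way to encode this syntax in code (purely my own representational choice; spaces are tracked only by name and the infrakernels themselves are left abstract):

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ParentVertex:
    name: str
    kernel: Callable              # K_p : S_1 x ... x S_k -> credal sets over T_1 x ... x T_l

@dataclass
class ChildVertex:
    name: str
    space: str                    # D(v), the space labelling v
    parent: str                   # P(v), the unique parent joined to v by a squiggly line
    crossed_out: bool = False     # crossed-out children are marginalized away in the output

@dataclass
class Arrow:
    target: str                   # t(a), always a parent vertex
    source: Optional[str] = None  # s(a), a child vertex, or None when s(a) = ⊥
    space: Optional[str] = None   # D(a), only for arrows entering the diagram
    dashed: bool = False

@dataclass
class Diagram:
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    arrows: list = field(default_factory=list)

    def validate(self):
        parent_of = {c.name: c.parent for c in self.children}
        for a in self.arrows:
            if a.source is not None:
                # the constraint P(s(a)) != t(a) from the syntax above
                assert parent_of[a.source] != a.target

# Example: one parent p with a single incoming arrow and one non-crossed-out child.
diagram = Diagram(parents=[ParentVertex("p", kernel=lambda x: ...)],  # placeholder kernel
                  children=[ChildVertex("v", space="Y", parent="p")],
                  arrows=[Arrow(target="p", space="X")])
diagram.validate()
```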
Semantics
Every diagram D represents an infrakernel KD.
The source space of KD is a product X1×…×Xn, where each Xi is D(a) for some solid arrow a with s(a)=⊥.
The target space of KD is a product Y1×…×Ym, where each Yj is D(v) for some non-crossed-out child vertex.
The value of KD at a given point x is defined as follows. Let ~Y:=∏vD(v) (a product that includes the crossed-out vertices). Then, KD(x) is the set of all marginal distributions (over the non-crossed-out vertices) of distributions μ∈Δ~Y satisfying the following condition. Consider any parent vertex p. Let a1,a2…ak be the (dashed or solid) arrows s.t. s(ai)≠⊥ and t(ai)=p. For each such i, choose any yi∈D(s(ai)). We require that Kp(x,y) contains the marginal distribution of μ∣y. Here, the notation Kp(x,y) means we are using the components of x and y corresponding to solid arrows a with t(a)=p.
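As a sanity check of the kind of objects involved, here is the simplest special case of combining infrakernels, namely sequential composition over finite spaces (my own illustration, not the general diagram semantics above; credal sets are represented by finite lists of their extreme points, and all the numbers are toy data):

```python
import itertools
import numpy as np

Y, Z = 3, 2   # sizes of the intermediate and final spaces

def K1(x):    # infrakernel X -> credal sets over Y (toy data; ignores x)
    return [np.array([0.5, 0.5, 0.0]),
            np.array([0.2, 0.3, 0.5])]

def K2(y):    # infrakernel Y -> credal sets over Z (toy data)
    return ([np.array([1.0, 0.0]), np.array([0.6, 0.4])] if y == 0
            else [np.array([0.3, 0.7])])

def compose(K1, K2, x, n_mid):
    """Generating points of (K2 after K1)(x): all mixtures E_{y~mu}[nu_y], where mu is an
    extreme point of K1(x) and nu_y is an extreme point of K2(y), chosen independently for
    each y; the credal set of the composition is the convex hull of these points."""
    points = []
    for mu in K1(x):
        for selection in itertools.product(*(K2(y) for y in range(n_mid))):
            points.append(sum(mu[y] * selection[y] for y in range(n_mid)))
    return points

for p in compose(K1, K2, x=0, n_mid=Y):
    print(np.round(p, 3))
```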
Two deterministic toy models for regret bounds of infra-Bayesian bandits. The lesson seems to be that equalities are much easier to learn than inequalities.
Model 1: Let A be the space of arms, O the space of outcomes, r:A×O→R the reward function, X and Y finite-dimensional vector spaces, H⊆X the hypothesis space and F:A×O×H→Y a function s.t. for any fixed a∈A and o∈O, F(a,o):H→Y extends to some linear operator T_{a,o}:X→Y. The semantics of hypothesis h∈H is defined by the equation F(a,o,h)=0 (i.e. an outcome o of action a is consistent with hypothesis h iff this equation holds).
For any h∈H denote by V(h) the reward promised by h:
V(h) := max_{a∈A} min_{o∈O: F(a,o,h)=0} r(a,o)
Then, there is an algorithm with mistake bound dim X, as follows. On round n∈N, let Gn⊆H be the set of unfalsified hypotheses. Choose hn∈Gn optimistically, i.e.
hn := argmax_{h∈Gn} V(h)
Choose the arm an recommended by hypothesis hn. Let on∈O be the outcome we observed, rn:=r(an,on) the reward we received and h∗∈H the (unknown) true hypothesis.
If rn≥V(hn) then also rn≥V(h∗) (since h∗∈Gn and hence V(h∗)≤V(hn)) and therefore an wasn’t a mistake.
If rn<V(hn) then F(an,on,hn)≠0 (if we had F(an,on,hn)=0 then the minimization in the definition of V(hn) would include r(an,on), giving V(hn)≤rn, a contradiction). Hence, hn∉Gn+1 = Gn∩ker T_{an,on}. This implies dim span(Gn+1) < dim span(Gn). Obviously this can happen at most dim X times.
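Here is a small runnable instantiation of Model 1 and of the algorithm above (the concrete data, the homogeneous encoding and the environment’s tie-breaking are my own choices, made just to get a self-contained example):

```python
import itertools
import numpy as np

# Homogeneous encoding: a hypothesis is h = (w, 1) in X = R^{d+1}, and
# F(a, o, h) = a.w - o*h[-1], which is linear in h; so outcome o is consistent with h
# under arm a iff o = a.w. The reward is just the observed outcome.
d = 2
arms = [np.array(v) for v in [(1, 0), (0, 1), (1, 1)]]
outcomes = list(range(5))   # contains every possible value of a.w below
hypotheses = [np.array((w1, w2, 1)) for w1, w2 in itertools.product(range(3), repeat=2)]

def F(a, o, h):
    return np.dot(a, h[:d]) - o * h[d]

def reward(o):
    return float(o)

def V_and_arm(h):
    """V(h) = max_a min_{o: F(a,o,h)=0} r(a,o), together with the recommended arm."""
    best = (-np.inf, None)
    for i, a in enumerate(arms):
        consistent = [o for o in outcomes if F(a, o, h) == 0]
        if consistent:
            best = max(best, (min(reward(o) for o in consistent), i))
    return best

h_star = hypotheses[7]     # the (unknown) true hypothesis, here w = (2, 1)
G = list(hypotheses)       # G_n: unfalsified hypotheses
mistakes = 0
for n in range(6):
    V_n, a_n = max(V_and_arm(h) for h in G)          # optimism over G_n
    o_n = int(np.dot(arms[a_n], h_star[:d]))         # the outcome consistent with h_star
    if reward(o_n) < V_n:
        mistakes += 1
    G = [h for h in G if F(arms[a_n], o_n, h) == 0]  # falsification update
    print(f"round {n}: arm {a_n}, reward {reward(o_n)}, promised {V_n}, |G| = {len(G)}")
print("mistakes:", mistakes, "<= dim X =", d + 1)
```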
Model 2: Let the spaces of arms and hypotheses be
A := H := S^d := {x∈R^{d+1} ∣ ∥x∥=1}
Let the reward r∈R be the only observable outcome, and let the semantics of hypothesis h∈S^d be r≥h⋅a. Then, the sample complexity cannot be bounded by a polynomial whose degree doesn’t depend on d. This is because Murphy can choose the strategy of producing reward 1−ϵ whenever h⋅a≤1−ϵ. In this case, whatever arm you sample, in each round you can only exclude a ball of radius ≈√(2ϵ) around the sampled arm. The number of such balls that fit into the unit sphere is Ω(ϵ^{−d/2}). So, normalized regret below ϵ cannot be guaranteed in fewer than that many rounds.
One of the postulates of infra-Bayesianism is the maximin decision rule. Given a crisp infradistribution Θ, it defines the optimal action to be:
a∗(Θ) := argmax_a min_{μ∈Θ} Eμ[U(a)]
Here U is the utility function.
What if we use a different decision rule? Let t∈[0,1] and consider the decision rule
a∗t(Θ) := argmax_a (t⋅min_{μ∈Θ} Eμ[U(a)] + (1−t)⋅max_{μ∈Θ} Eμ[U(a)])
For t=1 we get the usual maximin (“pessimism”), for t=0 we get maximax (“optimism”), and for other values of t we get something in the middle (which we can call “t-mism”).
It turns out that, in some sense, this new decision rule is actually reducible to ordinary maximin! Indeed, set
μ∗t := argmax_{μ∈Θ} Eμ[U(a∗t)]
Θt := t⋅Θ + (1−t)⋅μ∗t
Then we get
a∗(Θt) = a∗t(Θ)
More precisely, any pessimistically optimal action for Θt is t-mistically optimal for Θ (the converse need not be true in general, thanks to the arbitrary choice involved in μ∗t).
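A quick numerical sanity check of this reduction (toy data of my own; the credal set is given by finitely many extreme points, so the min/max of a linear functional over it can be taken over that finite list):

```python
import numpy as np

U = np.array([[1.0, 0.0, 0.5],        # U[a, outcome]: utilities of 3 actions over 3 outcomes
              [0.4, 0.9, 0.2],
              [0.6, 0.6, 0.3]])
Theta = [np.array([0.7, 0.2, 0.1]),   # extreme points of the credal set
         np.array([0.1, 0.6, 0.3]),
         np.array([0.3, 0.3, 0.4])]
t = 0.3
actions = range(len(U))

def EU(mu, a):
    return float(U[a] @ mu)

def t_mistic_value(a):                # the t-mism objective applied to Theta
    vals = [EU(mu, a) for mu in Theta]
    return t * min(vals) + (1 - t) * max(vals)

a_t = max(actions, key=t_mistic_value)

# Build Theta_t = t*Theta + (1-t)*mu_t, where mu_t maximizes E_mu[U(a_t)] over Theta.
mu_t = max(Theta, key=lambda mu: EU(mu, a_t))
Theta_t = [t * mu + (1 - t) * mu_t for mu in Theta]

def maximin_value(a):                 # the ordinary pessimistic objective applied to Theta_t
    return min(EU(mu, a) for mu in Theta_t)

a_pess = max(actions, key=maximin_value)
print("t-mistically optimal for Theta:", a_t, " pessimistically optimal for Theta_t:", a_pess)
# The pessimistic optimum of Theta_t is t-mistically optimal for Theta:
assert abs(t_mistic_value(a_pess) - t_mistic_value(a_t)) < 1e-9
```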
To first approximation it means we don’t need to consider t-mistic agents since they are just special cases of “pessimistic” agents. To second approximation, we need to look at what the transformation of Θ to Θt does to the prior. If we start with a simplicity prior then the result is still a simplicity prior. If U has low description complexity and t is not too small then essentially we get full equivalence between “pessimism” and t-mism. If t is small then we get a strictly “narrower” prior (for t=0 we are back at ordinary Bayesianism). However, if U has high description complexity then we get a rather biased simplicity prior. Maybe the latter sort of prior is worth considering.