Master post for ideas about infra-Bayesian physicalism.
Other relevant posts:
Incorrigibility in IBP
PreDCA alignment protocol
The following are my thoughts on the definition of learning in infra-Bayesian physicalism (IBP), which is also a candidate for the ultimate prescriptive agent desideratum.
In general, learning of hypotheses about the physical universe is not possible because of traps. On the other hand, learning of hypotheses about computable mathematics is possible in the limit of ample computing resources, as long as we can ignore side effects of computations. Moreover, learning computable mathematics implies approximating Bayesian planning w.r.t. the prior about the physical universe. Hence, we focus on this sort of learning.
We consider an agent composed of three modules, which we call the Simulator, the Learner and the Controller. The agent’s history consists of two phases. In the Training phase, the Learner interacts with the Simulator and, at the end, produces a program for the Controller. In the Deployment phase, the Controller runs the program.
Roughly speaking:
The Simulator is a universal computer whose function is performing computational experiments, which we can think of as “thought experiments” or coarse-grained simulations of potential plans. It receives commands from the Learner (which computations to run, which threads to start or stop) and reports the results back to the Learner. We denote the Simulator’s input alphabet by IS and its output alphabet by OS.
The Learner is the machine learning (training) module. The algorithm whose desideratum we’re specifying resides here.
The Controller (as in “control theory”) is a universal computer connected to the agent’s external interface (environmental actions A and observations O). It’s responsible for real-time interaction with the environment, and we can think of it as the learned policy. It is programmed by the Learner, for which purpose it has input alphabet IC.
We will refer to this as the SiLC architecture.
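To make the module boundaries concrete, here is a minimal Python sketch of the SiLC loop as described above. The class and method names, the packaging of the Controller’s program as the sequence of IC symbols emitted during Training, and the environment interface are illustrative assumptions rather than part of the formal definition.

```python
from typing import Callable, List, Protocol, Tuple

# Illustrative type aliases (assumptions): the alphabets are left abstract.
SimCommand = str    # element of I_S
CtrlInput = str     # element of I_C
SimReport = str     # element of O_S
Action = str        # element of A
Observation = str   # element of O


class Simulator(Protocol):
    """Universal computer for computational experiments ("thought experiments")."""
    def step(self, command: SimCommand) -> SimReport: ...


class Learner(Protocol):
    """The training module; the algorithm whose desideratum we are specifying."""
    def query(self, history: List[Tuple[SimCommand, CtrlInput, SimReport]]
              ) -> Tuple[SimCommand, CtrlInput]: ...


class Controller(Protocol):
    """Universal computer wired to the external interface (actions A, observations O)."""
    def load(self, program: List[CtrlInput]) -> None: ...
    def act(self, observation: Observation) -> Action: ...


def run_silc(learner: Learner, simulator: Simulator, controller: Controller,
             tau: int, environment: Callable[[Action], Observation],
             deployment_steps: int) -> None:
    # Training phase: the Learner interacts with the Simulator for tau steps,
    # accumulating the program that is finally handed to the Controller.
    history: List[Tuple[SimCommand, CtrlInput, SimReport]] = []
    program: List[CtrlInput] = []
    for _ in range(tau):
        command, ctrl_input = learner.query(history)
        report = simulator.step(command)
        history.append((command, ctrl_input, report))
        program.append(ctrl_input)

    # Deployment phase: the Controller runs the program against the environment.
    controller.load(program)
    observation = environment("")  # initial observation (placeholder convention)
    for _ in range(deployment_steps):
        action = controller.act(observation)
        observation = environment(action)
```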
Let H⊆□Γ be our hypothesis class about computable mathematics. Let ξ:Γ→□2Γ be our prior about the physical universe[1]. These have to satisfy the coherence conditions
∀y∈Γ, α∈supp ξ(y): y∈α

∀y,y′∈Γ, θ∈ξ(y): χy′∈α θ ∈ ξ(y′)

∀Θ∈H, θ∈Θ, s:Γ→Γ, ϕ:Γ→ΔcΓ: ϕ⪯ξ ⟹ prΓ χelΓ (s×id2Γ)∗(θ⋉ϕ) ∈ Θ

Here, ϕ⪯ξ means that ∀y∈Γ: ϕ(y)∈ξ(y).
Together, these ensure that Θ⋉ξ is a coherent IBP hypothesis. Notice that for any ξ0:Γ→□2Γ satisfying the first condition[2], there is a unique minimal coherent ξ:Γ→□2Γ s.t. ∀y∈Γ:ξ0(y)⊆ξ(y). Moreover, given a coherent ξ and any Θ0∈□Γ, there is a unique minimal coherent Θ∈□Γ s.t. Θ0⊆Θ.
The duration of the Training phase will be denoted by τ∈N[3]. We can think of it as “computational time”.
Let the source codes of the Learner (obtained by quining), the Simulator and the Controller respectively be denoted by
DL:Γ×N×(IS×OS)∗→ΔQ(IS×IC)

DS:Γ×I+S→OS

DC:Γ×I∗C×O∗→A

Here, the N argument of DL corresponds to τ, and ΔQ denotes probability distributions in which all probabilities are rational numbers[4].
We assume that the simulator can indeed run any computation, and that any given halting computation would run fast for τ≫0. These are assumptions on DS (or, on some combination of (i) DS, (ii) the definition of Γ, and (iii) the support of all Θ∈H) that we will not spell out here.
We will say that a policy is a mapping of type (IS×IC×OS)∗→ΔQ(IS×IC) and a metapolicy is a mapping of type N×(IS×IC×OS)∗→ΔQ(IS×IC).
Given any D′L:Γ×N×(IS×IC×OS)∗→ΔQ(IS×IC), we can compose it with DS and DC in the obvious way[5] to yield
DS⊗D′L⊗DC:Γ×N×(A×O)∗→ΔQA

In particular, we can take D′L=μ for some metapolicy μ by postulating no dependence on the Γ argument.
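As a sketch of what “the obvious way”[5] of composing these maps looks like, here is a hedged, deterministic Python rendition. Randomization and the N argument of D′L are suppressed for brevity, and the representation of histories as Python lists is an assumption; this is an illustration, not the formal construction.

```python
from typing import Callable, List, Tuple

# Illustrative aliases (assumptions): elements of Gamma and of the alphabets are opaque strings.
Y = str
IS, IC, OS, A, O = str, str, str, str, str

# Deterministic stand-ins for the source codes (randomization and the N argument suppressed):
DLPrime = Callable[[Y, List[Tuple[IS, IC, OS]]], Tuple[IS, IC]]  # ~ D_L'
DSim = Callable[[Y, List[IS]], OS]                               # ~ D_S : Gamma x I_S^+ -> O_S
DCtrl = Callable[[Y, List[IC], List[O]], A]                      # ~ D_C : Gamma x I_C^* x O^* -> A


def compose(ds: DSim, dl: DLPrime, dc: DCtrl, y: Y, tau: int
            ) -> Callable[[List[Tuple[A, O]]], A]:
    """Sketch of D_S (x) D_L' (x) D_C at (y, tau): the Learner interacts with the
    Simulator for tau timesteps, producing g in I_C^tau, and the Controller is
    then run with g as an input (footnote [5])."""
    history: List[Tuple[IS, IC, OS]] = []
    sim_inputs: List[IS] = []
    g: List[IC] = []
    for _ in range(tau):
        s_cmd, c_input = dl(y, history)
        sim_inputs.append(s_cmd)
        report = ds(y, sim_inputs)          # the Simulator sees all its inputs so far
        history.append((s_cmd, c_input, report))
        g.append(c_input)

    def induced_policy(external_history: List[Tuple[A, O]]) -> A:
        observations = [o for (_a, o) in external_history]
        return dc(y, g, observations)       # the Controller runs the program g

    return induced_policy
```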
Denote by P the set of all policies. Given metapolicy μ and π∈P, we define μπ:[τ+1]×(IS×IC×OS)∗→ΔQ(IS×IC) by
μπ(k,h) := μ(k,h) if k<τ, and π(h) if k=τ

Given any ν:[τ+1]×(IS×IC×OS)∗→ΔQ(IS×IC), we say that y∈Γ is a ντ-consistent counterpossible when the following conditions hold:
For all k<τ and h∈(IS×IC×OS)<k, DL(y,k,h)=ν(k,h)
For all k≤τ and h∈(A×O)∗, (DS⊗DL⊗DC)(y,k,h)=(DS⊗ν⊗DC)(y,k,h)
We denote by Cτν⊆Γ the set of ντ-consistent counterpossibles.
A (deterministic) copolicy is a mapping of signature I+S→OS. We denote the set of copolicies by C. Given a policy π and a copolicy ν, we define π^⊗ν∈Δ(IS×IC×OS)∗ in the obvious way. Given policies π1,π2, we define their total variation distance[6] to be
dTV(π1,π2):=maxν∈C dTV(π1^⊗ν, π2^⊗ν)

Given Ξ∈□(Γ×2Γ), f:Γ×2Γ→[0,∞), τ∈N and metapolicy μ, we will use the notation
EΞ[f∣∣μτ]:=minπ∈P(EΞ[f⋅χCτμπ]+dTV(π,μ(τ)))

Intuitively, EΞ[f∣∣μτ] should be thought of as the counterfactual expectation of the loss function f assuming metapolicy μ, while adding a “bonus” to account for “fair” treatment of randomization by the agent. More on that below.
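To unpack these definitions on a toy scale, here is a brute-force Python sketch of dTV between two policies, maximizing over copolicies restricted to a finite horizon. The binary alphabets, the horizon truncation, the use of floats in place of rational probabilities, and the convention that a copolicy responds to the sequence of Simulator inputs so far are all assumptions made for the sake of the example; the quantity computed is the distance that appears as the penalty term dTV(π,μ(τ)) in the counterfactual expectation above.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Tiny illustrative alphabets (assumptions): I_S, I_C, O_S each binary.
IS_ALPHA = ("s0", "s1")
IC_ALPHA = ("c0", "c1")
OS_ALPHA = ("o0", "o1")

History = Tuple[Tuple[str, str, str], ...]                    # element of (I_S x I_C x O_S)*
Policy = Callable[[History], Dict[Tuple[str, str], float]]    # history -> distribution on I_S x I_C
Copolicy = Dict[Tuple[str, ...], str]                         # nonempty I_S-sequence -> O_S


def history_distribution(pi: Policy, nu: Copolicy, horizon: int) -> Dict[History, float]:
    """Distribution over length-`horizon` histories induced by pi interacting with nu."""
    dist: Dict[History, float] = {(): 1.0}
    for _ in range(horizon):
        new_dist: Dict[History, float] = {}
        for hist, p in dist.items():
            for (i_s, i_c), q in pi(hist).items():
                sim_inputs = tuple(x[0] for x in hist) + (i_s,)
                o_s = nu[sim_inputs]
                new_hist = hist + ((i_s, i_c, o_s),)
                new_dist[new_hist] = new_dist.get(new_hist, 0.0) + p * q
        dist = new_dist
    return dist


def tv(d1: Dict[History, float], d2: Dict[History, float]) -> float:
    keys = set(d1) | set(d2)
    return 0.5 * sum(abs(d1.get(k, 0.0) - d2.get(k, 0.0)) for k in keys)


def d_tv_policies(pi1: Policy, pi2: Policy, horizon: int) -> float:
    """max over (horizon-restricted) copolicies nu of dTV(pi1 (x) nu, pi2 (x) nu)."""
    domains: List[Tuple[str, ...]] = [
        seq for n in range(1, horizon + 1) for seq in product(IS_ALPHA, repeat=n)
    ]
    best = 0.0
    for outputs in product(OS_ALPHA, repeat=len(domains)):    # enumerate copolicies
        nu = dict(zip(domains, outputs))
        best = max(best, tv(history_distribution(pi1, nu, horizon),
                            history_distribution(pi2, nu, horizon)))
    return best


# Example: a uniformly randomizing policy vs. a deterministic one, horizon 1.
uniform = lambda hist: {(i_s, i_c): 0.25 for i_s in IS_ALPHA for i_c in IC_ALPHA}
constant = lambda hist: {("s0", "c0"): 1.0}
print(d_tv_policies(uniform, constant, horizon=1))   # 0.75
```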
Given a metapolicy μ and τ∈N, we define Eτμ⊆2Γ by
Eτμ:={α∈2Γ∣∃π∈P:Cτμπ∩α=∅}

Intuitively, Eτμ is the set of universe states for which at least one copy of the agent exists which followed the metapolicy μ until computational time τ.
Given a loss function l:N×2Γ→[0,1][7] (which we allow to explicitly depend on computational time for greater generality), the learning condition on a metapolicy μ∗ and hypothesis Θ∈H is
∑τ<N Eξ∗Θ[l(τ)∣∣μ∗τ] ≤ ∑τ<N (minπ∈P Eξ∗Θ[l(τ)⋅χCτμ∗π] + Eξ∗Θ[χEτμ∗]⋅ϵ(Θ,N))

Here, ϵ is the “regret bound” function which should vanish in the N→∞ limit.
Some remarks on the particulars of this definition:
There are several reasons we impose (DS⊗DL⊗DC)(y,k,h)=(DS⊗ν⊗DC)(y,k,h) rather than DL(y,k,h)=ν(k,h):
First, we want to ignore the side effects of running computations on the Simulator (both causal side effects and “mindcrime”, i.e. the direct contribution of those computations to l). This is because taking side effects into account is usually inconsistent with the unlimited experimentation needed for learning.
Second, learning requires trusting the reports of the Simulator, which means we should only impose the policy on copies of DL that are actually connected to DS.
Third, we should also be able to trust the Controller, because otherwise we lose the semantic grounding of the agent’s external interface. (Even though this is not necessary for learning per se.)
On the other hand, we impose DL(y,k,h)=ν(k,h) in the computational past because that’s valid additional information that doesn’t interfere with the learning or decision theory.
The learning criterion treats computational time myopically, so that we won’t have to worry about traps in computational time.
The reason we need randomization is that it’s often necessary for efficient learning. In the simplest non-trivial examples, we learn by IID sampling from a distribution over computations (e.g. we simulate the interaction between a particular policy and our physical prior ξ). If we sampled deterministically instead, Murphy would be able to fool us by changing behavior precisely at the sampled instances.
The reason we need EΞ[f∣∣μτ] is that randomization only helps if low-probability events can be ignored. However, if sufficiently many copies of the agent are instantiated, even a low-probability event would be detectable. Hence, we use a “forgiving” metric that assigns low loss even to distributions that technically have high loss but are close to a different distribution with low loss.
We can consider Newcombian problems where Omega makes decisions based on the agent’s action probabilities. I suspect that if Omega’s policy is Lipschitz in the agent’s policy, the behavior advised by the EΞ[f∣∣μτ] counterfactual will converge to FDT-optimal in the limit of sufficiently many iterations.
Both in the case of ignoring side effects of computations and in the case of the treatment of randomization, we can be accused of departing from priorism (“updatelessness”). However, I think that this departure is justified. In the original TDT paper, Yudkowsky addressed the “Omega rewards irrationality” objection by postulating that a decision problem is “fair” when it only depends on the agent’s decisions rather than on how the agent makes those decisions. Here, we use the same principle: the agent should not be judged based on its internal thought process (side effects), and it should in some sense be judged based on its decisions rather than the probabilities assigned to those decisions.
Also regarding priorism, this kind of agent will not endorse iterated-in-computational-time “logical” counterfactual mugging when the same coin is reused indefinitely, but will endorse it when a new coin is used every time, for an appropriate definition of “new” (or e.g. we switch to a new coin every k rounds). Arguably, this resolves the tension between priorism and learning observed by Demski. Formulating the precise criterion for when Learning-IBP behavior converges to priorist/FDT-optimal behavior is left for further research.
The dependence of ϵ(Θ,N) on Θ should ultimately involve some kind of description complexity. However, it will also involve something in the vein of “what are the computational resource bounds, according to the belief Θ, for running certain computations, selected for their importance in testing Θ”. In particular, we won’t require the agent to learn anything about non-halting computations. Indeed, any hypothesis about such computations will either assert a time bound on running the non-halting computations (in which case it is false) or fail to assert any such bound, in which case its learning complexity is known to be infinite.
We could make do without the Eξ∗Θ[χEτμ∗] factor but that would make the learning criterion weaker. The presence of this factor means that, roughly speaking, regret should be low even conditional on the agent existing, which seems like a reasonable demand.
Given an AI designed along these principles, we might worry about the impact of the side effects that are ignored. Maybe these can produce non-Cartesian daemons. However, during the Training phase, the algorithm has no access to external observation, which arguably makes it unlikely anything inside it can learn how to cyberattack. Moreover, during the Deployment phase, any reason for concern would be mediated by the particular algorithm the Controller runs (rather than the particulars of how it’s implemented), which is what we do take into account in our optimization. Finally, the agent might be able to self-modify to make itself safer: we can even intentionally give it the means to do so (as part of its action space A). This probably requires careful prior-shaping to work well.
This framework assumes all our hypotheses are disintegrable w.r.t. the factorization into Γ and 2Γ. It is an interesting question to understand whether we should or can relax this assumption.
For example, we can imagine ξ0 to be a Solomonoff-like prior along the following lines. Every hypothesis comprising ξ0 is defined by a Turing machine M with access to two oracles representing y,y′∈Γ and to two tapes of random and “ambiguous” bits respectively. ξ0(y) is defined by running M with one oracle fielding queries about y (i.e. given a program P we can request its counterpossible output Py) and the other oracle fielding queries about some y′ s.t. we want to decide whether y′∈α for α∼θ∈ξ0(y). M is only allowed to return NO if there was at least one query to which the two oracles gave different answers.
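The constraint on M can be made concrete with a small wrapper. The following Python sketch is purely illustrative and reflects one possible reading of the constraint (a “disagreement” means a program that was queried of both oracles with differing answers); the oracle and machine interfaces, the string-valued answers, and the choice to signal an invalid run with an exception are assumptions, and the random and ambiguous tapes are left abstract.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions): an oracle answers queries about the
# counterpossible output P^y of a program P; M returns "YES" or "NO" (is y' in alpha?).
Oracle = Callable[[str], str]
Machine = Callable[[Oracle, Oracle, List[int], List[int]], str]


def run_prior_machine(m: Machine, oracle_y: Oracle, oracle_y_prime: Oracle,
                      random_bits: List[int], ambiguous_bits: List[int]) -> str:
    """Run M while enforcing that it may only return NO if the two oracles gave
    different answers to at least one query that was actually made to both."""
    answers_y: Dict[str, str] = {}
    answers_y_prime: Dict[str, str] = {}

    def wrapped_y(program: str) -> str:
        answers_y[program] = oracle_y(program)
        return answers_y[program]

    def wrapped_y_prime(program: str) -> str:
        answers_y_prime[program] = oracle_y_prime(program)
        return answers_y_prime[program]

    answer = m(wrapped_y, wrapped_y_prime, random_bits, ambiguous_bits)
    disagreed = any(answers_y[p] != answers_y_prime[p]
                    for p in answers_y.keys() & answers_y_prime.keys())
    if answer == "NO" and not disagreed:
        raise ValueError("M returned NO without observing a disagreement between the oracles")
    return answer
```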
We use the “duration” interpretation for simplicity, but more generally τ can be some parameter controlling the computing resources available in the Training phase, and we can also allow the computing resources of the Controller to scale with τ.
The reason we restrict to rational numbers is because we need a notion of computing the distribution. It is in principle possible to generalize further to computable numbers. On the other hand, it might be more realistic to constrain even further to e.g. dyadic rationals (which can be implemented via fair coinflips). We stick to Q for simplicity.
We let the Learner interact with the Simulator for τ timesteps, producing some output g∈IτC, and then run the Controller with g as an input.
This is not technically a distance, since it is possible to have dTV(π1,π2)=0 with π1≠π2, so long as π1 and π2 only disagree on histories that are inconsistent with these policies. Such π1 and π2 are morally equivalent.
We could also allow l to have a Γ argument, but then we would have to remove the Eξ∗Θ[χEτμ∗] factor from the learning condition, because the choice of policy would matter intrinsically even if the agent doesn’t exist. Alternatively, we could modify the definition of Cτμ to avoid that. Or perhaps use some normalization factor more complicated than Eξ∗Θ[χEτμ∗].
Here is a modification of the IBP framework which removes the monotonicity principle, and seems to be more natural in other ways as well.
First, let our notion of “hypothesis” be Θ∈□c(Γ×2Γ). The previous framework can be interpreted in terms of hypotheses of this form satisfying the condition
prΓ×2ΓBr(Θ)=Θ

(See Proposition 2.8 in the original article.) In the new framework, we replace it by the weaker condition
Br(Θ)⊇(idΓ×diag2Γ)∗Θ

This can be roughly interpreted as requiring that (i) whenever the output of a program P determines whether some other program Q will run, program P has to run as well, and (ii) whenever programs P and Q are logically equivalent, program P runs iff program Q runs.
The new condition seems to be well-justified, and is also invariant under (i) mixing hypotheses (ii) taking joins/meets of hypotheses. The latter was not the case for the old condition. Moreover, it doesn’t imply that Θ is downward closed, and hence there is no longer a monotonicity principle[1].
The next question is, how do we construct hypotheses satisfying this condition? In the old framework, we could construct hypotheses of the form Ξ∈□c(Γ×Φ) and then apply the bridge transform. In particular, this allows a relatively straightforward translation of physics theories into IBP language (for example our treatment of quantum theory). Luckily, there is an analogous construction in the new framework as well.
First notice that our new condition on Θ can be reformulated as requiring that
suppΘ⊆elΓ
For any s:Γ→Γ define τs:ΔcelΓ→ΔcelΓ by τsθ:=χelΓ(s×id2Γ)∗θ. Then, we require τsΘ⊆Θ.
For any Φ, we also define τΦs:Δc(elΓ×Φ)→Δc(elΓ×Φ) by
τΦsθ:=χelΓ×Φ(s×id2Γ×Φ)∗θ

Now, for any Ξ∈□c(Γ×Φ), we define the “conservative bridge transform[2]” CBr(Ξ)∈□c(Γ×2Γ×Φ) as the closure of all τΦsθ where θ is a maximal element of Br(Ξ). It is then possible to see that Θ∈□c(Γ×2Γ) is a valid hypothesis if and only if it is of the form prΓ×2ΓCBr(Ξ) for some Φ and Ξ∈□c(Γ×Φ).
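As a concrete illustration of the operator τs, here is a toy Python sketch over a small finite Γ (the particular Γ, s and measure are arbitrary assumptions). Note that the result is in general subnormalized: mass that the pushforward moves outside elΓ is simply discarded.

```python
from typing import Callable, Dict, FrozenSet, Tuple

ElGamma = Tuple[int, FrozenSet[int]]   # pairs (y, alpha) with y in alpha
Measure = Dict[ElGamma, float]         # a finitely-supported measure on elGamma


def tau_s(s: Callable[[int], int], theta: Measure) -> Measure:
    """tau_s(theta) = chi_elGamma (s x id)_* theta: push theta forward along
    (y, alpha) -> (s(y), alpha), then keep only the mass that still lies in elGamma."""
    result: Measure = {}
    for (y, alpha), p in theta.items():
        if s(y) in alpha:                              # restriction to elGamma
            image = (s(y), alpha)
            result[image] = result.get(image, 0.0) + p
    return result


# Toy example over Gamma = {0, 1, 2} with s(y) = min(y, 1):
theta: Measure = {(0, frozenset({0, 1})): 0.5, (2, frozenset({2})): 0.5}
print(tau_s(lambda y: min(y, 1), theta))
# {(0, frozenset({0, 1})): 0.5} -- the atom at (2, {2}) maps to (1, {2}), leaves elGamma, and is dropped
```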
I still think the monotonicity principle is saying something about the learning theory of IBP which remains true in the new framework. Namely, it is possible to learn that a program is running but not possible to (confidently) learn that a program is not running, and this limits the sort of frequentist guarantees we can expect.
Intuitively, it can be interpreted as a version of the bridge transform where we postulate that a program doesn’t run unless Ξ contains a reason why it must run.
Two thoughts about the role of quining in IBP:
Quines are non-unique (there can be multiple fixed points). This means that, viewed as a prescriptive theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai’s 3-player Prisoner’s Dilemma and the anti-Newcomb problem[1]. In these cases, a particular UDT/IBP (corresponding to a particular quine) loses to CDT. But a different UDT/IBP (corresponding to a different quine) might do as well as CDT.
What to do about agents that don’t know their own source-code? (Arguably humans are such.) Upon reflection, this is not really an issue! If we use IBP prescriptively, then we can always assume quining: IBP is just telling you to follow a procedure that uses quining to access its own (i.e. the procedure’s) source code. Effectively, you are instantiating an IBP agent inside yourself with your own prior and utility function. On the other hand, if we use IBP descriptively, then we don’t need quining: Any agent can be assigned “physicalist intelligence” (Definition 1.6 in the original post, can also be extended to not require a known utility function and prior, along the lines of ADAM) as long as the procedure doing the assigning knows its source code. The agent doesn’t need to know its own source code in any sense.
@Squark is my own old LessWrong account.
Physicalist agents see themselves as inhabiting an unprivileged position within the universe. However, it’s unclear whether humans should be regarded as such agents. Indeed, monotonicity is highly counterintuitive for humans. Moreover, historically human civilization struggled a lot with accepting the Copernican principle (and is still confused about issues such as free will, anthropics and quantum physics which physicalist agents shouldn’t be confused about). This presents a problem for superimitation.
What if humans are actually cartesian agents? Then, it makes sense to consider a variant of physicalist superimitation where instead of just seeing itself as unprivileged, the AI sees the user as a privileged agent. We call such agents “transcartesian”. Here is how this can be formalized as a modification of IBP.
In IBP, a hypothesis is specified by choosing the state space Φ and the belief Θ∈□(Γ×Φ). In the transcartesian framework, we require that a hypothesis is augmented by a mapping τ:Φ→(A0×O0)≤ω, where A0 is the action set of the reference agent (user) and O0 is the observation set of the reference agent. Given G0 the source code of the reference agent, we require that Θ is supported on the set
{(y,x)∈Γ×Φ ∣ ha⊑τ(x)⟹a=Gy0(h)}

That is, the actions of the reference agent are indeed computed by the source code of the reference agent.
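For intuition, the support condition can be spelled out as a simple check on finite histories. In the Python sketch below, the representation of the reference agent’s history as a list of (action, observation) pairs and the deterministic stand-in for G0 evaluated relative to y are assumptions.

```python
from typing import Callable, List, Tuple

Action, Obs = str, str
RefHistory = List[Tuple[Action, Obs]]          # a finite prefix of tau(x) in (A_0 x O_0)*

# Hypothetical stand-in for the reference agent's source code G_0, evaluated
# relative to a computational universe state y (deterministic for simplicity).
G0 = Callable[[str, RefHistory], Action]


def satisfies_support_condition(y: str, ref_history: RefHistory, g0: G0) -> bool:
    """Check that h a <= tau(x) implies a = G_0^y(h): every action occurring in the
    reference agent's history is the one its source code outputs on the history so far."""
    for t, (action, _obs) in enumerate(ref_history):
        if g0(y, ref_history[:t]) != action:
            return False
    return True
```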
Now, instead of using a loss function of the form L:elΓ→R, we can use a loss function of the form L:(A0×O0)≤ω→R which doesn’t have to satisfy any monotonicity constraint. (More generally, we can consider hybrid loss functions of the form L:(A0×O0)≤ω×elΓ→R monotonic in the second argument.) This can also be generalized to reference agents with hidden rewards.
As opposed to physicalist agents, transcartesian agents do suffer from penalties associated with the description complexity of bridge rules (for the reference agent). Such an agent can (for example) come to believe in a simulation hypothesis that is unlikely from a physicalist perspective. However, since such a simulation hypothesis would be compelling for the reference agent as well, this is not an alignment problem (epistemic alignment is maintained).
Up to light editing, the following was written by me during the “Finding the Right Abstractions for healthy systems” research workshop, hosted by Topos Institute in January 2023. However, I had invented the idea earlier.
In order to allow R (the set of programs) to be infinite in IBP, we need to define the bridge transform for infinite Γ. At first, it might seem Γ can be allowed to be any compact Polish space, and the bridge transform should only depend on the topology on Γ, but that runs into problems. Instead, the right structure on Γ for defining the bridge transform seems to be that of a “profinite field space”: a category I came up with that I haven’t seen in the literature so far.
The category PFS of profinite field spaces is defined as follows. An object F of PFS is a set ind(F) and a family of finite sets {Fα}α∈ind(F). We denote Tot(F):=∏αFα. Given F and G objects of PFS, a morphism from F to G is a mapping f:Tot(F)→Tot(G) such that there exists R⊆ind(F)×ind(G) with the following properties:
For any α∈ind(F), the set R(α):={β∈ind(G)∣(α,β)∈R} is finite.
For any β∈ind(G), the set R−1(β):={α∈ind(F)∣(α,β)∈R} is finite.
For any β∈ind(G), there exists a mapping fβ:∏α∈R−1(β)Fα→Gβ s.t. for any x∈Tot(F), f(x)β:=fβ(prRβ(x)) where prRβ:Tot(F)→∏α∈R−1(β)Fα is the projection mapping.
The composition of PFS morphisms is just the composition of mappings.
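For objects with finite ind, the morphism condition can be verified by brute force. The Python sketch below is an illustration under that finiteness assumption; the encoding of an object as a dict from indices to finite sets, and of a point of Tot(F) as a dict, are arbitrary representational choices.

```python
from itertools import product
from typing import Callable, Dict, List, Set, Tuple

FieldSpace = Dict[str, Tuple[str, ...]]   # ind(F) -> the finite set F_alpha
Point = Dict[str, str]                    # an element of Tot(F): index -> value


def total_space(space: FieldSpace) -> List[Point]:
    indices = sorted(space)
    return [dict(zip(indices, values))
            for values in product(*(space[i] for i in indices))]


def is_pfs_morphism(f_space: FieldSpace, g_space: FieldSpace,
                    f: Callable[[Point], Point],
                    relation: Set[Tuple[str, str]]) -> bool:
    """Check that the candidate relation R witnesses f as a PFS morphism:
    for every beta in ind(G), the beta-coordinate of f(x) depends only on the
    coordinates of x indexed by R^{-1}(beta). (Finiteness of R(alpha) and
    R^{-1}(beta) is automatic here because ind(F) and ind(G) are finite.)"""
    points = total_space(f_space)
    for beta in g_space:
        preimage = sorted(alpha for (alpha, b) in relation if b == beta)
        # f(x)_beta must factor through the projection onto prod_{alpha in R^{-1}(beta)} F_alpha.
        seen: Dict[Tuple[str, ...], str] = {}
        for x in points:
            key = tuple(x[alpha] for alpha in preimage)
            value = f(x)[beta]
            if seen.setdefault(key, value) != value:
                return False
    return True
```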
It is easy to see that every PFS morphism is a continuous mapping in the product topology, but the converse is false. However, the converse is true for objects with finite ind (i.e. for such objects any mapping is a morphism). Hence, an object F in PFS can be thought of as Tot(F) equipped with additional structure that is stronger than the topology but weaker than the factorization into Fα.
The name “field space” is inspired by the following observation. Given F an object of PFS, there is a natural condition we can impose on a Borel probability distribution on Tot(F) which makes it a “Markov random field” (MRF). Specifically, μ∈ΔTot(F) is called an MRF if there is an undirected graph G whose vertices are ind(F) and in which every vertex is of finite degree, s.t. μ is an MRF on G in the obvious sense. The property of being an MRF is preserved under pushforwards w.r.t. PFS morphisms.