Thanks Tamsin! Okay, round 2.
My current understanding of QACI:
1. We assume a set Ω of hypotheses about the world. We assume the oracle’s beliefs are given by a probability distribution μ∈ΔΩ.
2. We assume sets Q and A of possible queries and answers respectively. Maybe these are exabyte files, i.e. Q≅A≅{0,1}^N for N=2^60.
3. Let Φ be the set of mathematical formulae that Joe might submit. These formulae are given semantics eval(ϕ):Ω×Q→ΔA for each formula ϕ∈Φ.[1]
4. We assume a function H:Ω×Q→ΔΦ, where H(α,q)(ϕ)∈[0,1] is the probability that Joe submits formula ϕ after reading query q, under hypothesis α.[2]
5. We define QACI:Ω×Q→ΔA as follows: sample ϕ∼H(α,q), then sample a∼eval(ϕ)(α,q), then return a.
6. For a fixed hypothesis α, we can interpret the answer a∼QACI(α,"Best utility function?") as a utility function u_α:Π→R via some semantics eval-u:A→(Π→R).
7. Then we define u:Π→R by integrating over μ, i.e. u(π):=∫u_α(π)dμ(α).
8. A policy π*∈Π is optimal if and only if π*∈argmax_Π(u).
9. The hope is that μ, eval, eval-u, and H can be defined mathematically. Then the optimality condition can be defined mathematically.
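To check that I’m parsing steps 1–8 correctly, here’s a minimal Python sketch of the pipeline. Everything concrete in it (the two hypotheses, the formula language, eval, eval-u, the utilities) is a placeholder I’ve invented to illustrate the structure, not anything from your post.

```python
import random

# Toy stand-ins for the objects in steps 1-8. All specifics are invented.
HYPOTHESES = {"alpha1": 0.95, "alpha2": 0.05}   # μ ∈ ΔΩ over a two-element Ω
POLICIES = ["pi_safe", "pi_risky"]              # Π

def H(alpha, query):
    """Step 4: distribution over the formulae Joe might submit, given α and q."""
    return {"phi_good": 0.9, "phi_weird": 0.1}

def eval_phi(phi, alpha, query):
    """Step 3: semantics of a formula -- here just a canned answer string."""
    return f"answer({phi},{alpha})"

def qaci(alpha, query, rng=random):
    """Step 5: sample ϕ ~ H(α,q), then return an answer a ~ eval(ϕ)(α,q)."""
    formulae, weights = zip(*H(alpha, query).items())
    phi = rng.choices(formulae, weights=weights)[0]
    return eval_phi(phi, alpha, query)

def eval_u(answer):
    """Step 6: interpret an answer as a utility function u_α over policies."""
    return lambda pi: 1.0 if pi == "pi_safe" and "alpha1" in answer else 0.5

def u(pi):
    """Step 7: u(π) = ∫ u_α(π) dμ(α). Since qaci is stochastic, this is really
    one Monte Carlo draw of u_α per hypothesis rather than an exact integral."""
    return sum(weight * eval_u(qaci(alpha, "Best utility function?"))(pi)
               for alpha, weight in HYPOTHESES.items())

# Step 8: the optimality condition, as a brute-force argmax over the toy Π.
pi_star = max(POLICIES, key=u)
print(pi_star)   # "pi_safe" in this toy setup
```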
Question 0
What if there’s no policy which maximises u:Π→R? That is, what if for every policy π there is another policy π′ such that u(π′)>u(π)? I suppose this is less worrying, but: what if there are multiple policies which maximise u?
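To pin down what I mean, here’s a toy illustration of both failure modes (the policy spaces and utilities are entirely made up):

```python
# (a) No maximiser: with an infinite policy space and unbounded u,
#     every policy π is beaten by some π′, so argmax_Π(u) is empty.
#     Schematically, Π = {0, 1, 2, ...} and u(π) = π.
u_unbounded = lambda pi: pi
assert all(u_unbounded(pi + 1) > u_unbounded(pi) for pi in range(1000))

# (b) Multiple maximisers: argmax_Π(u) is a set, so the optimality
#     condition by itself doesn't pin down which policy gets run.
policies = ["pi_a", "pi_b", "pi_c"]
u_flat = lambda pi: 0.0
argmax_set = [pi for pi in policies
              if all(u_flat(pi) >= u_flat(other) for other in policies)]
print(argmax_set)   # ['pi_a', 'pi_b', 'pi_c'] -- three "optimal" policies
```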
Question 1
In Step 7 above, you average all the utility functions together, whereas I suggested sampling a utility function. I think my solution might be safer.
Suppose the oracle puts 5% probability on hypotheses α such that QACI(α,−) is malign. I think this is pretty conservative, because the Solomonoff predictor is malign, because of some of the concerns Evhub raises here, and because the QACI amplification might not preserve benignity. It follows that, under your solution, u:Π→R is influenced by a coalition of malign agents, and likewise π*∈argmax_Π(u) is influenced by the malign coalition.
By contrast, I suggest sampling α∼μ and then finding π*∈argmax_Π(u_α). This should give us a benign policy with 95% probability, which is pretty good odds. Is this safer? Not sure.
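Here’s a toy numerical version of the worry, reusing the 5% figure from above (the specific utilities are invented, and chosen adversarially to make the point):

```python
import random

# Hypothetical setup: 95% of μ's mass is on benign hypotheses, 5% on malign ones.
# u_benign is what the benign QACI answers endorse; u_malign is what the malign
# coalition wants. The numbers are made up purely to illustrate the worry.
MU = {"benign": 0.95, "malign": 0.05}
POLICIES = ["pi_good", "pi_trap"]

U_ALPHA = {
    "benign": {"pi_good": 1.00, "pi_trap": 0.99},   # benign u barely prefers pi_good
    "malign": {"pi_good": 0.00, "pi_trap": 1.00},   # malign u strongly prefers pi_trap
}

# Option A (your step 7): average u_α over μ, then argmax.
u_mixed = {pi: sum(w * U_ALPHA[a][pi] for a, w in MU.items()) for pi in POLICIES}
print(max(POLICIES, key=u_mixed.get))   # pi_trap -- the 5% slice swings the argmax

# Option B (my suggestion): sample one α ~ μ, then argmax u_α.
alpha = random.choices(list(MU), weights=MU.values())[0]
print(max(POLICIES, key=U_ALPHA[alpha].get))   # pi_good with 95% probability
```

With these numbers, Option A picks pi_trap deterministically, while Option B picks pi_good with 95% probability; that asymmetry is the thing I’m pointing at.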
Question 2
I think the eval function doesn’t work, i.e. there won’t be a way to mathematically define the semantics of the formula language. In particular, the language Φ must be strictly weaker than the meta-language in which you are hoping to define eval:Φ→(Ω×Q→ΔA) itself. This is because of Tarski’s Undefinability of Truth (and other no-go theorems).
This might seem pedantic, but in practical terms: there’s no formula ϕ whose semantics is QACI itself. You can see this via a diagonal proof: imagine that Joe always writes the formal expression ϕ = "1−QACI(α,q)". Then, under that hypothesis, QACI(α,q) = eval(ϕ)(α,q) = 1−QACI(α,q), which is contradictory.
The most elegant solution is probably transfinite induction, but this would give us a QACI for each ordinal.
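To gesture at what I mean by the stratified version, here’s a sketch with natural-number levels standing in for ordinals; the particular H and eval stand-ins are mine, and the only point is the indexing:

```python
# A stratified hierarchy QACI_0, QACI_1, ... in which a level-n formula may call
# QACI_{n-1} but never QACI_n itself, sidestepping the diagonal construction.
# Everything concrete (the formula language, H, eval) is a placeholder.

def eval_at_level(phi, alpha, q, n):
    """Semantics for level-n formulae: they may invoke qaci(n - 1) but not qaci(n)."""
    if phi == "call_lower_qaci" and n > 0:
        return qaci(alpha, q, n - 1)
    return f"base_answer({phi},{alpha},{q})"

def joes_formula(alpha, q, n):
    """Stand-in for H: whatever formula Joe submits lives in the level-n language."""
    return "call_lower_qaci" if n > 0 else "base_formula"

def qaci(alpha, q, n):
    """QACI_n: defined in a metalanguage strictly stronger than the level-n formula language."""
    phi = joes_formula(alpha, q, n)
    return eval_at_level(phi, alpha, q, n)

print(qaci("alpha1", "Best utility function?", 2))
```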
Question 3
"If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions"
I want to understand how QACI and prosaic ML map onto each other. As far as I can tell, issues with QACI will be analogous to issues with prosaic ML, and vice versa.
Question 4
I still don’t understand why we’re using QACI to describe a utility function over policies, rather than using QACI in a more direct approach.
Here’s one approach. We pick a policy which maximises QACI(α,"How good is policy π?").[3] The advantage here is that Joe doesn’t need to reason about utility functions over policies; he just needs to reason about a single policy in front of him.
Here’s another approach. We use QACI as our policy directly. That is, in each context c that the agent finds themselves in, they sample an action from QACI(α,"What is the best action in context c?") and take the resulting action.[4] The advantage here is that Joe doesn’t need to reason about policies whatsoever; he just needs to reason about a single context in front of him. This is also the most “human-like” option, because there are no argmaxes (except if Joe submits a formula containing an argmax).
Here’s another approach. In each context c, the agent takes an action y which maximises QACI(α,"How good is action y in context c?").
Etc.
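For concreteness, here’s how I’m imagining the three variants, in the same sort of toy pseudocode as above (the qaci stub, the score helper, and the finite policy/action spaces are all invented):

```python
import random

# Toy placeholders, all invented for illustration: a finite policy/action space,
# a qaci(alpha, query) oracle that returns an answer string, and a crude way to
# read a numeric score out of an answer when one is needed.
MU = {"alpha1": 0.95, "alpha2": 0.05}
POLICIES = ["pi_a", "pi_b"]
ACTIONS = ["act_1", "act_2"]

def qaci(alpha, query):
    """Stand-in for QACI(α, query): returns some answer string."""
    return f"answer[{hash((alpha, query)) % 100}]"

def score(answer):
    """Pretend the answer encodes a number; pull it out (crudely)."""
    return int(answer.strip("answer[]"))

alpha = random.choices(list(MU), weights=MU.values())[0]   # sample α ~ μ once

# Approach 1: score whole policies (footnote [3], option (1)).
best_policy = max(POLICIES,
                  key=lambda pi: score(qaci(alpha, f"How good is policy {pi}?")))

# Approach 2: QACI *is* the policy -- the answer in context c is the action taken,
# with no argmax anywhere.
def act(context):
    return qaci(alpha, f"What is the best action in context {context}?")

# Approach 3: per-context argmax over actions.
def act_argmax(context):
    return max(ACTIONS,
               key=lambda y: score(qaci(alpha, f"How good is action {y} in context {context}?")))

print(best_policy, act("c0"), act_argmax("c0"))
```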
Happy to jump on a call if that’s easier.
[1] I think you would say eval(ϕ):Ω×Q→A. I’ve added the Δ, which simply amounts to giving Joe access to a random number generator. But my remarks apply if eval(ϕ):Ω×Q→A also.
[2] I think you would say H:Ω×Q→Φ. I’ve added the Δ, which simply amounts to including hypotheses in which Joe is stochastic. But my remarks apply if H:Ω×Q→Φ also.
[3] By this I mean either:
(1) Sample α∼μ, then maximise the function π↦QACI(α,"How good is policy π?").
(2) Maximise the function π↦∫QACI(α,"How good is policy π?")dμ(α).
For reasons I mentioned in Question 1, I suspect (1) is safer, but (2) is closer to your original approach.
[4] I would prefer the agent samples α∼μ once at the start of deployment, and reuses the same hypothesis α at each time-step. I suspect this is safer than resampling α at each time-step, for reasons discussed before.