(oops, this ended up being fairly long-winded! hope you don’t mind. feel free to ask for further clarifications.)
There’s a bunch of things wrong with your description, so I’ll first try to rewrite it in my own words, but still as close to the way you wrote it (so as to try to bridge the gap to your ontology) as possible. Note that I might post QACI 2 somewhat soon, which simplifies a bunch of QACI by locating the user as {whatever is interacting with the computer the AI is running on} rather than by using a beacon.
A first pass is to correct your description to the following:
We find a competent honourable human at a particular point in time, H, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 1GB secret key, large enough that in counterfactuals it could be replaced with an entire snapshot of the question. We also give them the ability to express a 1GB output, e.g. by writing a 1GB key somewhere which is somehow “signed” as the only answer. This is part of H — H is not just the human being queried at a particular point in time, it’s also the human producing an answer in some way. So H is a function from 1GB bitstring to 1GB bitstring. We define H+ as H, followed by whichever new process H describes in its output — typically another instance of H except with a different 1GB payload.
We want a model M of the agent H+. In QACI, we get M by asking a Solomonoff-like ideal reasoner for their best guess about H+ after feeding them a bunch of data about the world and the secret key.
We then ask M the question q, “What’s the best utility-function-over-policies to maximise?” to get a utility function U:(O×A)∗→R. We then ask our Solomonoff-like ideal reasoner for their best guess about which action A maximizes U.
Indeed, as you ask in question 3, in this description there’s not really a reason to make step 3 an extra thing. The important thing to notice here is that model M might get pretty good, but it’ll still have uncertainty.
When you say “we get M by asking a Solomonoff-like ideal reasoner for their best guess about H+”, you’re implying that — positing U(M,A) to be the function that says how much utility the utility function returned by model M attributes to action A (in the current history-so-far) — we do something like:
let M ← oracle(argmax { for model M } 𝔼 { over uncertainty } P(M))
let A ← oracle(argmax { for action A } U(M, A))
perform(A)
Indeed, in this scenario, the second line is fairly redundant.
The reason we ask for a utility function is because we want to get a utility function within the counterfactual — we don’t want to collapse the uncertainty with an argmax before extracting a utility function, but after. That way, we can do expected-given-uncertainty utility maximization over the full distribution of model-hypotheses, rather than over our best guess about M. We do:
let A ← oracle(argmax { for A } 𝔼 { for M, over uncertainty } P(M) · U(M, A))
perform(A)
That is, we ask our ideal reasoner (oracle) for the action with the best utility given uncertainty — not just logical uncertainty, but also uncertainty about which M. This contrasts with what you describe, in which we first pick the most probable M and then calculate the action with the best utility according only to that most-probable pick.
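To make that contrast concrete, here is a minimal sketch of the two decision rules (all of models, prior, utility, and actions are placeholder names of mine, the plain max calls stand in for the oracle, and utility(m, a) plays the role of U(M,A) above):

def map_then_argmax(models, prior, utility, actions):
    # The rule the quoted description implies: first collapse the uncertainty to the
    # single most probable model, then pick that model's favourite action.
    best_model = max(models, key=lambda m: prior[m])
    return max(actions, key=lambda a: utility(best_model, a))

def expected_utility_argmax(models, prior, utility, actions):
    # The rule QACI intends: keep the whole distribution over model-hypotheses and
    # pick the action whose expected utility under that distribution is highest.
    return max(actions, key=lambda a: sum(prior[m] * utility(m, a) for m in models))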
To answer the rest of your questions:
Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?
Unclear! I’m not familiar enough with IDA, and I’ve bounced off explanations for it I’ve seen in the past. QACI doesn’t feel to me like it particularly involves the concepts of distillation or amplification, but I guess it does involve the concept of iteration, sure. But I don’t get the thing called IDA.
Why not replace Step 1 with Strong HCH or some other amplification scheme?
It’s unclear to me how one would design an amplification scheme — see concerns of the general shape expressed here. The thing I like about my step 1 is that the setup of the QACI loop (well, really, graph (well, really, arbitrary computation, but most of the time the user will probably just call themself in sequence)) doesn’t involve any AI at all — you could go back in time before the industrial revolution and explain the core QACI idea and it would make sense assuming time-travelling-messages magic, and the magic wouldn’t have to do any extrapolating. Just tell someone the idea is that they could send a message to {their past self at a particular fixed point in time}. If there’s any amplification scheme, it’ll be one designed by the user, inside QACI, with arbitrarily long to figure it out.
What does “bajillion” actually mean in Step 1?
As described above, we don’t actually pre-determine the length of the sequence, or in fact the shape of the graph at all. Each iteration decides whether to spawn one or several next iterations, or indeed to spawn an arbitrarily different long-reflection process.
Why are we doing Step 3? Wouldn’t it be better to just use M directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.
Why not ask M for the policy π directly? Or some instruction for constructing π? The instruction could be “Build the policy using our super-duper RL algo with the following reward function...” but it could be anything.
Hopefully my correction above answers these.
What if there’s no reward function that should be maximised? Presumably the reward function would need to be “small”, i.e. less than a Exabyte, which imposes a maybe-unsatisfiable constraint.
(Again, untractable-to-naively-compute utility function*, not easily-trained-on reward function. If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions?)
I guess this is kinda philosophical? I have some short thoughts on this here. If an exabyte is enough to describe {a communication channel with a human-on-earth} to an AI-on-earth, which I think seems likely, then it’s enough to build “just have a nice corrigible assistant ask the humans what they want”-type channels.
Put another way: if there are actions which are preferable to other actions, then it seems to me like utility functions are a fully lossless way for counterfactual QACI users to express which kinds of actions they want the AI to perform, which is all we need. If there’s something wrong with utility functions over worlds, then counterfactual QACI users can output a utility function which favors actions which lead to something other than utility maximization over worlds, for example actions which lead to the construction of a superintelligent corrigible assistant which will help the humans come up with a better scheme.
Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign H as H with oracle access to M.
Again, I don’t get IDA. Iteration doesn’t seem particularly needed? Note that inside QACI, the user does have access to an oracle and to all the relevant pieces of the hypothesis they are inhabiting — this is what, in the QACI math, this line does:
QACI0’s distribution over answers demands that the answer payload πr be interpreted as math, with all required contextual variables (q,μ1,μ2,α,γq,ξ) passed as input.
Notably, α is the hypothesis for which world the user is being considered in, and γq,ξ for their location within that world. Those are sufficient to fully characterize the hypothesis-for-H that describes them. And because the user doesn’t really return just a string but a math function which takes q,μ1,μ2,α,γq,ξ as input and returns a string, they can have that math function do arbitrary work — including rederiving H. In fact, rederiving H is how they call a next iteration: they say (except in math) “call H again (rederived using q,μ1,μ2,α,γq,ξ), but with this string, and return the result of that.” See also this illustration, which is kinda wrong in places but gets the recursion call graph thing right.
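To spell out the recursion in code-shaped pseudocode (a sketch only: qaci_step and context are names I am making up here, and in real QACI the returned object is a piece of math evaluated inside the hypothesis, not a program):

def qaci_step(payload, context):
    # Stand-in for one counterfactual question-answer step H: the user reads the
    # 1GB payload, and their answer is in effect a function of the contextual
    # variables (q, mu1, mu2, alpha, gamma, xi), bundled here as `context`.
    ...

def example_answer(context):
    # One answer the counterfactual user could return: rederive H from the
    # contextual variables and call it again with a new payload, i.e.
    # "call H again, but with this string, and return the result of that".
    return qaci_step("notes for my next counterfactual self ...", context)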
Another reason to do “iteration” like this inside the counterfactual rather than in the actual factual world (if that’s what IDA does, which I’m only guessing here) is that we don’t have as many iteration steps as we want in the factual world — eventually OpenAI or someone else kills everyone, whereas in the counterfactual, the QACI users are the only ones who can make progress, so the QACI users essentially have as long as they want, so long as they don’t take too long in each individual counterfactual step or take other somewhat easily avoided actions like that.
Why isn’t Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from π and ask M to use those trajectories to improve the reward function.
Unclear if this still means anything given the rest of this post. Ask me again if it does.
Thanks Tamsin! Okay, round 2.
My current understanding of QACI:
1. We assume a set Ω of hypotheses about the world. We assume the oracle’s beliefs are given by a probability distribution μ∈ΔΩ.
2. We assume sets Q and A of possible queries and answers respectively. Maybe these are exabyte files, i.e. Q≅A≅{0,1}^N for N=2^60.
3. Let Φ be the set of mathematical formulae that Joe might submit. These formulae are given semantics eval(ϕ):Ω×Q→ΔA for each formula ϕ∈Φ.[1]
4. We assume a function H:Ω×Q→ΔΦ where H(α,q)(ϕ)∈[0,1] is the probability that Joe submits formula ϕ after reading query q, under hypothesis α.[2]
5. We define QACI:Ω×Q→ΔA as follows: sample ϕ∼H(α,q), then sample a∼eval(ϕ)(α,q), then return a.
6. For a fixed hypothesis α, we can interpret the answer a∼QACI(α,‘‘Best utility function?") as a utility function uα:Π→R via some semantics eval-u:A→(Π→R).
7. Then we define u:Π→R via integrating over μ, i.e. u(π):=∫uα(π)dμ(α).
8. A policy π∗∈Π is optimal if and only if π∗∈argmaxΠ(u).
The hope is that μ, eval, eval-u, and H can be defined mathematically. Then the optimality condition can be defined mathematically.
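As a sanity check of my reading, here are steps 1-8 as (wildly uncomputable) pseudocode, pretending Ω is finite; sample_H, eval_formula, eval_u, mu, hypotheses and policies are placeholder names for the mathematically-defined objects above:

def qaci(alpha, q):
    # step 5: sample Joe's formula under hypothesis alpha, then evaluate it
    phi = sample_H(alpha, q)
    return eval_formula(phi, alpha, q)

def u(pi):
    # steps 6-7: read each hypothesis's answer as a utility function over
    # policies, then integrate (here: sum) over the oracle's beliefs mu
    return sum(
        mu[alpha] * eval_u(qaci(alpha, "Best utility function?"))(pi)
        for alpha in hypotheses
    )

# step 8: a policy is optimal iff it maximises u, e.g.
# pi_star = max(policies, key=u)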
Question 0
What if there’s no policy which maximises u:Π→R? That is, for every policy π there is another policy π′ such that u(π′)>u(π). I suppose this is less worrying, but what if there are multiple policies which maximise u?
Question 1
In Step 7 above, you average all the utility functions together, whereas I suggested sampling a utility function. I think my solution might be safer.
Suppose the oracle puts 5% chance on hypotheses such that QACI(α,−) is malign. I think this is pretty conservative, because the Solomonoff predictor is malign, and because of some of the concerns Evhub raises here. And the QACI amplification might not preserve benignancy. It follows that, under your solution, u:Π→R is influenced by a coalition of malign agents, and similarly π∗∈argmax(u) is influenced by the malign coalition.
By contrast, I suggest sampling α∼μ and then finding π∗∈argmaxΠ(uα). This should give us a benign policy with 95% chance, which is pretty good odds. Is this safer? Not sure.
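Concretely, the two aggregation rules being compared, in the same placeholder pseudocode (u_alpha(alpha) stands for the per-hypothesis utility function uα from step 6):

import random

def policy_from_average(policies, hypotheses, mu, u_alpha):
    # The averaging in step 7: a 5%-mass malign coalition still gets a 5% vote
    # in every comparison between policies.
    return max(policies, key=lambda pi: sum(mu[a] * u_alpha(a)(pi) for a in hypotheses))

def policy_from_sample(policies, hypotheses, mu, u_alpha):
    # The sampling alternative: draw a single hypothesis alpha ~ mu and maximise
    # only u_alpha, so with ~95% probability no malign hypothesis is consulted at all.
    alpha = random.choices(hypotheses, weights=[mu[a] for a in hypotheses])[0]
    return max(policies, key=lambda pi: u_alpha(alpha)(pi))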
Question 2
I think the eval function doesn’t work, i.e. there won’t be a way to mathematically define the semantics of the formula language. In particular, the language Φ must be strictly weaker than the meta-language in which you are hoping to define eval:Φ→(Ω×Q→ΔA) itself. This is because of Tarski’s Undefinability of Truth (and other no-go theorems).
This might seem pedantic, but it matters in practical terms: there’s no formula ϕ whose semantics is QACI itself. You can see this via a diagonal proof: imagine if Joe always writes the formal expression ϕ=‘‘1−QACI(α,q)".
The most elegant solution is probably transfinite induction, but this would give us a QACI for each ordinal.
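To spell out the diagonal step a little (a minimal version, under the simplifying assumptions that answers are single bits and that, under hypothesis α, Joe deterministically submits the offending formula):

% Suppose some formula \phi_0 \in \Phi had the semantics
%   \mathrm{eval}(\phi_0)(\alpha, q) = 1 - \mathrm{QACI}(\alpha, q),
% and suppose H(\alpha, q) puts all its mass on \phi_0. Unfolding the definition of QACI gives
\mathrm{QACI}(\alpha, q) \;=\; \mathrm{eval}(\phi_0)(\alpha, q) \;=\; 1 - \mathrm{QACI}(\alpha, q),
% which has no solution in \{0,1\}. So eval cannot give any formula in \Phi the semantics
% of QACI itself, in line with Tarski-style undefinability.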
Question 3
If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions?
I want to understand how QACI and prosaic ML map onto each other. As far as I can tell, issues with QACI will be analogous to issues with prosaic ML and vice-versa.
Question 4
I still don’t understand why we’re using QACI to describe a utility function over policies, rather than using QACI in a more direct approach.
Here’s one approach. We pick a policy which maximises QACI(α,‘‘How good is policy π?").[3] The advantage here is that Joe doesn’t need to reason about utility functions over policies, he just needs to reason about a single policy in front of him.
Here’s another approach. We use QACI as our policy directly. That is, in each context c that the agent finds themselves in, they sample an action from QACI(α,‘‘What is the best action in context c?") and take the resulting action.[4] The advantage here is that Joe doesn’t need to reason about policies whatsoever, he just needs to reason about a single context in front of him. This is also the most “human-like”, because there are no argmaxes (except if Joe submits a formula with an argmax).
Here’s another approach. In each context c, the agent takes an action y which maximises QACI(α,‘‘How good is action y in context c?").
Etc.
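For concreteness, the three variants above in the same placeholder pseudocode, reusing the hypothetical qaci() from my earlier sketch and glossing over the eval-u-style step that reads its answers back as numbers or actions:

def approach_rate_policies(policies, alpha):
    # Approach 1: ask QACI to rate whole policies, then pick the best-rated one.
    return max(policies, key=lambda pi: qaci(alpha, f"How good is policy {pi}?"))

def approach_qaci_as_policy(alpha, context):
    # Approach 2: use QACI itself as the policy, one query per context.
    return qaci(alpha, f"What is the best action in context {context}?")

def approach_rate_actions(actions, alpha, context):
    # Approach 3: ask QACI to rate individual actions in the current context.
    return max(actions, key=lambda a: qaci(alpha, f"How good is action {a} in context {context}?"))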
Happy to jump on a call if that’s easier.
I think you would say eval:Ω×Q→A. I’ve added the Δ, which simply amounts to giving Joe access to a random number generator. My remarks apply if eval:Ω×Q→A also.
I think you would say H:Ω×Q→Φ. I’ve added the Δ, which simply amounts to including hypotheses that Joe is stochastic. But my remarks apply if H:Ω×Q→Φ also.
By this I mean either:
(1) Sample α∼μ, then maximise the function π↦QACI(α,‘‘How good is policy π?").
(2) Maximise the function π↦∫QACI(α,‘‘How good is policy π?")dμ(α).
For reasons I mentioned in Question 1, I suspect (1) is safer, but (2) is closer to your original approach.
I would prefer the agent samples α∼μ once at the start of deployment, and reuses the same hypothesis α at each time-step. I suspect this is safer than resampling α at each time-step, for reasons discussed before.