I think this is brilliant as a direction to think in, but I’m object-level skeptical. I could be missing important details.
Summary of what I think I understand
A superintelligent AI is built[1] to optimise for QACI(A,B,“what should the utility function be?”).
That function effectively tells the AI to figure out “If you extrapolate from the assumption that uniquely-identifiable-B was actually ‘what should the utility function be?’ (ceteris paribus), what would uniquely-identifiable-A have been?”, and then to take its own best guess about that as its objective function.
In practice, that may look something like this: The AI starts looking for referents of A and B, and learns that the only knowable instances are in the past. Assuming it does counterfactual reasoning somewhat like humans, it will then try to reconstruct/simulate the situation in as much heuristically relevant detail as possible. Finally, it runs the experiment forwards from the earliest possible time the counterfactual assumption can be inserted (i.e. when Cindy ran the program that produced B).
(Depending on how it’s built, the AI has already acquired a probability distribution over what its utility function could be, and this includes some expectation that it will reflect Cindy’s values. Therefore, it plausibly tries to minimise side-effects.)
At some point in this process, it notices that the contents of A are extremely sensitive to what went on in Cindy’s brain at time T, so brain-scanning her is the obvious move for accuracy and repeatability.
In the simulation, Cindy sees with joy that B is “what should the utility function be?”, so she gets to work. Not willing to risk delaying more than 24-ish hours (for reasons), she finally writes into A: QACI(A,B,“Hi! Cindy-2 here. These are my notes: [...]. You can do it!”).
As long as that is the AI’s best guess, it is now motivated to repeat the experiment with the new message. This allows successive Cindys to work[2] on the problem until one of them declares success and writes a function plainly into A.
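To check that I’m paraphrasing the scheme correctly, here’s the loop I have in mind as a toy Python sketch. Everything here is my own naming and hard-coded stand-ins (simulate_counterfactual_cindy in particular is doing all the real work and would in reality be the AI’s own world-model), not something from the post:

```python
from dataclasses import dataclass

@dataclass
class QACICall:
    """Stand-in for a QACI(A, B, payload) expression."""
    payload: str  # the counterfactual contents of B

@dataclass
class UtilityFunction:
    """Stand-in for a plainly written-out utility function."""
    spec: str

def simulate_counterfactual_cindy(payload: str):
    """Mock of the AI's best-guess simulation: reconstruct the moment B was
    produced, counterfactually replace B's contents with `payload`, run
    forwards ~24h, and return what this Cindy writes into A.
    Hard-coded here; in the scheme it's the AI's world-model."""
    if "notes" in payload:
        # A later Cindy with enough accumulated notes declares success.
        return UtilityFunction(spec="<the answer, written out plainly>")
    # Otherwise she hands her notes to the next iteration.
    return QACICall(payload="Hi! Cindy-2 here. These are my notes: [...]. You can do it!")

def resolve(call: QACICall, max_iterations: int = 100):
    """Iterate until some simulated Cindy outputs a plain function."""
    for _ in range(max_iterations):
        answer = simulate_counterfactual_cindy(call.payload)
        if isinstance(answer, UtilityFunction):
            return answer  # adopt this as the objective
        call = answer      # best guess for A is another QACI call: repeat
    raise RuntimeError("no fixed point found")  # see my worry about loops below

print(resolve(QACICall(payload="what should the utility function be?")))
```

The sketch mostly serves to show where the weight falls: everything interesting is hidden inside the counterfactual simulation step.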
Implementation details
The AI might be uncertain about how precise its simulations are, in which case it could want to run a series of these experiments with varying seeds before adopting whatever function the plurality converges to. The uncertainty compounds across iterations, however, so simulation-batches that output answers in fewer iterations (for whatever reason) will carry more weight.
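Concretely, the kind of weighting I have in mind looks something like this (a toy construction of mine, assuming each simulated hand-off is trusted with some fixed per-step confidence; none of these names come from the post):

```python
from collections import Counter

def weighted_plurality(batch_results, per_step_confidence=0.9):
    """batch_results: list of (answer, num_iterations) pairs, one per seed.
    Uncertainty compounds multiplicatively per simulated hand-off, so an
    answer reached after k iterations is discounted by per_step_confidence**k."""
    weights = Counter()
    for answer, k in batch_results:
        weights[answer] += per_step_confidence ** k
    return weights.most_common(1)[0]

# Toy example: three seeds agree on U1 but took many hops; one seed reached U2 quickly.
print(weighted_plurality([("U1", 12), ("U1", 15), ("U1", 11), ("U2", 2)]))
# -> U2 wins despite being the minority answer.
```

In this toy setup, a quick answer from a minority of seeds can outweigh a slower consensus, which is the effect I’m pointing at.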
I’m not sure QACI will be interpreted as transitive between simulations by default. I think it depends on choices about the degrees of freedom in the logic used to interpret QACI(QACI()), when both the inner and outer call depend on mutually exclusive counterfactuals over the same piece of reality (or the same variable). Each step is locally coherent, but you could get stuck in a repeating loop.
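A cartoon of the loop I mean (toy code, all names mine; the point is only that each local resolution step can be reasonable while the nesting never bottoms out):

```python
def simulated_answer(b_contents: str) -> str:
    """Toy model of 'what gets written into A, given counterfactual B'.
    Each individual answer is locally coherent, but the two counterfactuals
    point at each other."""
    if b_contents == "q0":
        return "QACI(q1)"   # under B=q0, the answer defers to the B=q1 counterfactual
    if b_contents == "q1":
        return "QACI(q0)"   # ...and vice versa
    return "U(x) = ..."     # a plain answer for anything else

def naive_resolve(b_contents: str, budget: int = 10):
    """Resolver that keeps re-entering the inner counterfactual."""
    seen = []
    for _ in range(budget):
        seen.append(b_contents)
        answer = simulated_answer(b_contents)
        if not answer.startswith("QACI("):
            return answer, seen
        b_contents = answer[len("QACI("):-1]  # recurse into the inner call
    return None, seen  # never settles: q0 -> q1 -> q0 -> ...

print(naive_resolve("q0"))  # (None, ['q0', 'q1', 'q0', 'q1', ...])
```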
We can’t take for granted that arbitrary superintelligences will have human heuristics for what counts as “correct” counterfactual reasoning. It seems shaky to rely on it. (I notice you discuss this in the comments.)
Why I don’t think it works
It does not seem to do anything for inner alignment afaict, and it seems too demanding and leaky to solve outer alignment.
I don’t see how to feasibly translate QACI() into actual code that causes an AI to use it as a target for all its goal-seeking abilities.
Even if it were made into a loss function, you can’t train a transformer on it without relevant training data.
If the plan is to first train a transformer-ish AI on normal data and only later swap in the new objective function (assuming it were made into one), then the network will already have encoded proxies for its old objective, and that influence will (optimistically) see a long-tailed exponential decay with further training.
If instead the plan is to first train an instruction-executing language model with sufficient understanding of human-symbolic math or English, this seems risky for traditional reasons but might be the most feasible way to try to implement it. I think this direction is worth exploring.
A mathematically precise objective function (though I disagree that the term applies here) doesn’t matter when you have a neural net trying its best to translate it into imprecise proxies that actually work in its environment.
I recommend going all-in on building one (an actual implementation). I suspect this is the bottleneck, and going full speed on your current course is often the best way to discover that you need to find a better course, or, indeed, to win. Uncertainty does not imply slowing down.
[2] This ends up looking like a sort of serial processor constrained by a sharp information bottleneck between iterations.