Here’s a recent attempt of mine at a distillation of a fragment of this plan, copied over from a discussion elsewhere:
goal: make there be a logical statement such that a proof of that statement solves the strawberries-on-a-plate problem (or w/e).
summary of plan:
the humans put in a herculean effort to build a multi-level world-model that is interpretable to them (ranging from quantum chemistry at the lowest level, to strawberries and plates at the top)
we interpret this in a very conservative way, as a convex set of models that hopefully contains something pretty close to reality.
we can then form the logical statement “this policy puts two cellularly-but-not-molecularly identical strawberries on a plate (and has no other high-level effect) with pretty-high probability across all of those models simultaneously”
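(To spell that statement out a bit, using my own placeholder notation rather than anything from Davidad: write $\mathcal{M}$ for the convex set of models, $\pi$ for the policy, and $\varepsilon$ for some small tolerance. Then the statement is roughly

$$\forall M \in \mathcal{M}:\quad \Pr_M\big[\text{two cellularly-but-not-molecularly identical strawberries on the plate} \,\wedge\, \text{no other high-level effect} \;\big|\; \pi\big] \;\ge\; 1 - \varepsilon.$$
)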
background context: there are some fancy tools with very nice mathematical properties for combining probabilistic reasoning and worst-case reasoning.
key hope: these tools will let us interpret this “manual multi-level world-model” in a way that keeps enough of the probabilistic nature for tractable policies to exist, and enough of the adversarial nature for this constraint to be pretty tight.
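(For concreteness, here's my gloss on what such tools look like, not necessarily the exact formalism Davidad has in mind: in imprecise-probability-style setups, a convex set of models $\mathcal{M}$ gets scored by its lower probability/expectation, which is probabilistic within each model but worst-case across models:

$$\underline{\Pr}(E) \;=\; \inf_{M \in \mathcal{M}} \Pr_M(E), \qquad \underline{\mathbb{E}}[f] \;=\; \inf_{M \in \mathcal{M}} \mathbb{E}_M[f].$$

In that notation, the statement above is just the requirement that $\underline{\Pr}(\text{strawberries on plate, no other high-level effect} \mid \pi) \ge 1 - \varepsilon$.)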
in more concrete terms: for any given UFAI, somewhere in the convex hull of all the interpretations of the manual multi-level world model, there’s a model that sees (in its high-level) the shady shit that the UFAI was hoping to slip past us. So such “shady” policies fail in the worst-case, and fail to satisfy the theorem. But also enough of the probabilistic nature is retained that your policies don’t need to handle the literal worst-cases of thermodynamic heat, and so there are some “reasonable” policies that could satisfy the theorem.
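(Here's a minimal numerical toy of that picture, entirely my own illustration rather than anything from Davidad; the model names, policies, and numbers are made up. Represent the convex set by a handful of extreme-point models, and accept a policy only if its probability of success-with-no-other-high-level-effect clears a threshold in every one of them. Since the probability of an event is linear in the distribution, the worst case over the whole convex hull is attained at an extreme point, so it suffices to check the listed models.

```python
# Toy credal-set check (my own illustration, not Davidad's formalism).
# Each model assigns, for each policy, probabilities to three coarse outcomes:
#   "ok"          = strawberries on the plate, no other high-level effect
#   "side_effect" = some other high-level effect that this model can see
#   "fail"        = task not accomplished
models = {
    "model_A": {
        "reasonable_policy": {"ok": 0.95, "side_effect": 0.01, "fail": 0.04},
        "shady_policy":      {"ok": 0.99, "side_effect": 0.00, "fail": 0.01},
    },
    # This interpretation's high level happens to register the shady policy's hidden effect.
    "model_B": {
        "reasonable_policy": {"ok": 0.93, "side_effect": 0.02, "fail": 0.05},
        "shady_policy":      {"ok": 0.10, "side_effect": 0.89, "fail": 0.01},
    },
}

def satisfies_theorem(policy: str, threshold: float = 0.9) -> bool:
    """Worst-case check: "ok" must have probability >= threshold in every extreme model."""
    return all(outcomes[policy]["ok"] >= threshold for outcomes in models.values())

print(satisfies_theorem("reasonable_policy"))  # True: merely decent, but decent everywhere
print(satisfies_theorem("shady_policy"))       # False: model_B sees the shady high-level effect
```

The shady policy looks great under most interpretations but fails the worst-case check, because one corner of the hull notices its hidden effect; the reasonable policy only needs to be pretty good under every interpretation, not robust to literal worst-case physics, which is the probabilistic slack that's supposed to keep the theorem satisfiable at all.)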
capabilities requirements: the humans need to be able to construct the world model; something untrusted and quite capable needs to search for proofs of the theorem; the policy extracted from said theorem is then probably an AGI with high capabilities but you’ve (putatively) proven that all it does is put strawberries on a plate and shut down so \shrug :crossed_fingers: hopefully that proof binds to reality.
(note: I’m simply attempting to regurgitate the idea here; not defend it. obvious difficulties are obvious, like “the task of finding such a policy is essentially the task of building and aligning an AGI” and “something that can find that policy is putting adversarial pressure on your theorem”. even if proving the theorem requires finding a sufficiently-corrigible AGI, it would still be rad to have a logical statement of this form (and perhaps there’s even some use to it if it winds up not quite rated for withstanding superintelligent adversaries?).)
Anticipating an obvious question: yes, I observed to Davidad that the part where we imagine convex sets of distributions that contain enough of the probabilistic nature to admit tractable policies and enough of the worst-case nature to prevent UFAI funny business is where a bunch of the work is being done, and that if it works then there should be a much smaller example of it working, and probably some minimal toy example where it’s easy to see that the only policies that satisfy the analogous theorem are doing some new breed of optimization that is neither meliorization nor satisficing, and that is somehow more mild. And (either I’m under the illusion of transparency or) Davidad agreed that this should be possible, and claims it is on his list of things to demonstrate.
(note: this is the fragment of Davidad’s plan that I was able to distill out into something that made sense to me; I suspect he thinks of this as just one piece among many. I welcome corrections :-))