«Boundaries» for formalizing an MVP morality
Update: For a better exposition of the same core idea, see Agent membranes and formalizing “safety”.
Here’s one specific way that «boundaries» could directly apply to AI safety.
For context on what “«boundaries»” are, see «Boundaries» and AI safety compilation (also see the tag for this concept).
In this post, I will focus on the way Davidad conceives using «boundaries» within his Open Agency Architecture safety paradigm. Essentially, Davidad’s hope is that «boundaries» can be used to formalize a sort of MVP morality for the first AI systems.
Update: Davidad left a comment endorsing this post. He also later tweeted about it in a twitter reply.[1]
Why «boundaries»?
So, in an ideal future, we would get CEV alignment in the first AGI.
However, this seems really hard, and it might be easier to get AI x-risk off the table first (thus ending the “acute risk period”), and then figure out how to do the rest of alignment later.[2]
In which case, we don’t actually need the first AGI to understand all of human values/ethics; we only need it to understand a minimal subset that ensures safety.
But which subset? And how could it be formalized in a consistent manner?
This is where the concept of «boundaries» comes in, because the concept has two nice properties:
«boundaries» seem to explain what’s bad about a bunch of actions whose badness is otherwise difficult to explain.
I.e.: Actions are bad when they violate an agent’s boundaries or autonomy.
This is most directly described in the «Boundaries» Sequence; other (indirect) examples can be found at https://arbital.com/p/low_impact/.
«Boundaries» might result in a framework that looks like deontology but applies more generally.
«boundaries» seem possible to formalize algorithmically.
The hope, then, is that the «boundaries» concept could be formalized into a sort of MVP morality that could be used in the first AI system(s).
Concretely, one way Davidad envisions implementing «boundaries» is by tasking an AI system to minimize the occurrence of ~objective «boundary» violations for its citizens.
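Spelled out as an objective (my own notation, not Davidad’s; a hedged sketch of one way “minimize the occurrence of «boundary» violations” could be read):

$$\pi^{*} \in \arg\min_{\pi}\; \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t} \sum_{i \in \text{citizens}} \mathbf{1}\big[\text{«boundary» of citizen } i \text{ violated at time } t\big]\right]$$

That is, pick the policy whose trajectories contain the fewest expected boundary-violation events.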
That said, I disagree with such an implementation and I will propose an alternative in another post.
Also related: Acausal normalcy
Quotes from Davidad that support this view
(All bolding below is mine.)
Davidad tweeted in 2022 Aug:
Post-acute-risk-period, I think there ought to be a “night watchman Singleton”: an AGI which technically satisfies Bostrom’s definition of a Singleton, but which does no more and no less than ensuring a baseline level of security for its citizens (which may include humans & AIs).
next tweet:
If and only if a night-watchman singleton is in place, then everyone can have their own AI if they want. The night-watchman will ensure they can’t go to war. The price of this is that if the night-watchman ever suffers a robustness failure, it’s game over.
later in the thread:
The utility function of a night-watchman singleton is the minimum over all citizens of the extent to which their «boundaries» are violated (with violations being negative and no violations being zero) and the extent to which they fall short of baseline access to natural resources.
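One way to write that utility function down explicitly (my own notation, as a hedged reading of the tweet; the symbols $\tau$, $V_i$, and $R_i$ are not Davidad’s):

$$U(\tau) \;=\; \min_{i \in \text{citizens}} \min\!\big(V_i(\tau),\, R_i(\tau)\big), \qquad V_i(\tau),\, R_i(\tau) \in (-\infty, 0],$$

where $V_i(\tau)$ scores how badly citizen $i$’s «boundaries» are violated over trajectory $\tau$ (0 when not at all) and $R_i(\tau)$ scores how far citizen $i$ falls short of baseline access to natural resources (0 when not at all).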
Davidad in AI Neorealism: a threat model & success criterion for existential safety (2022 Dec):
For me the core question of existential safety is this:
Under these conditions, what would be the best strategy for building an AI system that helps us ethically end the acute risk period without creating its own catastrophic risks that would be worse than the status quo?
It is not, for example, “how can we build an AI that is aligned with human values, including all that is good and beautiful?” or “how can we build an AI that optimises the world for whatever the operators actually specified?” Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).
Davidad in An Open Agency Architecture for Safe Transformative AI (2022 Dec):
Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in (−∞,0], such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible with high probability, using scientifically-accessible technologies.
I am optimistic about this largely because of recent progress toward formalizing a natural abstraction of boundaries by Critch and Garrabrant. I find it quite plausible that there is some natural abstraction property Q of world-model trajectories that lies somewhere strictly within the vast moral gulf of
All Principles That Human CEV Would Endorse ⇒ Q ⇒ Don't Kill Everyone
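To make the shape of this hypothesis concrete, here is a minimal sketch (my own illustration, not OAA code; `Trajectory`, `Feature`, and `tolerance` are hypothetical names) of what “a set of features of finite trajectories, valued in (−∞,0], all being near 0” could look like:

```python
# A minimal sketch, assuming a trajectory is just a finite sequence of
# world-model states; `Trajectory`, `Feature`, and `tolerance` are my own
# hypothetical names, not part of OAA.
from typing import Callable, Sequence

Trajectory = Sequence[dict]              # a finite sequence of world-model states
Feature = Callable[[Trajectory], float]  # a safety feature, valued in (-inf, 0]

def all_features_near_zero(features: Sequence[Feature],
                           trajectory: Trajectory,
                           tolerance: float = 1e-3) -> bool:
    """Return True iff every safety feature is 'near 0' on this trajectory.

    Under the Deontic Sufficiency Hypothesis, all such features being near 0
    is supposed to imply a high probability of existential safety.
    """
    return all(feature(trajectory) >= -tolerance for feature in features)
```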
Also see this tweet from Davidad in 2023 Feb:
In the situation where new powerful AIs with alien minds may arise (if not just between humans), I believe that a “night watchman” which can credibly threaten force is necessary, although perhaps all it should do is to defend such boundaries (including those of aggressors).
Further explanation of the OAA’s Deontic Sufficiency Hypothesis in Davidad’s Bold Plan for Alignment: An In-Depth Explanation (2023 Apr) by Charbel-Raphaël and Gabin:
Getting traction on the deontic feasibility hypothesis
Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don’t die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don’t die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
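As a toy illustration of the kind of desideratum this points at (my own sketch, not Davidad’s formalism; the `PartitionedState` fields and `caused_by` structure are hypothetical), one could define a (−∞,0]-valued feature that penalizes causal influences jumping from external variables directly into a citizen’s internal variables without passing through the blanket:

```python
# A toy sketch, not Davidad's formalism: each citizen's state is partitioned
# Markov-blanket-style into internal / blanket / external variables, and the
# feature penalizes direct external-to-internal causal influences.
from dataclasses import dataclass, field

@dataclass
class PartitionedState:
    internal: dict   # variables "inside" the citizen's boundary
    blanket: dict    # interface (sensor/actuator) variables
    external: dict   # the rest of the world
    caused_by: dict = field(default_factory=dict)  # variable name -> set of parent variable names

def boundary_violation_feature(trajectory: list[PartitionedState]) -> float:
    """A (-inf, 0]-valued feature: 0 if no internal variable is ever directly
    caused by an external variable; more negative the more such crossings occur."""
    crossings = 0
    for state in trajectory:
        for var in state.internal:
            parents = state.caused_by.get(var, set())
            if parents & set(state.external):
                crossings += 1
    return -float(crossings)
```

This is obviously far cruder than the world-models OAA actually envisions; it is only meant to show the shape of a boundary-crossing constraint that takes values in (−∞,0].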
Also:
(*) Elicitors: Language models assist humans in expressing their desires using the formal language of the world model. […] Davidad proposes to represent most of these desiderata as violations of Markov blankets. Most of those desiderata are formulated as negative constraints because we just want to avoid a catastrophe, not solve the full value problem. But some of the desiderata will represent the pivotal process that we want the model to accomplish.
(The post also explains that the “(*)” prefix means “Important”, as distinct from “not essential”.)
This comment by Davidad (2023 Jan):
Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.
From Reframing inner alignment by Davidad (2022 Dec):
I’m also excited about Boundaries as a tool for specifying a core safety property to model-check policies against—one which would imply (at least) nonfatality—relative to alien and shifting predictive ontologies.
From A list of core AI safety problems and how I hope to solve them (2023 Aug):
9. Humans cannot be first-class parties to a superintelligence values handshake.
[…]
OAA Solution: (9.1) Instead of becoming parties to a values handshake, keep superintelligent capabilities in a box and only extract plans that solve bounded tasks for finite time horizons and verifiably satisfy safety criteria that include not violating the natural boundaries of humans. This can all work without humans ever being terminally valued by AI systems as ends in themselves.
[1] FWIW, he left this comment before I simplified this post a lot on 2023 Sept 15.
[2] P.S.: Davidad explains this directly in A list of core AI safety problems and how I hope to solve them.