«Boundaries» for formalizing an MVP morality
Update: For a better exposition of the same core idea, see Agent membranes and formalizing “safety”.
Here’s one specific way that «boundaries» could directly apply to AI safety.
For context on what “«boundaries»” are, see «Boundaries» and AI safety compilation (also see the tag for this concept).
In this post, I will focus on the way Davidad conceives using «boundaries» within his Open Agency Architecture safety paradigm. Essentially, Davidad’s hope is that «boundaries» can be used to formalize a sort of MVP morality for the first AI systems.
Update: Davidad left a comment endorsing this post. He also later mentioned it in a Twitter reply.[1]
Why «boundaries»?
So, in an ideal future, we would get CEV alignment in the first AGI.
However, this seems really hard, and it might be easier to get AI x-risk off the table first (thus ending the “acute risk period”), and then figure out how to do the rest of alignment later.[2]
In that case, we don't actually need the first AGI to understand all of human values and ethics; we only need it to understand a minimal subset that ensures safety.
But which subset? And how could it be formalized in a consistent manner?
This is where the concept of «boundaries» comes in, because the concept has two nice properties:
1. «Boundaries» seem to explain what is bad about many actions whose badness is otherwise hard to articulate.
   - That is: actions are bad when they violate an agent's boundaries or autonomy.
   - This is described most directly in the «Boundaries» Sequence, and other (indirect) examples can be found at https://arbital.com/p/low_impact/.
   - «Boundaries» might thus yield a framework that looks like deontology but applies more generally.
2. «Boundaries» seem possible to formalize algorithmically.
The hope, then, is that the «boundaries» concept could be formalized into a sort of MVP morality that could be used in the first AI system(s).
Concretely, one way Davidad envisions implementing «boundaries» is by tasking an AI system with minimizing the occurrence of (approximately objective) «boundary» violations for its citizens.
That said, I disagree with such an implementation and I will propose an alternative in another post.
Also related: Acausal normalcy
Quotes from Davidad that support this view
(All bolding below is mine.)
Davidad tweeted in 2022 Aug:
Post-acute-risk-period, I think there ought to be a “night watchman Singleton”: an AGI which technically satisfies Bostrom’s definition of a Singleton, but which does no more and no less than ensuring a baseline level of security for its citizens (which may include humans & AIs).
If and only if a night-watchman singleton is in place, then everyone can have their own AI if they want. The night-watchman will ensure they can't go to war. The price of this is that if the night-watchman ever suffers a robustness failure it's game [over].
The utility function of a night-watchman singleton is the minimum over all citizens of the extent to which their «boundaries» are violated (with violations being negative and no violations being zero) and the extent to which they fall short of baseline access to natural resources
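One way to transcribe that objective in symbols. This is only my own gloss of the tweet: $V_i$ (boundary-violation extent) and $R_i$ (resource-access shortfall) are hypothetical nonpositive features, not anything Davidad has specified.

$$U(\tau) \;=\; \min_{i \,\in\, \text{citizens}} \min\!\big(V_i(\tau),\; R_i(\tau)\big), \qquad V_i(\tau),\, R_i(\tau) \in \mathbb{R}_{\le 0}$$

Maximizing $U$ then means holding every citizen at $0$: no boundary violations, and no shortfall below baseline access to natural resources.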
Davidad in AI Neorealism: a threat model & success criterion for existential safety (2022 Dec):
For me the core question of existential safety is this:
[…]
It is not, for example, “how can we build an AI that is aligned with human values, including all that is good and beautiful?” or “how can we build an AI that optimises the world for whatever the operators actually specified?” Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).
Davidad in An Open Agency Architecture for Safe Transformative AI (2022 Dec):
Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in […], such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible with high probability, using scientifically-accessible technologies.
I am optimistic about this largely because of recent progress toward formalizing a natural abstraction of boundaries by Critch and Garrabrant. I find it quite plausible that there is some natural abstraction property of world-model trajectories that lies somewhere strictly within the vast moral gulf of […]
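Read schematically, the hypothesis says something like the following. This is my paraphrase only: $\mathcal{T}$ is the set of finite world-model trajectories, and I assume the features take nonpositive values with $0$ meaning “no violation”, matching the “violations are negative” convention in the night-watchman tweet above (the value range in the quoted text did not survive extraction).

$$\exists\, \phi_1, \dots, \phi_n : \mathcal{T} \to \mathbb{R}_{\le 0} \quad \text{such that} \quad \big(\forall i:\ \phi_i(\tau) \approx 0\big) \;\Rightarrow\; \text{high probability of existential safety}$$

together with the feasibility claim: with high probability, some policy realizable with scientifically accessible technology keeps every $\phi_i$ saturated at (or near) $0$.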
Also see this tweet from Davidad in 2023 Feb:
In the situation where new powerful AIs with alien minds may arise (if not just between humans), I believe that a “night watchman” which can credibly threaten force is necessary, although perhaps all it should do is to defend such boundaries (including those of aggressors).
Further explanation of the OAA’s Deontic Sufficiency Hypothesis in Davidad’s Bold Plan for Alignment: An In-Depth Explanation (2023 Apr) by Charbel-Raphaël and Gabin:
Getting traction on the deontic feasibility hypothesis
Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don’t die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don’t die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
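As a deliberately toy illustration of the kind of constraint being described here (the AgentBoundary class, the variable names, and the notion of “violation” below are my own simplifications, not Davidad's formalism or actual Markov-blanket machinery):

```python
from dataclasses import dataclass

@dataclass
class AgentBoundary:
    """Toy partition of the world-state variables associated with one agent."""
    internal: set[str]   # variables "inside" the agent (organs, memories, ...)
    boundary: set[str]   # interface variables (senses, skin, communication channels)

def violates_boundary(agent: AgentBoundary, directly_influenced: set[str]) -> bool:
    """In this toy model, an external intervention violates the agent's boundary
    iff it writes directly to internal variables instead of acting only through
    the boundary/interface variables."""
    return bool(directly_influenced & agent.internal)

alice = AgentBoundary(
    internal={"alice.neurons", "alice.organs"},
    boundary={"alice.senses", "alice.skin"},
)
print(violates_boundary(alice, {"alice.senses"}))   # False: acts via the interface
print(violates_boundary(alice, {"alice.neurons"}))  # True: crosses the blanket
```

Note how the cancer example from the quote falls out of this framing: a tumor is a change among internal variables, so no outside intervention crossed the blanket and the constraint has nothing to say about it, whereas violence and pandemics enter from outside.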
Also:
(*) Elicitors: Language models assist humans in expressing their desires using the formal language of the world model. […] Davidad proposes to represent most of these desiderata as violations of Markov blankets. Most of those desiderata are formulated as negative constraints because we just want to avoid a catastrophe, not solve the full value problem. But some of the desiderata will represent the pivotal process that we want the model to accomplish.
(The post also explains that the “(*)” prefix means “Important”, as distinct from “not essential”.)
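To make the “negative constraints vs. pivotal task” distinction concrete, here is a minimal sketch. All names are hypothetical; the real desiderata live in OAA's formal world-model language, not in Python.

```python
from typing import Callable

# A trajectory is a finite sequence of world-model states (a toy stand-in here).
Trajectory = list[dict]

# Negative constraints: features that must never be violated at any step.
negative_constraints: dict[str, Callable[[dict], bool]] = {
    "no_markov_blanket_crossing": lambda state: not state.get("blanket_crossed", False),
    "no_resource_deprivation": lambda state: state.get("has_baseline_resources", True),
}

# Pivotal-process desiderata: something the plan must actually accomplish.
pivotal_desiderata: dict[str, Callable[[Trajectory], bool]] = {
    "bounded_task_completed": lambda traj: bool(traj) and traj[-1].get("task_done", False),
}

def acceptable(traj: Trajectory) -> bool:
    """A trajectory is acceptable iff no negative constraint is violated at any
    step and every pivotal desideratum holds for the trajectory as a whole."""
    constraints_ok = all(check(state)
                         for check in negative_constraints.values()
                         for state in traj)
    goals_ok = all(check(traj) for check in pivotal_desiderata.values())
    return constraints_ok and goals_ok

# Example: a trajectory that completes the task without crossing any blanket.
print(acceptable([{"task_done": False}, {"task_done": True}]))  # True
```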
This comment by Davidad (2023 Jan):
Not listed among your potential targets is “end the acute risk period” or more specifically “defend the boundaries of existing sentient beings,” which is my current favourite. It’s nowhere near as ambitious or idiosyncratic as “human values”, yet nowhere near as anti-natural or buck-passing as corrigibility.
From Reframing inner alignment by Davidad (2022 Dec):
I’m also excited about Boundaries as a tool for specifying a core safety property to model-check policies against—one which would imply (at least) nonfatality—relative to alien and shifting predictive ontologies.
From A list of core AI safety problems and how I hope to solve them (2023 Aug):
9. Humans cannot be first-class parties to a superintelligence values handshake.
[…]
OAA Solution: (9.1) Instead of becoming parties to a values handshake, keep superintelligent capabilities in a box and only extract plans that solve bounded tasks for finite time horizons and verifiably satisfy safety criteria that include not violating the natural boundaries of humans. This can all work without humans ever being terminally valued by AI systems as ends in themselves.
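A skeletal sketch of that workflow, purely as I understand it from the quoted description (the Planner and Verifier interfaces and the criteria names are hypothetical placeholders, not OAA components):

```python
from typing import Optional, Protocol

class Planner(Protocol):
    """Boxed capability system: proposes plans but never acts in the world."""
    def propose(self, task: str, horizon: int) -> dict: ...

class Verifier(Protocol):
    """Independent checker: model-checks a plan against formal safety criteria."""
    def check(self, plan: dict, criteria: list[str]) -> bool: ...

SAFETY_CRITERIA = [
    "bounded_task",                  # the plan solves only the specified task
    "finite_time_horizon",           # and only within the given horizon
    "no_human_boundary_violations",  # and never crosses humans' natural boundaries
]

def extract_plan(planner: Planner, verifier: Verifier,
                 task: str, horizon: int, max_tries: int = 100) -> Optional[dict]:
    """Only plans that pass verification ever leave the box; otherwise fail closed."""
    for _ in range(max_tries):
        candidate = planner.propose(task, horizon)
        if verifier.check(candidate, SAFETY_CRITERIA):
            return candidate
    return None  # no verified plan is better than an unverified one
```

The point of the sketch is just the control flow: humans never need to be terminally valued by the planner, because the only thing that ever escapes the box is a plan that has already been checked against the boundary criteria.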
[1] FWIW he left this comment before I simplified this post a lot on 2023 Sept 15.
[2] P.S.: Davidad explains this directly in A list of core AI safety problems and how I hope to solve them.