Safety First: safety before full alignment. The deontic sufficiency hypothesis.
It could be the case that these two goals are separable and independent:
“AI safety”: avoiding existential risk, s-risk, actively negative outcomes
“AI getting-everything-we-want” (CEV)
Davidad calls this the Deontic Sufficiency Hypothesis.
If the hypothesis is true, it should be possible to de-pessimize and mitigate the urgent risk from AI without necessarily ensuring that AI creates actively positive outcomes, because safety only requires ensuring that actively harmful outcomes do not occur. Hopefully, this is easier than achieving “full alignment”.
Safety first! We can figure out the rest later.
Quotes from Davidad’s Open Agency Architecture plans
This is Davidad’s plan with the Open Agency Architecture (OAA).
A list of core AI safety problems and how I hope to solve them (2023 August)
1.1. First, instead of trying to specify “value”, instead “de-pessimize” and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk.
Davidad’s Bold Plan for Alignment: An In-Depth Explanation — LessWrong (2023 April)
Deontic Sufficiency Hypothesis: This hypothesis posits that it is possible to identify desiderata that are adequate to ensure the model doesn’t engage in undesirable behavior. Davidad is optimistic that it’s feasible to find desiderata ensuring safety for a few weeks before a better solution is discovered, making this a weaker approach than solving outer alignment. For instance, Davidad suggests that even without a deep understanding of music, you can be confident your hearing is safe by ensuring the sound pressure level remains below 80 decibels. However, since the model would still be executing a pivotal process with significant influence, relying on a partial solution for decades could be risky.
Getting traction on the deontic feasibility [sic] hypothesis
Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don’t die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don’t die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
An Open Agency Architecture for Safe Transformative AI (2022 December)
Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in (−∞, 0], such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible with high probability, using scientifically-accessible technologies.
I am optimistic about this largely because of recent progress toward formalizing a natural abstraction of boundaries by Critch and Garrabrant. I find it quite plausible that there is some natural abstraction property of world-model trajectories that lies somewhere strictly within the vast moral gulf of
AI Neorealism: a threat model & success criterion for existential safety (2022 December)
For me the core question of existential safety is this:
It is not, for example, “how can we build an AI that is aligned with human values, including all that is good and beautiful?” or “how can we build an AI that optimises the world for whatever the operators actually specified?” Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).
How to formalize safety?
If the deontic sufficiency hypothesis is true, there should be an independent/separable way to formalize what “safety” is. This is why I think boundaries/membranes could be helpful for AI safety: See Agent membranes and formalizing “safety”.
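As a rough illustration of what a separable safety specification might look like, here is a minimal Python sketch. It is not Davidad's formalism or anything from the OAA documents. It only assumes that a "safety feature" is a function from a finite trajectory to a non-positive score, where 0 means "no violation", and it borrows the 80-decibel hearing example quoted above. The names `SafetyFeature`, `peak_sound_feature`, `is_safe`, and the `sound_db` state field are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical representation: a trajectory is a finite sequence of
# world-model states, each a dict of observable quantities.
Trajectory = Sequence[dict]


@dataclass(frozen=True)
class SafetyFeature:
    """A named feature scoring a trajectory with a value <= 0 (assumption).

    A score of 0 means the trajectory does not violate this desideratum;
    more negative scores mean a larger violation.
    """
    name: str
    score: Callable[[Trajectory], float]


def peak_sound_feature(limit_db: float = 80.0) -> SafetyFeature:
    """Feature for the hearing example: penalize sound above limit_db."""
    def score(trajectory: Trajectory) -> float:
        # Loudest state in the trajectory; an empty trajectory has no sound.
        peak = max((state["sound_db"] for state in trajectory),
                   default=float("-inf"))
        # 0 when the peak is at or below the limit, negative otherwise.
        return min(0.0, limit_db - peak)
    return SafetyFeature("sound_below_limit", score)


def is_safe(trajectory: Trajectory,
            features: Sequence[SafetyFeature],
            tolerance: float = 1e-9) -> bool:
    """Safe iff every feature is (within tolerance) saturated at 0.

    No notion of "value" or "what we want" appears here: the check only
    rules out violations, which is the separability the hypothesis needs.
    """
    return all(f.score(trajectory) >= -tolerance for f in features)
```

For example, a trajectory whose loudest state is 75 dB passes `is_safe` with `peak_sound_feature()`, while one that reaches 95 dB fails, and neither check says anything about whether the music was any good.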
Thanks to Jonathan Ng for reviewing a draft of this post and to Alexander Gietelink Oldenziel for encouraging me to post it.
Note that Davidad has not reviewed or verified this post.
For the record, as this post mostly consists of quotes from me, I can hardly fail to endorse it.
If we get TAI in the next decade or so, it will almost certainly contain an LLM, at least as a component. Human values are complex and fragile, and we spend a huge amount of our time writing about them: roughly half the Dewey Decimal system consists of many different subfields of “How to Make Humans Happy 101”, including virtually all of the soft sciences (Anthropology, Medicine, Ergonomics, Economics…), arts, and crafts. Current LLMs have read tens of trillions of tokens of our content, including terabytes of this material, and as a result even GPT-4 (definitely less than TAI) can do a pretty good job of answering moral questions and commenting on possible undesirable side effects and downsides of plans. So if we have sufficient control of our TAI to ensure that it is extremely unlikely to kill us all, then presumably we can also tell it “also don’t do anything that your LLM says is a bad idea or we wouldn’t like, at least not without checking carefully with us first”, and get a passable take on human values and impact regularization as well. So if we have enough control to block your red arrow, we can also take at least a passable first cut at the green arrow as well. Which by itself probably isn’t enough to stand up to many bits of optimization pressure without Goodharting, but is a lot better than ignoring the green arrow entirely. Also, any TAI that can do STEM can understand and avoid Goodharting.
I agree that just not killing everyone is a much easier problem. Consider zoos: the manual for “How Not to Kill Everything in Your Care: The Orangutan Edition” is probably only a few hundred pages or less, and has a significant overlap with the corresponding editions for all of the other primates, including Homo sapiens. However, LLMs can handle datasets vastly larger than that, so this compactness is only relevant if you’re trying to add some sort of mathematical or software framework on top of it that can handle O(100kB) of data, but not terabytes.
The more recent Safeguarded AI document has some parts that seem to me to go against my earlier interpretation, which was along the lines of this post.
Namely, that davidad’s proposal was not “CEV full alignment on AI that can be safely scaled without limit” but rather “sufficient control of AI that is as little more powerful as possible than sufficiently powerful for ethical global non-proliferation”.
In other words:
A) “this doesn’t guarantee a positive future but buys us time to solve alignment”
B) “a sufficiently powerful superintelligence would blow right through these constraints but they hold at the power level we think is enough for A”, thus implying “we also need boundedness somehow”.
The Safeguarded AI document says this though:
and
I’m probably missing something, but that seems to imply a claim that the control approach would be resilient against arbitrarily powerful misaligned AI?
A related thing I’m confused about is the part that says:
Whereas I previously thought that the point of the proposal was to create AI powerful-enough and controlled-enough to ethically establish global non-proliferation (so that “potential future rogue AIs” wouldn’t exist in the first place), it now seems to go in the direction of Good(-enough) AI defending against potential Bad AI?
The “unconstrained AI” in this sentence seems to be about how much value would be achieved from adoption of the safe/constrained design versus the counterfactual value of mainstream/unconstrained AI. My mistake.
The “constrained” still seems to refer to whether there’s a “box” around the AI, with all output funneled through formal verification checks on their predicted consequences. It does not seem to refer to a constraint on the “power level” (“boundedness”) of the AI within the box.