Agent membranes/boundaries and formalizing “safety”

Chipmonk, Jan 3, 2024, 5:55 PM

Boundaries / Membranes [technical], AI, Human Values

It might be possible to formalize what it means for an agent/moral patient to be safe via membranes/boundaries. This post tells one (just one) story for how the membranes idea could be useful for thinking about existential risk and AI safety.

Formalizing “safety” using agent membranes

A few examples:

A bacterium uses its membrane to protect its internal processes from external influences.

AI-generated image: microscope view of an amoeba

A nation maintains its sovereignty by defending its borders.

A human protects their mental integrity by selectively filtering the information that comes in and out of their mind.

AI-generated image: a brain surrounded by a fence

A natural abstraction for agent safety?

Agent boundaries/membranes seem to be a natural abstraction representing the safety and autonomy of agents.

A bacterium survives only if its membrane is preserved.
A nation maintains its sovereignty only if its borders aren’t invaded.
A human mind maintains mental integrity only if it can hold off informational manipulation.

Maybe the safety of agents could be largely formalized as the preservation of their membranes.

Distinct from preferences!

Boundaries are also cool because they show a way to respect agents without needing to talk about their preferences or utility functions. Andrew Critch has said the following about this idea:

my goal is to treat boundaries as more fundamental than preferences, rather than as merely a feature of them. In other words, I think boundaries are probably better able to carve reality at the joints than either preferences or utility functions, for the purpose of creating a good working relationship between humanity and AI technology («Boundaries» Sequence, Part 3b)

For instance, respecting the boundary of a bacterium would probably mean “preserving or not disrupting its membrane” (as opposed to knowing its preferences and satisfying them).

Protecting agents and infrastructure

By formalizing and preserving the important boundaries in the world, we could be in a better position to protect humanity from AI threats. Examples:

Critical computing infrastructure could be secured by creating strong boundaries around it. These boundaries could be enforced with cryptography and formal methods, such that only the subprocesses that need read and/or write access to a particular resource (like memory) hold the encryption keys to do so.
- Related: Object-capability model, Principle of least privilege, Evan Miyazono’s Atlas Computing, Davidad’s Open Agency Architecture.
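To make the least-privilege idea concrete, here is a minimal sketch in the object-capability style. Everything in it (`Resource`, `Capability`, `attenuate`, `AccessDenied`) is a hypothetical name invented for illustration. Python cannot actually enforce such a boundary, so this only shows the access pattern; real enforcement would need the cryptography or formal methods mentioned above.

```python
class AccessDenied(Exception):
    """Raised when a capability is used for a right it does not grant."""


class Resource:
    """Hypothetical protected resource, e.g. a region of memory."""

    def __init__(self):
        self._data = {}


class Capability:
    """A token granting specific rights to exactly one resource."""

    def __init__(self, resource, can_read=False, can_write=False):
        self._resource = resource
        self._can_read = can_read
        self._can_write = can_write

    def read(self, key):
        if not self._can_read:
            raise AccessDenied("capability does not grant read access")
        return self._resource._data.get(key)

    def write(self, key, value):
        if not self._can_write:
            raise AccessDenied("capability does not grant write access")
        self._resource._data[key] = value

    def attenuate(self, can_read=True, can_write=True):
        """Return a weaker capability: rights can be removed, never added."""
        return Capability(
            self._resource,
            can_read=self._can_read and can_read,
            can_write=self._can_write and can_write,
        )


# The owner holds full rights; a subprocess that only needs to read
# is handed a read-only capability derived from the owner's.
memory = Resource()
owner = Capability(memory, can_read=True, can_write=True)
reader = owner.attenuate(can_write=False)

owner.write("x", 1)
value_seen_by_reader = reader.read("x")
```

The design choice worth noticing is that a subprocess never looks a resource up by name. It can only act through the capability it was handed, so the set of capabilities in circulation is the boundary.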

It may be possible to specify or agree on a minimal “membrane” for each agent/moral patient that humanity values, such that when each membrane is preserved, that agent largely stays safe and maintains its autonomy over the inside of its membrane.
- If your physical boundary isn’t violated, you don’t die. If your mental boundary isn’t violated, you aren’t manipulated. Etc.
- Note: if your membrane is preserved, this just means that you stay safe and maintain autonomy over everything within your membrane. It does not necessarily mean that actively positive outcomes occur in the outside world. This is all about bare-minimum safety.
- See this thread below and davidad’s comment

Similarly, it may be possible to formalize and enforce the boundaries of physical property rights.

This is for safety, not full alignment

Note that this is only about specifying safety, not full alignment.

See: Safety First: safety before full alignment. The deontic sufficiency hypothesis.

Caveats

I don’t think the absence of membrane piercing formalizes all of safety, but I think it gets at a good chunk of what “safety” should mean. Future thinking will have to determine what more is required.

What are examples of violations of agent safety that do not involve membrane piercing?

Markov blankets

How might membranes/boundaries be formalized mathematically? Markov blankets seem to be a fitting abstraction.
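For reference, the standard Markov blanket condition can be written as a conditional independence statement. The symbols are introduced here for illustration and are not from the post: $V$ is the agent's internal state, $B$ is the boundary/membrane, and $E$ is the environment.

```latex
V \perp E \mid B
\quad\Longleftrightarrow\quad
p(v, e \mid b) = p(v \mid b)\, p(e \mid b)
\quad \text{for all } v, e, b \text{ with } p(b) > 0
```

Read informally: once the state of the boundary is known, the environment carries no further information about the interior, and vice versa.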

Diagram:

Notice that there are no arrows directly between the agent and its environment. Ideally, all influence from one to the other flows through the boundary/membrane (e.g.: your skin).

In which case,

Infiltration of information across this Markov blanket measures membrane piercing, and low infiltration indicates the absence of such piercing.
(And it may also be useful to keep track of exfiltration across the Markov blanket?^[1])
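As a rough illustration of what measuring infiltration could look like, the sketch below computes the conditional mutual information I(V; E | B) on a toy discrete distribution, where V is the agent's interior, E the environment, and B the boundary. The choice of this particular measure, the variable names, and the function name are all assumptions made for illustration. The linked distillation may define infiltration differently (for example, across time steps).

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint):
    """Return I(V; E | B) in bits.

    `joint` maps (v, e, b) tuples to probabilities that sum to 1.
    """
    p_b = defaultdict(float)
    p_vb = defaultdict(float)
    p_eb = defaultdict(float)
    for (v, e, b), p in joint.items():
        p_b[b] += p
        p_vb[(v, b)] += p
        p_eb[(e, b)] += p

    total = 0.0
    for (v, e, b), p in joint.items():
        if p > 0:
            ratio = p * p_b[b] / (p_vb[(v, b)] * p_eb[(e, b)])
            total += p * math.log2(ratio)
    return total


# Toy case 1: the environment bit is copied into the boundary, and the
# interior only reads the boundary. All influence flows through B.
screened = conditional_mutual_information({
    (0, 0, 0): 0.5,
    (1, 1, 1): 0.5,
})  # 0 bits: no infiltration beyond what the boundary carries

# Toy case 2: the boundary is constant, yet the interior still tracks
# the environment, so influence must be bypassing the boundary.
pierced = conditional_mutual_information({
    (0, 0, 0): 0.5,
    (1, 1, 0): 0.5,
})  # 1 bit: the membrane is pierced
```

Under this toy measure, a value near zero corresponds to "all influence flows through the boundary," and larger values indicate more information crossing without the boundary accounting for it.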

For more details, see distillation Formalizing «Boundaries» with Markov blankets.

Also, there are probably other information-theoretic measures that are useful for formalizing membranes/boundaries.

Protecting agent membranes/boundaries

See: Protecting agent boundaries.

Subscribe to the boundaries/membranes LessWrong tag to get notified of new developments.

Thanks to Jonathan Ng, Alexander Gietelink Oldenziel, Alex Zhu, and Evan Miyazono for reviewing a draft of this post.

[1] Exfiltration, i.e.: privacy and the absence of mind-reading. But I need to think more about this. Related section: “Maintaining Boundaries is about Maintaining Free Will and Privacy” by Scott Garrabrant.
