davidad comments on Agent membranes/boundaries and formalizing “safety”

davidad 7 Jan 2024 0:33 UTC
11 points
6
These are very good questions. First, two general clarifications:

A. «Boundaries» are not partitions of physical space; they are partitions of a causal graphical model that is an abstraction over the concrete physical world-model.

B. To “pierce” a «boundary» is to counterfactually (with respect to the concrete physical world-model) cause the abstract model that represents the boundary to increase in prediction error (relative to the best augmented abstraction that uses the same state-space factorization but permits arbitrary causal dependencies crossing the boundary).
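
One way to make clarification B concrete, as a rough sketch rather than a formalization given in this thread: the symbols $M$, $M^{+}$, $\mathrm{Err}$, $w_a$, and $w_{\neg a}$ below are illustrative assumptions. Let $M$ be the abstract model that represents the «boundary», $M^{+}$ the best augmented abstraction over the same state-space factorization with arbitrary cross-boundary dependencies allowed, and $w_a$, $w_{\neg a}$ the outcomes in the concrete physical world-model with and without action $a$.

```latex
\Delta(w) = \mathrm{Err}_{M}(w) - \mathrm{Err}_{M^{+}}(w),
\qquad
a \text{ pierces the boundary} \iff \Delta(w_a) > \Delta(w_{\neg a})
```

On this reading, an action with no counterfactual effect ($w_a = w_{\neg a}$) cannot pierce the «boundary», whatever the baseline prediction error already is.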

So, to your particular cases:
1. Probably not. There is no fundamental difference between sound and contact. Rather, the fundamental difference is between the usual flow of information through the senses and other flows of information that are possible in the concrete physical world-model but not represented in the abstraction. An interaction that pierces the membrane is one which breaks the abstraction barrier of perception. Ordinary speech acts do not. Only sounds which cause damage (internal state changes that are not well-modelled as mental states) or which otherwise exceed the “operating conditions” in the state space of the «boundary» layer (e.g. certain kinds of superstimuli) would pierce the «boundary».
2. Almost surely not. This is why, as an agenda for AI safety, it will be necessary to specify a handful of constructive goals, such as provision of clean water and sustenance and the maintenance of hospitable atmospheric conditions, in addition to the «boundary»-based safety prohibitions.
3. Definitely not. Omission of beneficial actions is not a counterfactual impact.
4. Probably. This causes prediction error because the abstraction of typical human spatial positions assumes that humans have substantial ability to move between nearby streets by simple locomotory action sequences. But if a human is already effectively imprisoned, then adding more concrete would not create additional/counterfactual prediction error.
5. Probably not. Provision of resources (that are within “operating conditions”, i.e. not “out-of-distribution”) is not a «boundary» violation as long as the human has the typical amount of control of whether to accept them.
6. Definitely not. Exploiting behavioural tendencies which are not counterfactually corrupted is not a «boundary» violation.
7. Maybe. If the ad’s effect on decision-making tendencies is well modelled by the abstraction of typical in-distribution human interactions, then using that channel does not violate the «boundary». Unprecedented superstimuli would, but the precedented patterns in advertising are already pretty bad. This is a weak point of the «boundaries» concept, in my view. We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all: any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected. Another approach is Mariven’s criterion for deception, but applying this criterion requires modelling human mental states as beliefs about the world (which is certainly not 100% scientifically accurate). I would like to see more work here, and more different proposed approaches.
- the gears to ascension 7 Jan 2024 5:52 UTC
  3 points
  0
  
  > Definitely not. Omission of beneficial actions is not a counterfactual impact.
  
  You’re sure this is the case even if the disease is about to violate the «boundary» and the cure will prevent that?
- the gears to ascension 7 Jan 2024 5:55 UTC
  2 points
  0
  
  > We need additional criteria for avoiding psychological harm, including superpersuasion. One is simply to forbid autonomous superhuman systems from communicating to humans at all
  
  Unfortunately this is probably not on the table, as they are currently being used as weapons in economic warfare between the USA, China, and everyone else. TikTok is primarily educational inside China. Advertisers have a direct incentive to violate «membranes». We need a way to use «membranes» that will, on the margin, help protect against anyone violating them, not just avoid doing so itself.
  - Chipmonk 7 Jan 2024 6:11 UTC
    1 point
    0
    He says a bit in this direction; see my other comment.
- Chipmonk 15 Jan 2024 20:51 UTC
  1 point
  0
  Here’s a tricky example I’ve been thinking about:
  Is a cell getting infected by a virus a boundary violation?
  What I think makes this tricky is that viruses generally don’t physically penetrate cell membranes. Instead, cells just “let in” some viruses (albeit against their better judgement).
  Then once you answer the above, please also consider:
  Is a cell taking in nutrients from its environment a boundary violation?
  I don’t know what makes this different from the virus example (at least as long as we’re not allowed to refer to preferences).
- Chipmonk 7 Jan 2024 6:10 UTC
  1 point
  0
  > any proposed actions which can be meaningfully interpreted by sandboxed human-level supervisory AIs as messages with nontrivial semantics could be rejected.
  I want to give a big +1 on preventing membrane piercing not just by having AIs respect membranes, but also by using technology to empower membranes to be stronger and better at self-defense.
- Chipmonk 7 Jan 2024 6:10 UTC
  1 point
  0
  Thanks for writing this! I largely agree (and the rest I need to think more about).
