I don’t see much hope in capturing a technical definition that doesn’t fall out of some sort of game theory, and even game theory won’t directly give us boundaries as a representation of respect for autonomy that’s useful for alignment, since that respect needs to apply to radically weaker parties.
Boundaries seem more like a landmark feature of human-like preferences, one that serves as a test case for whether toy models of preference are reasonable. If a moral theory insists on tiling the universe with something, it fails the test. An imperative to merge all agents fails the test, unless the merged agents end up essentially reconstructed. And with computronium, we’d need to look at the shape of the things it’s computing rather than at the computing substrate.
I think it’s plausible that the general concept of boundaries can be characterized somewhat independently of preferences, while boundary-preservation remains a property that agents mostly satisfy (discussion here; I’m very unsure about this). I see Critch’s definition as a first iteration of an operationalization of boundaries in this general, somewhat preference-independent sense.
But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a lot of problems, some of which I suspect might be easier to solve if we treat systems-with-high-boundaryness as a sort of primitive for the kind of thing that we can associate agency and preferences with in the first place.
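To gesture at what a graded, preference-independent notion of boundaryness could look like, here is a minimal illustrative sketch of my own (not Critch’s actual definition; the partition and the measure are assumptions made for illustration): split the world state at time $t$ into an inside $N_t$, a boundary $B_t$, and an environment $E_t$, and measure how much the environment influences the inside other than through the boundary, e.g. as the conditional mutual information

$$\mathrm{infiltration}_t \;=\; I\!\left(N_{t+1};\, E_t \,\middle|\, N_t, B_t\right).$$

A system has high boundaryness to the extent this quantity stays small over time: whatever the environment does to the inside is mediated by the boundary. Something of this shape could serve as the graded primitive gestured at above, with agency and preferences then attributed only to systems that score highly on it.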
There are two different points here: boundaries as a formulation of agency, and boundaries as a major component of human values (which might by itself be somewhat sufficient for some alignment purposes). In the first role, boundaries are an acausal norm that many agents end up adopting, so it’s natural to consider a notion of agency that implies boundaries (once the agent has had an opportunity for sufficient reflection). But boundaries in this role are probably open to arbitrary ruthlessness; they offer no respect for the autonomy of anyone the powers that be don’t sufficiently care about. Instead, boundaries would be a convenient primitive for describing interactions with other live players, a Schelling concept shared by agents in this sense.
The second role, as an aspect of values, expresses that the agent cares about the autonomy of others outside game-theoretic considerations, so it ties back to game theory only by similarity, or through the story of how such values formed, which involved game theory. A general definition might be useful here, if pointing AIs at it could instill it into their values. But technical definitions don’t seem to work once you consider what happens if you try to protect humanity’s autonomy with a boundary drawn according to such a definition. It’s like machine translation: the problem may well be well-defined, yet impossible to formally specify other than by gesturing at a learning process.