(Mostly just stating my understanding of your take back at you to see if I correctly got what you’re saying:)
I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely the overall system is to be aligned (obviously, in the limit where you have only one aligned thing, the system consisting of just that thing is aligned); and also the more modular each component is (or, I guess you would say, the better the interfaces between the components), the more likely it is to be aligned. And in particular, if the interfaces are good and have few weird interactions, then you can probably have a pretty big network of components without it implementing something egregiously misaligned (like actually secretly plotting to kill everyone).
And people who are optimistic about HCH-like things generally believe that language is a good interface, so conditional on that it makes sense to think that trees of humans would not implement egregiously misaligned cognition; whereas you’re less optimistic about this, and so your research agenda is trying to pin down the general theory of Where Good Interfaces/Abstractions Come From, or something else more deconfusion-y along those lines.
Does this seem about right?
Good description.
Also I had never actually floated the hypothesis that “people who are optimistic about HCH-like things generally believe that language is a good interface” before; natural language seems like such an obviously leaky and lossy API that I had never actually considered that other people might think it’s a good idea.
Natural language is lossy because the communication channel is narrow, hence the need for a lower-dimensional representation (see ML embeddings) of what we’re trying to convey. Lossy representations are also what Abstractions are about (a toy numerical sketch of that lossiness is below).
But in practice, you expect Natural Abstractions (if discovered) cannot be expressed in natural language?
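(As a rough illustration of the “narrow channel forces lossy representations” point above, nothing load-bearing: the sketch below uses a PCA-style projection as a stand-in for an embedding, with made-up dimensions, and just shows that squeezing messages through a low-dimensional bottleneck discards a measurable fraction of the information.)

```python
# Toy sketch: forcing information through a narrow channel (a low-dimensional
# representation) necessarily discards detail. PCA via SVD stands in for "an
# embedding"; all dimensions and data here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

# "Messages": 1000 points in a 50-dimensional space.
X = rng.normal(size=(1000, 50))
X -= X.mean(axis=0)  # center the data, as PCA assumes

# Narrow channel: keep only the top k principal directions.
k = 5
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T    # k-dimensional "embedding" of each message
X_hat = Z @ Vt[:k]  # best linear reconstruction from that embedding

# The reconstruction error is the information the channel threw away.
err = np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2
print(f"fraction of variance lost through the {k}-dim channel: {err:.2f}")
```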
I expect words are usually pointers to natural abstractions, so that part isn’t the main issue—e.g. when we look at how natural language fails all the time in real-world coordination problems, the issue usually isn’t that two people have different ideas of what “tree” means. (That kind of failure does sometimes happen, but it’s unusual enough to be funny/notable.) The much more common failure mode is that a person is unable to clearly express what they want—e.g. a client failing to communicate what they want to a seller. That sort of thing is one reason why I’m highly uncertain about the extent to which human values (or other variations of “what humans want”) are a natural abstraction.