This came out of the discussion you had with John Maxwell, right?
Sort of? That was one significant factor which made me write it up now, and there’s definitely a lot of overlap. But this isn’t intended as a response to, or continuation of, that discussion; it’s a standalone piece, and I don’t think I specifically address his thoughts from that conversation.
A lot of the material is ideas from the abstraction project, along with material from discussions with Rohin, both of which I’ve been meaning to write up for a while.
How do we know that the unsupervised learner won’t have learnt a large number of other embeddings closer to the proxy? If it has, then why should we expect human values to do well?
Two brief comments here. First, I claim that natural abstraction space is quite discrete (i.e. there usually aren’t many concepts very close to each other), though this is non-obvious and I’m not ready to write up a full explanation of the claim yet. Second, for most proxies there probably are natural abstractions closer to the proxy, because most simple proxies are really terrible. For instance, if our proxy is “things people say are ethical on Twitter”, then there’s probably some sort of natural abstraction involving signalling which is closer.
Assuming we get the chance to iterate, this is the sort of thing which people hopefully solve by trying stuff and seeing what works. (Not that I give that a super-high chance of success, but it’s not out of the question.)
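To make the “closer to the proxy” point concrete, here’s a toy sketch. Everything in it (the concept names, the vectors, the cosine-similarity framing) is made up purely for illustration, not part of the actual claim; the idea is just that if the proxy is mostly explained by a signalling-like concept, then that concept, rather than human values, is the nearest natural abstraction.

```python
# Toy illustration: treat each candidate concept as a point in the
# unsupervised learner's embedding space, and ask which one the proxy
# lands closest to. All names and vectors here are invented.
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Hypothetical embeddings for a few natural abstractions.
concepts = {
    "human_values": rng.normal(size=dim),
    "social_signalling": rng.normal(size=dim),
    "stated_approval": rng.normal(size=dim),
}

# A crude proxy ("things people say are ethical on Twitter"), modelled
# here as the signalling concept plus a little noise.
proxy = concepts["social_signalling"] + 0.1 * rng.normal(size=dim)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# On this toy model, the supervised stage latches onto whichever concept
# best explains the proxy labels, i.e. the nearest one.
nearest = max(concepts, key=lambda name: cosine(concepts[name], proxy))
print(nearest)  # "social_signalling", not "human_values"
```

Obviously this compresses away everything interesting about how the embeddings are actually learned; it’s only meant to show why a terrible proxy gets claimed by a nearby signalling-ish abstraction rather than by human values.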
Depending on what types the unsupervised learner provides to the supervised learner, the supervised learner may not be able to reach the proxy type, due to issues with NN learning processes.
Strongly agree with this, and your explanation is solid. Worth mentioning that we do have some universality results for neural nets, but it’s still the case that the neural net structure has implicit priors/biases which could make it hard to learn certain data structures. This is one of several reasons why I see “figuring out what sort-of-thing human values are” as one of the higher-expected-value subproblems on the theoretical side of alignment research.
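To make the two-stage setup we’re both pointing at concrete, here’s a minimal sketch. The encoder, the head, the synthetic data, and the proxy label are all stand-ins I made up; the only point is that the supervised stage trains a small head on top of frozen unsupervised features, so it can only express functions of whatever “types” the encoder happens to expose.

```python
# Illustrative sketch (all components invented): a frozen "unsupervised"
# encoder feeding a small supervised head. The head can only represent
# functions of the encoder's features, so if the proxy concept isn't
# expressible from those features, training the head won't reach it.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for features learned by the unsupervised stage.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
for p in encoder.parameters():
    p.requires_grad_(False)  # frozen: the supervised stage can't reshape it

# Supervised head: only sees the encoder's 8-dimensional features.
head = nn.Linear(8, 1)

# Synthetic stand-in for (observation, proxy label) pairs.
x = torch.randn(256, 32)
y = (x[:, 0] > 0).float().unsqueeze(1)  # a proxy the encoder may or may not expose

opt = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(head(encoder(x)), y)
    loss.backward()
    opt.step()

print(f"final proxy-fit loss: {loss.item():.3f}")
```

Whether the head ends up fitting the proxy well depends entirely on whether the encoder’s implicit priors happened to make that concept accessible, which is the “what sort-of-thing are human values” question in miniature.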
Based on what you’ve said in the comments, I’m guessing you’d say the various forms of corrigibility are natural abstractions. Would you say we can use the strategy you outline here to get “corrigibility by default”?
Regarding iterations, the common objection is that we’re introducing optimisation pressure. So we should expect the usual alignment issues anyway. Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
I’m not sure about whether corrigibility is a natural abstraction. It’s at least plausible, and if it is, then corrigibility by default should work under basically-similar assumptions.
Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
Basically, yes. We want the system to use its actual model of human values as a proxy for its objective, which is itself a proxy for human values. So the whole strategy will fall apart in situations where the system converges to the true optimum of its objective. But in situations where the system would use a proxy for its true optimum (e.g. weak optimization, or insufficient data to separate the proxy from the true optimum), the model of human values may be the best available proxy.