This came out of the discussion you had with John Maxwell, right? Does he think this is a good presentation of his proposal?
How do we know that the unsupervised learner won’t have learnt a large number of other embeddings closer to the proxy? If it has, then why should we expect human values to do well?
Some rough thoughts on the data type issue. Depending on what types the unsupervised learner provides to the supervised learner, it may not be able to reach the proxy type, owing to issues with NN learning processes.
Recall that data types can be viewed as spaces up to homotopy, and construction of new types can be viewed as generating new spaces from old ones, e.g. tangent spaces or path spaces. We can view a neural net as a type corresponding to a particular such space. But getting neural nets to learn certain functions is hard: for example, a function which is zero everywhere except on two subspaces A and B, where it takes different values, with A and B shaped like interlocked rings. In other words, a non-linear classification problem. So plausibly, neural nets have trouble constructing certain types from others. Maybe this depends on architecture or learning algorithm, maybe not.
If the proxy and human values have very different types, it may be that the supervised learner won’t be able to get from one type to the other. Supposing the unsupervised learner presents it with types “reachable” from human values, then the proxy which optimises performance on the data set is simply unavailable to the system, even though it’s relatively simple in comparison.
Because of this, it would be useful to check which simple homotopy types neural nets can move between. Depending on the results, we could use this as an argument that unsupervised NNs will never embed the human-values type, because we’ve found it has some simple properties a net won’t be able to construct de novo, unless we do something like feeding the unsupervised learner human biases, or starting with an em (whole brain emulation) and modifying it.
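As a very rough illustration of the kind of check I have in mind (this is only a sketch; the library choice, hyperparameters, and ring construction are my own assumptions, not anything established above): sample points from two interlocked rings in 3D, train a small feed-forward net to separate them, and then vary the width, depth, or architecture to see which such "shapes" a given net can actually learn.

```python
# Illustrative sketch only: two interlocked rings (a Hopf link) in 3D,
# labelled 0 and 1, used to probe what a small feed-forward net can learn.
# The use of sklearn's MLPClassifier and all hyperparameters are assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 2000
theta = rng.uniform(0.0, 2.0 * np.pi, n)

# Ring A: unit circle in the xy-plane, centred at the origin.
ring_a = np.stack([np.cos(theta), np.sin(theta), np.zeros(n)], axis=1)
# Ring B: unit circle in the xz-plane, shifted along x so the two rings link.
ring_b = np.stack([1.0 + np.cos(theta), np.zeros(n), np.sin(theta)], axis=1)

# Add a little noise so the classes are thin tubes rather than exact curves.
X = np.concatenate([ring_a, ring_b]) + rng.normal(scale=0.05, size=(2 * n, 3))
y = np.concatenate([np.zeros(n), np.ones(n)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Vary the width/depth here; the question is which architectures separate the link.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```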
Does he think this is a good presentation of his proposal?
I’m very glad johnswentworth wrote this, but there are a lot of little details where we seem to disagree—see my other comments in this thread. There are also a few key parts of my proposal not discussed in this post, such as active learning and using an ensemble to fight Goodharting and be more failure-tolerant. I don’t think there’s going to be a single natural abstraction for “human values” like johnswentworth seems to imply with this post, but I also think that’s a solvable problem.
This came out of the discussion you had with John Maxwell, right?
Sort of? That was one significant factor which made me write it up now, and there’s definitely a lot of overlap. But this isn’t intended as a response to or continuation of that discussion; it’s a standalone piece, and I don’t think I specifically address his thoughts from that conversation.
A lot of the material is ideas from the abstraction project, along with material from discussions with Rohin, that I’ve been meaning to write up for a while.
How do we know that the unsupervised learner won’t have learnt a large number of other embeddings closer to the proxy? If it has, then why should we expect human values to do well?
Two brief comments here. First, I claim that natural abstraction space is quite discrete (i.e. there usually aren’t many concepts very close to each other), though this is nonobvious and I’m not ready to write up a full explanation of the claim yet. Second, for most proxies there probably are natural abstractions closer to the proxy, because most simple proxies are really terrible—for instance, if our proxy is “things people say are ethical on twitter”, then there’s probably some sort of natural abstraction involving signalling which is closer.
Assuming we get the chance to iterate, this is the sort of thing which people hopefully solve by trying stuff and seeing what works. (Not that I give that a super-high chance of success, but it’s not out of the question.)
Depending on what types the unsupervised learner provides to the supervised learner, it may not be able to reach the proxy type, owing to issues with NN learning processes.
Strongly agree with this, and your explanation is solid. Worth mentioning that we do have some universality results for neural nets, but it’s still the case that the neural net structure has implicit priors/biases which could make it hard to learn certain data structures. This is one of several reasons why I see “figuring out what sort-of-thing human values are” as one of the higher-expected-value subproblems on the theoretical side of alignment research.
Based on what you’ve said in the comments, I’m guessing you’d say the various forms of corrigibility are natural abstractions. Would you say we can use the strategy you outline here to get “corrigibility by default”?
Regarding iteration, the common objection is that iterating introduces optimisation pressure, so we should expect the usual alignment issues anyway. Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
I’m not sure about whether corrigibility is a natural abstraction. It’s at least plausible, and if it is, then corrigibility by default should work under basically-similar assumptions.
Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
Basically, yes. We want the system to use its actual model of human values as a proxy for its objective, which is itself a proxy for human values. So the whole strategy will fall apart in situations where the system converges to the true optimum of its objective. But in situations where the system would use a proxy for its true optimum (e.g. weak optimization, or insufficient data to distinguish the proxy from the true objective), the model of human values may be the best available proxy.