One subtlety which approximately 100% of people I’ve talked to about this post apparently missed: I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that “human values” themselves are natural abstractions; values vary a lot more across cultures than e.g. agreement on “trees” as a natural category.
Relatedly, the author is too optimistic (IMO) in his comparison of this technique to alternatives: …
In the particular section you quoted, I’m explicitly comparing the best-case of abstraction by default to the other two strategies, assuming that the other two work out about-as-well as they could realistically be expected to work. For instance, learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can’t do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions.
Obviously alignment by default has analogous assumptions/flaws; much of the OP is spent discussing them. The particular section you quote was just talking about the best-case where those assumptions work out well.
The potential for malign hypotheses (learning of hypotheses / models containing malign subagents) exists in any learning system, and in particular malign simulation hypotheses are a serious concern. …
I partially agree with this, though I do think there are good arguments that malign simulation issues will not be a big deal (or to the extent that they are, they’ll look more like Dr Nefarious than pure inner daemons), and by historical accident those arguments have not been circulated in this community to nearly the same extent as the arguments that malign simulations will be a big deal. Some time in the next few weeks I plan to write a review of The Solomonoff Prior Is Malign which will talk about one such argument.
I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that “human values” themselves are natural abstractions
That’s fair, but it’s still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.
...learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can’t do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions.
This seems wrong to me. If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights. I guess you might be defining “IRL” as something very narrow, whereas I define it as “any method based on revealed preferences”.
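The shared-weights setup can be sketched concretely. This is a toy illustration only: the linear encoder, the synthetic data, the preference pairs, and the Bradley-Terry preference loss are all invented stand-ins, not anything proposed in the discussion above. The point is just that one encoder feeds both an unsupervised reconstruction head and a revealed-preference (reward) head, so both objectives shape the shared representation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))  # unlabeled observations (invented data)
# Revealed preferences: pairs (i, j) meaning the demonstrator chose
# observation i over observation j (also invented).
prefs = [(i, (i + 1) % 64) for i in range(0, 64, 2)]
I = [i for i, _ in prefs]
J = [j for _, j in prefs]

d = 3
params = {
    "enc": rng.normal(scale=0.1, size=(8, d)),  # shared encoder weights
    "dec": rng.normal(scale=0.1, size=(d, 8)),  # unsupervised head
    "rew": rng.normal(scale=0.1, size=d),       # IRL / reward head
}

def loss(p):
    Z = X @ p["enc"]                          # shared representation
    recon = ((Z @ p["dec"] - X) ** 2).mean()  # unsupervised objective
    r = Z @ p["rew"]                          # inferred reward
    # Bradley-Terry log-likelihood of the revealed choices.
    pref = -np.log(1.0 / (1.0 + np.exp(-(r[I] - r[J])))).mean()
    return recon + pref                       # both losses train the encoder

def grad(p, eps=1e-5):
    # Finite-difference gradient (slow, but dependency-free).
    g = {}
    for k, W in p.items():
        gk = np.zeros_like(W)
        it = np.nditer(W, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            old = W[idx]
            W[idx] = old + eps
            hi = loss(p)
            W[idx] = old - eps
            lo = loss(p)
            W[idx] = old
            gk[idx] = (hi - lo) / (2 * eps)
        g[k] = gk
    return g

# One descent step with a crude line search, so the combined loss
# cannot increase.
before = loss(params)
g = grad(params)
after = before
for lr in (0.1, 0.01, 0.001):
    cand = {k: params[k] - lr * g[k] for k in params}
    after = min(after, loss(cand))
print(before, after)  # after <= before
```

Because the encoder weights receive gradient from both heads, features learned unsupervised are available to the reward head "for free", which is the sample-complexity reduction described above.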
...to the extent that they are, they’ll look more like Dr Nefarious than pure inner daemons
Malign simulation hypotheses already look like “Dr. Nefarious” where the role of Dr. Nefarious is played by the masters of the simulation, so I’m not sure what exactly is the distinction you’re drawing here.
That’s fair, but it’s still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity.
Yup, that’s right. I still agree with your general understanding, just wanted to clarify the subtlety.
If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights.
Yup, I agree with all that. I was specifically talking about IRL approaches which try to learn a utility function, not the more general possibility space.
Malign simulation hypotheses already look like “Dr. Nefarious” where the role of Dr. Nefarious is played by the masters of the simulation, so I’m not sure what exactly is the distinction you’re drawing here.
The distinction there is about whether or not there’s an actual agent in the external environment which coordinates acausally with the malign inner agent, or some structure in the environment which allows for self-fulfilling prophecies, or something along those lines. The point is that there has to be some structure in the external environment which allows a malign inner agent to gain influence over time by making accurate predictions. Otherwise, the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent; it will end up with zero influence in the long run.
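The influence-decay dynamic can be made concrete with a toy Bayesian mixture. Everything here (the observation sequence, the two predictors, the deviation schedule) is an invented illustration, not a model of any real system: a "malign" predictor that periodically deviates from its best prediction loses posterior weight at every deviation and never gets it back, because on the remaining steps it merely matches the honest predictor:

```python
import math

# Fixed binary observation sequence, ~70% ones (invented data).
outcomes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0] * 10

def loglik(p, x):
    # Log-likelihood of outcome x under predicted P(x = 1) = p.
    return math.log(p if x == 1 else 1.0 - p)

log_w = {"honest": math.log(0.5), "malign": math.log(0.5)}  # equal priors
malign_posterior = []
for t, x in enumerate(outcomes):
    p_honest = 0.7                          # always the best prediction
    p_malign = 0.2 if t % 10 == 0 else 0.7  # periodically "spends" influence
    log_w["honest"] += loglik(p_honest, x)
    log_w["malign"] += loglik(p_malign, x)
    total = math.exp(log_w["honest"]) + math.exp(log_w["malign"])
    malign_posterior.append(math.exp(log_w["malign"]) / total)

print(malign_posterior[0], malign_posterior[-1])
```

Each deviation permanently shrinks the malign predictor's posterior weight; since it never out-predicts the honest model, the weight is non-increasing and goes to zero.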
...the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent
Of course, but this in itself is no consolation, because it can spend its finite influence to make the AI perform an irreversible catastrophic action: for example, self-modifying into something explicitly malign.
In e.g. IDA-type protocols you can defend by using a good prior (such as IB physicalism) plus confidence thresholds (i.e. every time the hypotheses have a major disagreement you query the user). You also have to do something about non-Cartesian attack vectors (I have some ideas), but that doesn’t depend much on the protocol.
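The confidence-threshold defense can be sketched as a deferral rule. This is a toy rendering of my own (the hypotheses, weights, credibility cutoff, and "preferred action" notion are all invented): every hypothesis with non-negligible posterior weight proposes an action, and if they disagree, the system queries the user instead of acting:

```python
def choose(hypotheses, weights, credible=0.05):
    """Each hypothesis is a dict mapping actions to values; weights are
    posterior weights. Act only if every credible hypothesis agrees on
    the best action; otherwise defer to the user."""
    preferred = {
        max(h, key=h.get)  # each hypothesis's preferred action
        for h, w in zip(hypotheses, weights)
        if w >= credible   # ignore hypotheses with negligible weight
    }
    return preferred.pop() if len(preferred) == 1 else "QUERY_USER"

h1 = {"a": 1.0, "b": 0.2}
h2 = {"a": 0.9, "b": 0.1}
h_malign = {"a": 0.0, "b": 1.0}  # a hypothesis pushing a different action

print(choose([h1, h2], [0.6, 0.4]))                  # agreement -> "a"
print(choose([h1, h2, h_malign], [0.5, 0.3, 0.2]))   # disagreement -> query
```

The "major disagreement" test here is the crudest possible one (differing argmax actions); a serious protocol would presumably compare predicted consequences rather than just action choices, but the structure is the same: a malign hypothesis can force queries, not actions.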
In value learning things are worse, because of the possibility of corruption (i.e. the AI hacking the user or its own input channels). As a consequence, it is no longer clear you can infer the correct values even if you make correct predictions about everything observable. Protocols based on extrapolating from observables to unobservables fail, because malign hypotheses can attack the extrapolation with impunity (e.g. a malign hypothesis can assign some kind of “Truman show” interpretation to the behavior of the user, where the user’s true values are completely alien and they are just pretending to be human because of the circumstances of the simulation).
It’s up.