But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.
This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.
When I think of the fragility argument, I usually think in terms of Goodhart’s Taxonomy. In particular, we might deal with:
Extremal Goodhart—Human values are already unusually well-satisfied relative to what is normal for this universe, and pushing proxies of our values to the extremes might inadvertently move the universe away from that in some way we didn’t consider (a toy sketch of this failure mode follows the list).
Adversarial Goodhart—The thing that matters which is absent from our proxy is absolutely critical for satisfying our values, and requires the same kinds of resources that our proxy relies on.
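To make the Extremal Goodhart worry concrete, here is a minimal numerical sketch (my own toy example, with invented functions and numbers, not anything from the original discussion): a proxy that tracks true value well over the "normal" range of states loses almost everything once an optimizer pushes it far outside that range.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # True value is high only in a narrow "normal" band and falls off sharply
    # outside it (a stand-in for how human-compatible states occupy a tiny
    # region of configuration space).
    return np.exp(-((x - 1.0) ** 2))

def proxy(x):
    # The proxy agrees with true value locally: "more x is better" holds over
    # the normal range [0, 1], so it looks well-calibrated there.
    return x

# Over the normal range the proxy and the true value are strongly correlated...
xs_normal = rng.uniform(0.0, 1.0, 10_000)
print(np.corrcoef(proxy(xs_normal), true_value(xs_normal))[0, 1])  # roughly 0.97

# ...but an optimizer that pushes the proxy to its extreme leaves the band
# where that correlation held, and the true value collapses.
x_opt = 10.0                          # argmax of the proxy over [0, 10]
print(true_value(x_opt))              # ~0: almost all true value is lost
print(true_value(xs_normal).mean())   # the unoptimized baseline does far better
```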
My impression is that our values are complex enough that they have a lot of distinct, absolutely critical pieces that are hard to pin down even if you try really hard. I mainly think this because I once tried imagining how to make an AGI that optimizes for ‘fulfilling human requests’ and realized that ‘fulfill’, ‘human’, and ‘request’ all had such complicated and fragile definitions that it would take me an extremely long time to pin down what I meant. And I wouldn’t be confident in the result I ended up with after pinning things down.
While I don’t find this kind of argument fully convincing, I think it’s more powerful than ‘a) based on a few examples of discrepancies between written-down values and real values where the written-down values entirely exclude something’.
That being said, I agree with b). I also lean toward the view that Slow Take-Off plus Machine-Learning may allow non-catastrophic “good enough” solutions to human value problems.
My guess is that values that are got using ML but still somewhat off from human values are much closer in terms of not destroying all value of the universe, than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forget to put in, ‘consciousness is good’) are like forgetting to say faces have nostrils in trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).
I agree that Machine-Learning will probably give us better approximations of human flourishing than trying to write our values down ourselves. However, I’m still very apprehensive about it unless we’re also being very careful about slow take-off. The main reasons for this apprehensiveness come from Rohin Shah’s sequence on Value Learning (particularly the posts on ambitious value learning). My main take-away from it was: learning human values from examples of humans is hard without writing down some extra assumptions about human values (which may leave something important out).
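Here is a minimal sketch of why extra written-down assumptions seem unavoidable (my own toy example, not something from Rohin Shah’s sequence): the same observed behavior can be explained equally well by very different (reward, rationality-model) pairs, so the data alone can’t tell a value learner which reward to adopt; the tiebreaker has to be an assumption we supply.

```python
# Toy observation: across many choice situations a human always picks
# option "a" over option "b".
observed_choices = [("a", "b", "a")] * 20  # (option_1, option_2, what_was_chosen)

# Hypothesis 1: the human is rational and genuinely values "a" more.
reward_1 = {"a": 1.0, "b": 0.0}
def rational_planner(reward, x, y):
    return x if reward[x] >= reward[y] else y

# Hypothesis 2: the human actually values "b" more, but their "planner" is
# systematically anti-rational and always picks the option they value less.
reward_2 = {"a": 0.0, "b": 1.0}
def anti_rational_planner(reward, x, y):
    return x if reward[x] < reward[y] else y

def explains(reward, planner, data):
    # A hypothesis "explains" the data if it predicts every observed choice.
    return all(planner(reward, x, y) == chosen for x, y, chosen in data)

print(explains(reward_1, rational_planner, observed_choices))       # True
print(explains(reward_2, anti_rational_planner, observed_choices))  # True
# Both (reward, planner) pairs fit the behavior perfectly; behavior alone can't
# tell the learner which reward is "the human's values", so only an assumption
# we write down about the planner (i.e. about human rationality) breaks the tie.
```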
Here’s a practical example of this: If you create an AI that learns human values from a lot of examples of humans, what do you think its stance will be on Person-Affecting Views? What will its stance be on value-lexicality responses to Torture vs. Dust-Specks? My impression is that you’ll have to write down something to tell the AI how to decide these cases (when should we categorize human behaviors as irrational, and when should we not?). And a lot of people may regard the ultimate decision as catastrophic.
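As a concrete illustration of how much hangs on whatever we write down (my own toy numbers, not anyone’s actual proposal): the Torture vs. Dust-Specks verdict flips entirely depending on whether the AI was handed a simple additive aggregation rule or a lexical-threshold rule.

```python
# Two candidate decision rules an AI could have been given for the
# Torture vs. Dust-Specks case. All numbers here are made up for illustration.
SPECK_DISUTILITY = 1e-9
TORTURE_DISUTILITY = 1e7
N_SPECKS = 1e30   # stand-in for an astronomically large number of people

def aggregating_rule():
    # Simple additive aggregation: pick whichever option has less total disutility.
    specks_total = SPECK_DISUTILITY * N_SPECKS
    return "torture" if TORTURE_DISUTILITY < specks_total else "specks"

def lexical_rule():
    # Value lexicality: harms above some severity threshold are treated as
    # incomparably worse than any number of sub-threshold harms.
    THRESHOLD = 1.0
    if TORTURE_DISUTILITY > THRESHOLD and SPECK_DISUTILITY <= THRESHOLD:
        return "specks"
    return aggregating_rule()

print(aggregating_rule())  # "torture" (it minimizes total disutility)
print(lexical_rule())      # "specks" (the above-threshold harm is never permitted)
```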
There are other complications too. If the AI can interact with the world in ways that change human values, and then updates to care about those changed values, strange things might happen. For instance, if it’s agential, the AI might pressure humanity to adopt simpler, easier-to-learn values. This might not be so bad, but I suspect there are things the AI might do along these lines that could be very bad.
So, because I’m not that confident in ML value-learning and because I’m not that confident in human values in general, I’m pretty skeptical of the idea that machine learning will avert the extreme risks associated with value misspecification.