1. There is a procedure/algorithm which doesn’t seem biased towards a particular value system such that a class of AI systems that implement it end up having a common set of values, and they endorse the same values upon reflection.
2. This set of values might have something in common with what we, humans, call values.
If 1 and 2 seem at least plausible or conceivable, why can’t we use them as a basis to design aligned AI? Is it because of skepticism towards 1 or 2?
It seems very hard for me to imagine how one could create a procedure that wasn’t biased towards a particular value system. E.g. Stuart Armstrong has written about how humans can be assigned any values whatsoever—you have to decide which parts of their behavior are due to genuine preferences and which are due to irrationality, and what values that implies. And the way you decide what’s correct behavior and what’s irrationality seems like the kind of choice that will depend on your own values. Even something like “this seems like the simplest way of assigning preferences” presupposes that it is valuable to pick a procedure based on its simplicity—though the post argues that even simplicity would fail to distinguish between several alternative ways of assigning preferences.
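To make that concrete, here’s a toy sketch of the decomposition problem (my own construction with made-up states, actions, and rewards, not Armstrong’s actual formalism): two mutually incompatible (planner, reward) pairs that reproduce exactly the same observed behavior, so the behavior alone can’t tell you which one captures the agent’s “real” values.

```python
# Toy illustration: the same observed behaviour is equally consistent
# with very different (rationality model, reward) pairs.

ACTIONS = ["a", "b"]
observed_policy = {"s1": "a", "s2": "a"}   # in every state the agent picks "a"

def plan(planner_sign, reward):
    """Map a (planner, reward) pair to a policy.
    planner_sign = +1: fully rational, picks the higher-reward action;
    planner_sign = -1: fully anti-rational, picks the lower-reward action."""
    pick = max if planner_sign > 0 else min
    return {state: pick(ACTIONS, key=lambda act: reward[act])
            for state in observed_policy}

# Two incompatible value assignments...
candidates = [
    (+1, {"a": 1.0, "b": 0.0}),   # "the agent genuinely prefers a"
    (-1, {"a": 0.0, "b": 1.0}),   # "the agent wants b but is fully irrational"
]

# ...that both reproduce the observed behaviour exactly.
for planner_sign, reward in candidates:
    assert plan(planner_sign, reward) == observed_policy
```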
Of course, just because we can’t be truly unbiased doesn’t mean we couldn’t be less biased, so maybe something like “pick the simplest system that produces sensible agents, breaking ties at random” could arguably be the least biased alternative. But human values seem quite complex; if there were some simple and unbiased solution that produced convergent values in all AIs that implemented it, it might well have something in common with what we call values, but that’s not a very high bar. There’s a sense in which all bacteria share the same goal, “making more (surviving) copies of yourself is the only thing that matters”, and I’d expect the convergent value system to end up being something like that. That has some resemblance to human values, since many humans also care about having offspring, but not very much.
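A toy continuation of the sketch above (same made-up candidates, and a deliberately crude stand-in for a real simplicity measure): even scoring the decompositions by simplicity can leave them tied, so “pick the simplest, breaking ties at random” ends up choosing between genuinely different value assignments arbitrarily.

```python
import random

candidates = [
    (+1, {"a": 1.0, "b": 0.0}),   # "genuinely prefers a"
    (-1, {"a": 0.0, "b": 1.0}),   # "wants b, fully anti-rational"
]

def crude_complexity(planner_sign, reward):
    # Stand-in for a real simplicity measure (Kolmogorov complexity is
    # uncomputable anyway): the length of one particular string encoding.
    return len(f"{planner_sign:+d}{sorted(reward.items())}")

scores = [crude_complexity(p, r) for p, r in candidates]
assert scores[0] == scores[1]       # the simplicity proxy can't separate them
print("Tie-broken choice:", random.choice(candidates))
```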
How do you feel about:
1. There is a procedure/algorithm which doesn’t seem biased towards a particular value system such that a class of AI systems that implement it end up having a common set of values, and they endorse the same values upon reflection.
2. This set of values might have something in common with what we, humans, call values.
If 1 and 2 seem at least plausible or conceivable, why can’t we use them as a basis to design aligned AI? Is it because of skepticism towards 1 or 2?
The “might” in 2 implies a “might not”.