Interesting! I’m still concerned that, since you need to aggregate these things in the end anyhow (because everything is commensurable in the metric of affecting decisions), the aggregation function is going to be allowed to be very complicated and dependent on factors that don’t respect the separation of this trichotomy.
But it does make me consider how one might try to import this into value learning. I don’t think it would work to take these categories as given and then try to learn meta-preferences to sew them together, but most (particularly more direct) value learning schemes have to start with some “seed” of examples. If we draw that seed only from “approving,” does that mean that the trained AI isn’t going to value wanting or liking enough? Or would everything probably be fine, because we wouldn’t approve of bad stuff?