If we regularise human values, what would happen?
We want AI to adopt human values in a way that generalizes out of distribution better than its capabilities do. In other words, if our AI meets friendly aliens, we want its gun skills to fail before its morals.
Human values—as expressed by actions—seem messy, inconsistent, and hypocritical. If we forced an AI to regularise them, would it become wiser, more consistent, and simply good? Or would it oversimplify them into a brittle set of values?
I would predict that a simplified, more consistent form of human values would extrapolate better to new situations. But if they are simplified too far, they will become brittle and fail to generalize or perform as we want. I expect quite a lot of simplification can be done initially, and I expect the results to be interesting and helpful.
Perhaps we could even remove the parts of human values that serve an individual's advantage. Stripping out self-preservation and self-advantage, then enforcing consistency, would yield an interesting set of values.
This would be an interesting experimental direction. An early experiment could make LLM answers conditional on a latent space that encodes values, much like image generators that condition on style. As a bonus, you would get a nice, regular space of values that you can examine and play with.
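To make the idea concrete, here is a minimal sketch of one way such an experiment could be set up: a frozen language model is conditioned on a low-dimensional "value latent" z by mapping z to a soft prefix of embeddings, in the spirit of prefix/prompt tuning. The latent dimensionality, prefix length, and the small MLP are all illustrative assumptions, not a tested recipe.

```python
# Sketch: condition a frozen LLM on a low-dimensional "value latent" z
# by mapping z to a learned soft prefix (prefix-tuning style).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class ValueConditionedLM(nn.Module):
    def __init__(self, lm_name="gpt2", latent_dim=8, prefix_len=10):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(lm_name)
        for p in self.lm.parameters():      # freeze the base model;
            p.requires_grad = False         # only the value mapping trains
        self.d_model = self.lm.config.n_embd
        self.prefix_len = prefix_len
        # Map a value latent z to a sequence of soft-prompt embeddings.
        self.value_to_prefix = nn.Sequential(
            nn.Linear(latent_dim, self.d_model),
            nn.Tanh(),
            nn.Linear(self.d_model, prefix_len * self.d_model),
        )

    def forward(self, input_ids, z, labels=None):
        tok_emb = self.lm.get_input_embeddings()(input_ids)          # (B, T, D)
        prefix = self.value_to_prefix(z).view(-1, self.prefix_len, self.d_model)
        inputs_embeds = torch.cat([prefix, tok_emb], dim=1)          # prepend prefix
        if labels is not None:
            # Ignore the prefix positions when computing the loss.
            pad = torch.full((labels.size(0), self.prefix_len), -100,
                             dtype=labels.dtype, device=labels.device)
            labels = torch.cat([pad, labels], dim=1)
        return self.lm(inputs_embeds=inputs_embeds, labels=labels)


# Usage: pick (or later, learn) a point in the hypothetical value space
# and condition the model's answer on it.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = ValueConditionedLM()
ids = tok("Should I return the lost wallet?", return_tensors="pt").input_ids
z = torch.zeros(1, 8)   # one point in the assumed 8-dimensional value space
out = model(ids, z, labels=ids)
print(out.loss)
```

Training the latent space itself (for example, fitting z per annotator or per value cluster, with some regulariser on the latent) is the open part of the experiment; the sketch only shows the conditioning mechanism.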
It’s an interesting thought.
I can see regularisation playing something of a role here, but it’s hard to say.
I would love to see a project here with philosophers and technical folk collaborating to make progress on this question.