I think we can’t solve the problem of aligning AI systems with human values unless we have a very fine-grained, nitty-gritty, psychologically realistic description of the whole range and depth of human values we’re trying to align with.
This would make sense to me if we wanted to explicitly code human values into an AI. But (afaik) no alignment researcher advocates this approach. Instead, some alignment research directions aim to implicitly describe human values, which does not require us to understand human values at a fine-grained, nitty-gritty level of detail.
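To make "implicitly describe" concrete, here is a minimal toy sketch in the spirit of preference-based reward modelling. It is my own illustration, not any particular researcher's method, and the data and weights are made up: nobody writes the values down anywhere; a scalar reward is simply fit so that outcomes humans prefer score higher.

```python
# Toy sketch (hypothetical data): learn an implicit "value" score from pairwise
# human preferences. No value is explicitly coded; a linear reward is fit so
# that preferred outcomes get higher scores (Bradley-Terry-style model).
import numpy as np

rng = np.random.default_rng(0)

n_features = 4
outcomes = rng.normal(size=(50, n_features))      # feature vectors of possible outcomes
true_w = np.array([1.0, -0.5, 0.0, 2.0])          # hidden "values" (unknown to the modeller)
utility = outcomes @ true_w

# Simulate noisy human comparisons: (preferred, dispreferred) index pairs.
comparisons = []
for _ in range(500):
    a, b = rng.integers(0, len(outcomes), size=2)
    p_prefer_a = 1 / (1 + np.exp(-(utility[a] - utility[b])))
    comparisons.append((a, b) if rng.random() < p_prefer_a else (b, a))

A = np.array([a for a, b in comparisons])
B = np.array([b for a, b in comparisons])
diffs = outcomes[A] - outcomes[B]                 # preferred minus dispreferred features

# Fit the reward weights by gradient ascent on the comparison log-likelihood.
w = np.zeros(n_features)
lr = 0.5
for _ in range(2000):
    p = 1 / (1 + np.exp(-(diffs @ w)))            # model's probability of the observed choice
    w += lr * ((1 - p) @ diffs) / len(diffs)      # gradient of the average log-likelihood

print("recovered direction:", w / np.linalg.norm(w))
print("true direction:     ", true_w / np.linalg.norm(true_w))
```

The point of the sketch is only that the "description" of values lives in the learned weights, recovered from behaviour, rather than in anything a human had to articulate in fine-grained detail.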
Even if the top-down approach seems to work, and we think we've solved the general problem of AI alignment for any possible human values, we can't be sure we've done that until we test it on the whole range of relevant values and demonstrate alignment success across that test set.
If we test our candidate alignment solution on a test set, then we run into the issue of deceptive alignment anyway. And if we are not sure about our alignment solution but run our AGI as a real-world test, then there is the risk that all humans are killed.
I know that AI alignment researchers don’t aim to hand-code human values into AI systems, and most aim to ‘implicitly describe human values’. Agreed.
The issue is, which human values are you trying to implicitly incorporate into the AI system?
I guess if you think that all human values are generic, computationally interchangeable, extractable (from humans) by the same methods, and able to be incorporated into AIs using the same methods, then that could work, in principle. But if we don't explicitly consider the whole range of human value types, how would we even test whether our generic methods could work for all relevant value types?
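One way to make that testing worry concrete: even if a single generic method fits human preferences well in aggregate, we would still want to check its performance separately for each type of value. A toy sketch follows; the value-type labels, data, and function name are hypothetical, and only illustrate the shape such a per-type test could take.

```python
# Toy sketch (hypothetical categories and data): check a learned reward model's
# preference-prediction accuracy separately for each value type, not just overall.
from collections import defaultdict

def preference_accuracy_by_type(model_score, heldout_comparisons):
    """heldout_comparisons: (preferred, dispreferred, value_type) tuples, where
    value_type is a label like 'welfare', 'sacred', 'aesthetic', ...
    model_score: callable mapping an outcome to a scalar reward."""
    hits, totals = defaultdict(int), defaultdict(int)
    for preferred, dispreferred, value_type in heldout_comparisons:
        totals[value_type] += 1
        if model_score(preferred) > model_score(dispreferred):
            hits[value_type] += 1
    return {t: hits[t] / totals[t] for t in totals}

# Hypothetical usage: outcomes are just numbers, the "model" scores them by identity.
demo = [(3.0, 1.0, "welfare"), (2.0, 5.0, "sacred"), (4.0, 0.5, "welfare")]
print(preference_accuracy_by_type(lambda x: x, demo))
# -> {'welfare': 1.0, 'sacred': 0.0}
```

A low score on one value type would flag a kind of value the generic method fails to capture, even when the aggregate numbers look fine; but building such a test already presupposes an explicit taxonomy of value types, which is the point at issue.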