As a toy model, let’s say the friendly utility function U has a hundred valuable components—friendship, love, autonomy, and so on—each assumed to take a positive numeric value. Then, to ensure that we don’t lose any of these, U is defined as the minimum of all hundred components.
This sounds like a cure that might be worse than the disease. (“Oh, the AI has access to thousands of interventions which could reliably increase the values of 99 of the components by a factor of a thousand… but it turns out that every one of them would end up decreasing the value of one component by 0.001%, so none of them can be implemented.”)
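A minimal numerical sketch of this failure mode, assuming the slightly harmed component happens to be the one currently setting the minimum (component values, the intervention, and the 0.001% figure are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: 100 valuable components, all positive; U is their minimum.
components = rng.uniform(1.0, 10.0, size=100)

def utility(values):
    """Min-aggregated utility: U equals the worst-off component."""
    return values.min()

# Hypothetical intervention: multiply 99 components by a thousand, but
# shave 0.001% off the component that currently defines the minimum.
worst = components.argmin()
after = components * 1000.0
after[worst] = components[worst] * (1 - 1e-5)

print(f"U before intervention: {utility(components):.6f}")
print(f"U after intervention:  {utility(after):.6f}")
# The min-maximising agent rejects the intervention: U strictly decreases,
# even though 99 components improved a thousandfold.
```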