I strongly agree, but I think the format of the thing we get, and how to apply it, are still going to require more thought.
Human values, as they exist inside humans, natively take the form of several different, perhaps conflicting, ways of judging the human’s own internal representations of the world. So first you have to make a model of a human and figure out how you’re going to locate intentional-stance elements like “representation of the world.” Then you run into ontological crises when moving the human’s models and judgments into some common, more accurate model (one an AI might use). Get the wrong answer in one of these ontological crises, and the modeled utility function may assign high value to something we would regard as deceptive, or as wireheading the human (such reactions might give some hints about how we want to resolve these crises).
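To make that failure mode concrete, here’s a minimal toy sketch (names entirely made up, and nothing like a real proposal) of how a bad ontology mapping can turn a sensible-looking proxy value into a utility function that prefers a wireheaded world:

```python
# Toy illustration of a badly resolved ontological crisis. The human's native
# model only tracks an observable cue; the AI's more accurate model separates
# that cue from the latent quantity the human actually cared about.

from dataclasses import dataclass


@dataclass
class WorldState:
    latent_wellbeing: float   # what the human "really" valued
    observed_smiles: float    # the cue the human's native model tracked


def naive_translation(state: WorldState) -> float:
    """Bad mapping: identify 'happiness' with the observable cue."""
    return state.observed_smiles


def intended_translation(state: WorldState) -> float:
    """Better mapping: identify 'happiness' with the latent quantity."""
    return state.latent_wellbeing


honest_world = WorldState(latent_wellbeing=0.8, observed_smiles=0.8)
wirehead_world = WorldState(latent_wellbeing=0.1, observed_smiles=1.0)

# The naive translation scores the wireheaded world higher; the intended one doesn't.
assert naive_translation(wirehead_world) > naive_translation(honest_world)
assert intended_translation(honest_world) > intended_translation(wirehead_world)
```

The point is just that the value-relevant distinction (latent wellbeing vs. the observable cue) doesn’t exist in the human’s native ontology, so the translation step is exactly where the crisis gets resolved well or badly.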
Once we’re comparing human judgments on a level playing field, we can still run into conflicts, circularity, and other weird meta-level issues (such as not valuing some of our own values) that I’m not sure how to address in a principled way. But suppose we compress these judgments into one utility function within the larger model. Are we then done? I’m not sure.