This is related to the question “Are human values in humans, or are they in models of humans?”
Suppose you’re building an AI to learn human values and apply them to a novel situation.
The “human values are in humans” ethos is that the way humans compute values is the thing the AI should learn: maybe it can abstract away many kinds of noise, but it shouldn’t be making any big algorithmic overhauls. It should just find the value-computation inside the human (probably with human feedback) and then apply it to the novel situation.
The “human values are in models of humans” take is that the AI can throw away a lot of information about the actual human brain, and instead should find good models (probably with human feedback) that have “values” as a component of a coarse-graining of human psychology, and then apply those “good” models to the novel situation.
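To make the structural contrast concrete, here is a minimal toy sketch, not anything from the original discussion: the “human,” the weights, and every function name are made up for illustration. The first framing reuses the human’s own value computation on the new situation; the second fits a coarse-grained model and queries its “values” component.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# --- Toy "human": internally weighs honesty and comfort (hypothetical numbers). ---
TRUE_WEIGHTS = {"honesty": 0.7, "comfort": 0.3}

def human_value_computation(situation: Dict[str, float]) -> float:
    """The human's own (detailed) value computation."""
    return sum(TRUE_WEIGHTS[k] * situation.get(k, 0.0) for k in TRUE_WEIGHTS)

# --- Framing 1: "values are in humans" ---
# The AI tries to recover the human's value computation itself (perhaps denoised,
# but not algorithmically overhauled) and runs that computation on the novel situation.
def learn_value_computation() -> Callable[[Dict[str, float]], float]:
    # In reality this would involve interpretability and human feedback;
    # here we simply hand back the human's computation unchanged.
    return human_value_computation

# --- Framing 2: "values are in models of humans" ---
# The AI throws away most internal detail and fits a coarse-grained psychological
# model in which "values" is one component, then queries that component.
@dataclass
class CoarseModel:
    values: Dict[str, float]  # the "values" part of the coarse-graining

    def evaluate(self, situation: Dict[str, float]) -> float:
        return sum(self.values[k] * situation.get(k, 0.0) for k in self.values)

def fit_coarse_model() -> CoarseModel:
    # Pretend these weights were fit from behavior and feedback; they are made up here.
    return CoarseModel(values={"honesty": 0.65, "comfort": 0.35})

if __name__ == "__main__":
    novel_situation = {"honesty": 1.0, "comfort": -0.5}
    print("in-humans framing: ", learn_value_computation()(novel_situation))
    print("in-models framing: ", fit_coarse_model().evaluate(novel_situation))
```

The sketch flattens a lot, but it shows where the two framings differ: in the first, the object the AI applies to the novel situation is (a cleaned-up copy of) the human’s computation; in the second, it is a component of a model the AI built, which may discard most of what the actual brain does.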