I’d also add that having a system whose abstractions are close to human ones is insufficient for safety, because optimizing over those abstractions puts them under stress.
I do think it’s plausible that any AI modelling humans will model humans as having preferences, but 1) I’d expect these preference models to be calibrated on normal world states, and not to extend properly off-distribution (i.e., they break down as soon as the AI starts doing things with nanomachines that humans can’t reason properly about), and 2) “pointing” at the right part of the AI’s world model to yield preferences, rather than at a proxy that better models your human feedback mechanism, is still an unsolved problem. (The latter point is outlined in the post in some detail, I think?) I also think that 3) there’s a possibility that there is no simple, natural core of human values, simpler than “model the biases of people in detail”, for an AI to find.