The issue might be that humans are too incoherent and philosophically confused for their “values” to stand for anything concrete; e.g., we almost certainly don’t have concrete utility functions.
One of the most basic concepts in natural language processing is valence/sentiment extraction: “am I happy or sad about this?”. This is a direct measurement of “how well does the situation conform to my human values?”, which is exactly what we’d want the model to optimize. Even tiny NLP networks have clearly interpretable signals (neurons, activations, linear probes, etc.) of valence/sentiment. So this is really not hard to find: it stands out like a sore thumb as soon as you start analyzing human text. Humans are adaptation-executing agents that try to optimize a complex mess of things, and “how optimal is this, and why?” is one of the main things we talk and complain about all the time. Whether this system is in places incoherent or Dutch-bookable, and so fails the theoretical requirements for a utility function, is a separate question (humans have numerous perceptual biases and often-unhelpful mental heuristics, so the answer is almost certainly “yes”), but the basic signal is really easy to find.
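For concreteness, here’s a minimal sketch of what such a linear probe looks like: freeze a small language model, mean-pool its final-layer activations, and fit a plain logistic regression to read off valence. The model choice (DistilBERT) and the toy four-sentence dataset are illustrative assumptions on my part, not anything specific from the argument above.

```python
# Minimal sketch of a linear sentiment/valence probe on a small model's
# frozen activations. Assumes the HuggingFace `transformers` library and
# scikit-learn; the model and tiny dataset are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

texts = [
    "I love this, it worked perfectly.",     # positive valence
    "What a wonderful surprise!",            # positive valence
    "This is awful and broke immediately.",  # negative valence
    "I'm so disappointed with the result.",  # negative valence
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

with torch.no_grad():
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    # Mean-pool the final-layer activations into one vector per sentence.
    hidden = model(**inputs).last_hidden_state        # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    features = (hidden * mask).sum(1) / mask.sum(1)   # (batch, dim)

# The "probe" is just a linear classifier over the frozen activations:
# if it separates the classes, valence is linearly readable in the network.
probe = LogisticRegression(max_iter=1000).fit(features.numpy(), labels)
print(probe.predict(features.numpy()))  # should recover [1, 1, 0, 0]
```

In practice one would fit the probe on a held-out labeled corpus rather than four sentences, but the point stands: the entire “values signal” extractor here is a single linear layer over activations the model already computes.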