The issue might be that humans are too incoherent and philosophically confused for their “values” to stand for anything concrete; e.g., we almost certainly don’t have concrete utility functions.
One of the most basic concepts in natural language processing is valence/sentiment extraction: “am I happy or sad about this?”. This is a direct measurement of “how well does the situation conform to my human values?”, which is exactly what we’d want the model to optimize. Even tiny NLP networks have clearly interpretable signals (neurons, activations, linear probes, etc.) of valence/sentiment. So this is really not hard to find: it stands out like a sore thumb as soon as you start analyzing human text. Humans are adaptation-executing agents that try to optimize a complex mess of things, and “how optimal is this, and why?” is one of the main things we talk and complain about all the time. Whether this system is in places incoherent or Dutch-bookable, and so fails the theoretical requirements for a utility function, is a separate question (humans have numerous perceptual biases and often-unhelpful mental heuristics, so the answer is almost certainly “yes”), but the basic signal is really easy to find.
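For concreteness, here’s a minimal sketch of what such a linear probe looks like: freeze a small language model, mean-pool its final-layer activations, and fit a plain logistic regression to read off valence. The model choice (DistilBERT) and the toy four-sentence dataset are illustrative assumptions on my part, not anything specific from the argument above.

```python
# Minimal sketch of a linear sentiment/valence probe on a small model's
# frozen activations. Assumes the HuggingFace `transformers` library and
# scikit-learn; the model and tiny dataset are illustrative placeholders.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

texts = [
    "I love this, it worked perfectly.",     # positive valence
    "What a wonderful surprise!",            # positive valence
    "This is awful and broke immediately.",  # negative valence
    "I'm so disappointed with the result.",  # negative valence
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

with torch.no_grad():
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    # Mean-pool the final-layer activations into one vector per sentence.
    hidden = model(**inputs).last_hidden_state        # (batch, seq, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    features = (hidden * mask).sum(1) / mask.sum(1)   # (batch, dim)

# The "probe" is just a linear classifier over the frozen activations:
# if it separates the classes, valence is linearly readable in the network.
probe = LogisticRegression(max_iter=1000).fit(features.numpy(), labels)
print(probe.predict(features.numpy()))  # should recover [1, 1, 0, 0]
```

In practice one would fit the probe on a held-out labeled corpus rather than four sentences, but the point stands: the entire “values signal” extractor here is a single linear layer over activations the model already computes.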