There are two senses in which I agree that we don’t need full-on “capital-V Value alignment”:
We can build things that aren’t utility maximizers (e.g. consider the humble MNIST classifier; see the sketch after this list)
There are some utility functions that aren’t quite right, but are still safe enough to optimize in practice (see “Value Alignment Verification”; but see also “Defining and Characterizing Reward Hacking” for negative results)
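To make the first point concrete, here is a minimal sketch (assuming PyTorch and torchvision are available; the model architecture and hyperparameters are illustrative, not from the original comment) of the kind of “humble MNIST classifier” I have in mind: a supervised model trained to minimize a fixed loss on a fixed dataset, with no notion of acting in the world or maximizing a utility over outcomes.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# A small feed-forward classifier: maps a 28x28 image to 10 class logits.
model = nn.Sequential(
    nn.Flatten(),        # 28x28 image -> 784-dim vector
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # 10 digit classes
)

train_data = datasets.MNIST(root="data", train=True, download=True,
                            transform=transforms.ToTensor())
loader = DataLoader(train_data, batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Standard supervised training loop: minimize cross-entropy on labeled digits.
for epoch in range(1):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

Nothing here searches over world states or plans to increase a score; the “optimization” is gradient descent on parameters at training time, which is part of why such systems sit low on the agency spectrum discussed below.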
But also:
Some amount of alignment is probably necessary in order to build safe agenty things (the more agenty, the higher the bar for alignment, since you start to increasingly encounter perverse instantiation-type concerns—CAVEAT: agency is not a unidimensional quantity, cf. “Harms from Increasingly Agentic Algorithmic Systems”).
Note that my statement was about the relative requirements for alignment in text domains vs. real-world domains. I don’t really see how your arguments are relevant to this question.
Concretely, in domains with vision, we should probably be significantly more worried that an AI system learns something more like an adversarial “hack” on its values, leading to behavior that significantly diverges from what humans would endorse.
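As a rough illustration of what I mean by an adversarial “hack”, here is a minimal sketch in the style of the fast gradient sign method (FGSM, Goodfellow et al.): a tiny pixel-level perturbation that substantially changes what a learned vision model outputs, even though the perturbed image looks essentially unchanged to a human. The `model`, `loss_fn`, and `epsilon` here are placeholders (e.g. a learned value or reward model over images), not anything from the original comment.

```python
import torch

def fgsm_perturb(model, image, target, loss_fn, epsilon=0.03):
    """Return a copy of `image` perturbed to increase the model's loss on `target`."""
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), target)
    loss.backward()
    # Step each pixel in the direction that most increases the loss,
    # bounded in magnitude by epsilon so the change is imperceptible.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()
```

The worry is the analogous failure for a system whose learned values are defined over high-dimensional perceptual inputs: the inputs (or world states) it rates most highly may be “adversarial” in this sense, rather than the things humans would actually endorse.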