I’m an alignment researcher, and I think the problem is tractable. Briefly:
Humans manage to learn human values, despite our not having an exact formalism of those values. However, humans in different cultures learn different values, and humans raised outside of civilisation don’t learn any values, so it’s not as though our values are strongly determined by our biology. Thus, there exists at least one class of general, agentic learning systems that usually end up aligned to human values, as the result of a value formation process that instils human values in a thing which initially lacked them. Fortunately, the human value formation process is available to study, and even to introspect on!
Deep learning has an excellent track record of getting models to represent fuzzy human concepts that we can’t explicitly describe. E.g., GPT-3 can model human language usage far more accurately than any of our explicit linguistic theories. Similarly, Stable Diffusion has clearly learned to produce art, and at no point did philosophical confusion about the “true nature of art” interfere with its training process. Deep learning successes are typically preceded by people confidently insisting that deep models will never learn the domain in question, whether that domain be language, art, music, Go, or poetry. Why should “human values” be special in this regard?
I don’t buy any of the arguments for doom. E.g., the “evolution failed its version of the alignment problem” analogy tells us essentially nothing about how problematic inner alignment will be for us, because training AIs is not like evolution, and there are evolution-specific details that fully explain evolution’s failure to align humans to the goal of maximizing inclusive genetic fitness. I think most other arguments for doom rest on similarly shaky foundations.
But what is stopping any of those “general, agentic learning systems” in the class “aligned to human values” from going meta — at any time — about its values and picking different values to operate with? Is the hope to align the agent and then constantly monitor it to prevent deviancy? If so, why wouldn’t preventing deviancy by monitoring be practically impossible, given that we’re dealing with an agent that will supposedly be able to out-calculate us at every step?
If there’s no ultimate set of shared values, that could lead to a situation where different cultures build AIs with their own values: LiberalBot, ConfucianBot, ChristianBot, and so on.