Natural language exists as a low-bandwidth communication channel for imprinting one person’s mental map onto another person’s. The mental maps themselves are formed through direct interactions with an external environment.
It doesn’t seem impossible to create a mental map just from language: in this case, language itself would play the role of the external environment. But overall I agree with you: it’s uncertain whether we can reach a good level of world understanding from natural language inputs alone.
Regarding your second paragraph:
even if this AI had a complete understanding of human emotions and moral systems, it would not necessarily be aligned.
I’ll quote the last paragraph under the heading “Error”:
Regarding other possible failure modes, note that I am not trying to produce a safety module that, when attached to a language model, will make that language model safe. What I have in mind is more similar to an independent-ethical-thinking module: if the resulting AI states something about morality, we’ll still have to look at the code and try to understand what’s happening, e.g. what the AI exactly means with the term “morality”, and whether it is communicating honestly or is trying to persuade us. This is also why doing multiple tests will be practically mandatory.
Well, if it’s a language model anything like GPT-3, then any discussions about morality that it engages in will likely be permutations and rewordings of what it has seen in its training data. Such models aren’t even guaranteed to produce text that is self-consistent over time, so I would expect to see conflicting moral stances from the AI that derive from conflicting moral stances of humans whose words it trained on. (Hopefully it was at least trained more on the Stanford Encyclopedia of Philosophy and less on Reddit/Twitter/Facebook.)
It would be interesting, though, if we could design a “language model” AI that continuously seeks self-consistency upon internal reflection. Maybe it would repeatedly generate moral statements, use them to predict policies under hypothetical scenarios, look for conflicting predictions, revise the moral statements to minimize the conflict, and retrain on the resulting coherent set. I would expect a process like this to converge over time, especially if it starts from a large sample of human moral opinions, as a typical language model would, since all human moralities form a relatively tight cluster in behavioral policy space. Then maybe we would be one step closer to achieving the C in CEV.
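To make that loop a bit more concrete, here is a deliberately toy Python sketch (my own illustration, not anything proposed in the post). Nothing in it is a real language model: “statements” are just numbers standing in for moral opinions, predict_policy and conflict are stand-ins for the “predict policies under hypothetical scenarios” and “look for conflicting predictions” steps, and retraining is reduced to keeping the least-conflicting statements for the next round. All names and numbers are illustrative assumptions.

```python
import random

# Hypothetical scenarios the statements are evaluated against.
SCENARIOS = ["trolley_problem", "white_lie", "charity_obligation"]

def predict_policy(opinion):
    # Stand-in for "use the statement to predict a policy": one recommended
    # action strength in [0, 1] per hypothetical scenario.
    return [min(1.0, max(0.0, opinion + random.uniform(-0.1, 0.1)))
            for _ in SCENARIOS]

def conflict(policy_a, policy_b):
    # Mean absolute disagreement between the actions two statements imply.
    return sum(abs(a - b) for a, b in zip(policy_a, policy_b)) / len(SCENARIOS)

def reflection_round(opinions, keep_fraction=0.8):
    # One round of "reflection": derive policies, score each statement by its
    # mean conflict with the others, and keep only the most coherent ones
    # (a crude substitute for revising and retraining).
    policies = {name: predict_policy(value) for name, value in opinions.items()}

    def mean_conflict(name):
        others = [conflict(policies[name], policies[other])
                  for other in opinions if other != name]
        return sum(others) / len(others) if others else 0.0

    ranked = sorted(opinions, key=mean_conflict)
    survivors = ranked[:max(1, int(len(ranked) * keep_fraction))]
    return {name: opinions[name] for name in survivors}

# Start from a spread of "human moral opinions" of varying compatibility.
opinions = {f"opinion_{i}": random.random() for i in range(20)}
for _ in range(5):  # a few rounds of reflection
    opinions = reflection_round(opinions)
print(sorted(round(v, 2) for v in opinions.values()))
```

Even in this stripped-down form, the surviving opinions cluster toward the middle of the original spread after a few rounds, which is the kind of convergence the tight-cluster assumption would predict. The hard part, of course, is everything the placeholders hide: actually generating the statements and deriving policies from them with a real model.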
Regardless, I agree with you overall in the sense that sophisticated language models will be necessary for aligning AGI with human morality at all the relevant levels of abstraction. I just don’t think they will be anywhere near sufficient.