Well, if it’s a language model anything like GPT-3, then any discussions about morality that it engages in will likely be permutations and rewordings of what it has seen in its training data. Such models aren’t even guaranteed to produce text that is self-consistent over time, so I would expect to see conflicting moral stances from the AI that derive from conflicting moral stances of humans whose words it trained on. (Hopefully it was at least trained more on the Stanford Encyclopedia of Philosophy and less on Reddit/Twitter/Facebook.)
It would be interesting, though, if we could design a “language model” AI that continuously seeks self-consistency upon internal reflection. Maybe it would repeatedly generate moral statements, use them to predict policies under hypothetical scenarios, look for any conflicting predictions, develop moral statements that minimize the conflict, and retrain on the resulting coherent moral statements. I would expect a process like this to converge over time, especially if we are starting from a large sample of human moral opinions like a typical language model would, since all human moralities form a relatively tight cluster in behavioral policy space. Then maybe we would be one step closer to achieving the C in CEV.
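As a toy illustration of why I’d expect convergence, here’s a minimal sketch (everything in it is hypothetical: stances are reduced to points in a 1-D policy space, “conflict” is just the variance across stances, and each “retraining” round nudges every stance toward the consensus):

```python
import random
import statistics

# Hypothetical toy model: each moral stance is a point in a 1-D policy
# space. "Conflict" is the variance across stances. Each round, every
# stance moves partway toward the consensus, standing in for retraining
# on the statements that minimize mutual conflict.

def consistency_loop(stances, rounds=50, step=0.2):
    for _ in range(rounds):
        consensus = statistics.mean(stances)
        stances = [s + step * (consensus - s) for s in stances]
    return stances

random.seed(0)
stances = [random.gauss(0, 1) for _ in range(100)]  # scattered human opinions
before = statistics.pvariance(stances)
after = statistics.pvariance(consistency_loop(stances))
print(before > after)  # prints True: conflict shrinks every round
```

Because each round contracts every deviation from the consensus by a constant factor, the conflict measure decays geometrically while the consensus itself stays fixed, which is the sense in which starting from a tight cluster of opinions should help.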
Regardless, I agree with you overall in the sense that sophisticated language models will be necessary for aligning AGI with human morality at all the relevant levels of abstraction. I just don’t think it will be anywhere near sufficient.