Thanks for the reply. To respond, let me be more precise about which problems I believe this approach could address.
a) Assuming that the AI is already loyal, or at least follows our wishes: how can we make sure that it takes human values into account in its decisions, so that it doesn’t mistakenly act against those values?
I believe that running every decision through a “Classical Virtue Ethics” check would be one way to do this (a rough sketch of what such a check could look like follows at the end of this point).
As an analogy: we can train an AI on grammar and rhetoric by giving it a huge amount of literature to read, and it will come to master the rules and application of both. In the same way, it’s plausible that it could be brought to master Classical humanist virtue ethics as well, seeing as that is just another art, and not different in principle.
So I propose that we could train it in virtue ethics in the same way it’s trained in grammar and rhetoric. If you call that “just train it” and mean that as a dismissal of the approach, I’d appreciate some more in-depth engagement with the analogy to grammar and rhetoric.
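To make the idea of a “check” more concrete, here is a minimal sketch, assuming we already have a model trained on a classical virtue-ethics corpus that can score a proposed action against each of the cardinal virtues. The function names, the scoring scale, and the threshold are all illustrative assumptions, not a worked-out implementation:

```python
from dataclasses import dataclass

# The four cardinal virtues of the classical tradition, used here as evaluation axes.
VIRTUES = ("prudence", "justice", "temperance", "courage")

@dataclass
class VirtueJudgement:
    virtue: str
    score: float      # 0.0 = clearly violates the virtue, 1.0 = clearly exemplifies it
    rationale: str

def judge_against_virtue(action_description: str, virtue: str) -> VirtueJudgement:
    # Placeholder: in a real system this would query the model trained on the
    # virtue-ethics corpus and ask it to evaluate the action against one virtue.
    return VirtueJudgement(virtue, 0.5, "stub judgement; no trained model attached")

def passes_virtue_check(action_description: str, threshold: float = 0.5) -> bool:
    # The "control": evaluate the candidate action against every virtue and
    # veto it if any single score falls below the threshold.
    judgements = [judge_against_virtue(action_description, v) for v in VIRTUES]
    return all(j.score >= threshold for j in judgements)

if __name__ == "__main__":
    print(passes_virtue_check("reallocate compute away from a safety evaluation to hit a deadline"))
```

The point is only the shape of the loop: every candidate decision passes through the virtue evaluation before it is acted on, much as a human practitioner of virtue ethics would pause to ask whether an act is prudent, just, temperate, and courageous.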
b) Assuming that the AI isn’t “inwardly aligned” but is instead “ethically neutral”, at least before it’s trained: how can we increase the chance that it develops inward alignment with human values?
a) AI seems to take on some of the values of the text it is trained on. If we train it on Classical virtue ethics and up-weight this input relative to the rest of the training data, we might increase the chances of it becoming “inwardly aligned” (a minimal sketch of what such up-weighting could look like is at the end of this comment). Edit: it seems you are engaging with this type of idea in your “fallacy of dumb superintelligence”.
b) We could extrapolate from how humans become “aligned” with virtues. In classical virtue ethics, one’s self-concept is central to this: we achieve alignment by emulating a role model and seeking to adopt the role model’s perspective. In the same manner, we could try to find a way to tie the AI’s self-concept to the classical virtues.
Note: I see this approach is also proposed by “AI Alignment Proposals”
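To illustrate the up-weighting idea under a) above, here is a minimal sketch of one way a training mixture could give extra weight to a virtue-ethics corpus. The corpus names, sizes, and weights are invented purely for illustration:

```python
import random

# Hypothetical training mixture: (corpus name, number of documents, sampling weight).
# Up-weighting the virtue-ethics corpus means it is sampled far more often per
# document than its raw share of the data would suggest.
CORPORA = [
    ("general_web_text", 1_000_000, 1.0),
    ("literature", 100_000, 1.0),
    ("classical_virtue_ethics", 5_000, 20.0),  # e.g. Aristotle, Cicero, Seneca
]

def sample_corpus(corpora):
    # Pick the corpus to draw the next training document from,
    # proportionally to size * weight.
    totals = [size * weight for _, size, weight in corpora]
    names = [name for name, _, _ in corpora]
    return random.choices(names, weights=totals, k=1)[0]

if __name__ == "__main__":
    draws = [sample_corpus(CORPORA) for _ in range(10_000)]
    for name, _, _ in CORPORA:
        print(name, draws.count(name) / len(draws))
```

The effect is that the model encounters the virtue-ethics texts much more often than their raw size alone would allow, which is one concrete way of “increasing the value it gives to this input”.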