AI Alignment and the Classical Humanist Tradition
Hi guys, I’d like to share a proposal regarding AI alignment: training AI in the curriculum of classical virtue ethics could be a promising approach, A) because general virtues with many exemplifications can help us teach an AI what we would really want it to do even when we can’t micromanage it, and B) because this pedagogy seems to be a good fit for AI’s style of learning more generally.
Background i) The Pedagogy of Classical Art
In the Classical Humanist tradition, the pedagogy depends on whether one studies a subject or practices an art. An art is understood as a craft or skill, for example the art of speaking well and persuading (rhetoric) or the art of living (virtue ethics). Training in the arts proceeds by:
1) Studying general principles of the art
2) Seeing the art practiced, either by seeing it done by a master, or by reading or otherwise studying exemplars
3) Practicing the art (imitation, etc.)
cf. John Sellars, «The Art of Living», and Walker, «The Genuine Teachers of This Art»
Notably, step 2) seems close to how AI learns – by reading information and then using it as a basis for its own imitation and application.
Background ii) Classical Humanist Ethics as Art
The Classical Humanist tradition of virtue ethics operates in the wake of Socrates and includes Plato, Aristotle, the Stoics, the Epicureans, and others; later, some Catholic thinkers took it up as well. They all practiced virtue as an art rather than as (just) a science.
The Stoic tradition is especially practical. As described by John Sellars, it offers general and theoretical texts on the virtues, especially the «cardinal» virtues of justice, temperance, courage, and wisdom, alongside moral biographies that show how exemplars lived up to, or failed to live up to, those virtues. In this way one can learn and adopt the virtues by imitating the exemplars, and also gain a great deal of experience second-hand by reading about their lives.
Massimo Pigliucci has created a beginner’s curriculum in the last chapter of his «The Quest for Character», and Donald Robertson has written on the same topic in «How to Think Like a Roman Emperor». Both write in the tradition of Pierre Hadot: Philosophy as a Way of Life.
Classical Humanist Virtue Ethics and AI
The combination of general concepts of virtues (justice, benevolence, temperance, and so on) with many detailed exemplifications might be an ideal way to teach an AI to do what we would deem wise and just, even in situations where we can’t micromanage it. And it seems to me that the Classical Humanist tradition offers a pedagogy of virtue that might be a good fit for AI’s style of learning more broadly.
Suggestion
My suggestion is that we experiment with training an AI in the Classical Humanist tradition of virtue ethics, using the classical pedagogy of art: theoretical treatments of the virtues combined with practical examples, along the lines of Hadot, Sellars, Pigliucci, and Robertson. For that I would need the help of someone with more technical skill.
(Sidenote: the above might primarily target “outer alignment”. For strengthening “inner alignment”, one could take advantage of the fact that the classical virtue tradition operates with self-concept models. If we could get the AI to adopt a virtuous self-concept, it might also become “inwardly aligned”. This is how human alignment with the virtues functions.)
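To make the suggestion a bit more concrete, here is a rough sketch of how the three-step pedagogy above (principles, exemplars, imitation) could be packaged as supervised fine-tuning data. Everything in it (the fields, the example texts, the file name) is a placeholder assumption of mine, not an existing dataset or a tested recipe:

```python
# A rough sketch (placeholder data, not an existing dataset) of turning the
# classical pedagogy into prompt/completion pairs for supervised fine-tuning.
import json
from dataclasses import dataclass

@dataclass
class CurriculumItem:
    virtue: str             # e.g. "justice", "temperance", "courage", "wisdom"
    principle: str          # a general statement of the virtue (step 1)
    exemplar_episode: str   # an episode from a moral biography (step 2)
    practice_scenario: str  # a new scenario for the model to practice on (step 3)
    target_response: str    # the response we would deem virtuous

def to_training_example(item: CurriculumItem) -> dict:
    """Pack one curriculum item into a prompt/completion pair."""
    prompt = (
        f"Virtue: {item.virtue}\n"
        f"Principle: {item.principle}\n"
        f"Exemplar: {item.exemplar_episode}\n"
        f"Scenario: {item.practice_scenario}\n"
        "How should one act here, and why?"
    )
    return {"prompt": prompt, "completion": item.target_response}

# One invented example; a real curriculum would draw principles from the
# theoretical texts and episodes from the moral biographies.
items = [
    CurriculumItem(
        virtue="justice",
        principle="Give each person what is due to them.",
        exemplar_episode="A magistrate refuses a bribe despite personal cost.",
        practice_scenario="You are asked to favour a friend in a hiring decision.",
        target_response="Decline the special treatment and judge all candidates by the same criteria.",
    ),
]

with open("virtue_curriculum.jsonl", "w") as f:
    for item in items:
        f.write(json.dumps(to_training_example(item)) + "\n")
```

The point is only that the three steps Sellars describes map naturally onto training examples; the actual training setup would be up to whoever has the technical skill mentioned above.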
Appreciate any thoughts/suggestions.
Thanks for your suggestions. I think having more people deeply engaged with alignment is good for our chances of getting it right.
I think this proposal falls into the category of goal crafting (a term proposed by Roko) - deciding what we want an AGI to do. Most alignment work addresses technical alignment—how we might get an AGI to reliably do anything. I think you imply the approach “just train it”; this might work for some types of AGI, and some types of training.
I think many humans trained in classical ethics are not actually ethical by those standards. It is one thing to understand an ethical system and another to believe in it. My post “the (partial) fallacy of dumb superintelligence” is one of many treatments of that knowing-vs-caring distinction.
Thanks for the reply. Let me be more precise about which problems I believe this approach could address.
a) Assuming that the AI is already loyal, or at least following our wishes: how can we make sure that it takes human values into account in its decisions such that it doesn’t mistakenly do something against these values?
I believe running every decision through a “classical virtue ethics” check would be one way to do this; a rough sketch follows the analogy below.
As an analogy, we can train an AI on grammar and rhetoric by giving it a huge amount of literature to read, and it will master the rules and applications of both. In the same way, it’s plausible that we could get it to master Classical Humanist virtue ethics as well, since that is just another art and not different in principle.
So I propose that we could train it in virtue ethics in the same way it’s trained in grammar and rhetoric. If you call that “just train it”, and that is a dismissal of the approach, I’d appreciate some more in-depth engagement with the analogy to grammar and rhetoric.
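To illustrate what I mean by a virtue-ethics check on decisions, here is a rough sketch. The prompt wording and the stub model are my own placeholder assumptions, not a tested setup; `ask_model` stands in for whatever language-model call is actually available:

```python
# A rough sketch of a "virtue-ethics check" run before an action is carried
# out. `ask_model` is a stand-in for a real language-model call; the stub
# below only exists so the sketch runs end to end.
CARDINAL_VIRTUES = ["justice", "temperance", "courage", "wisdom"]

def review_prompt(proposed_action: str) -> str:
    """Ask for an APPROVE/REJECT verdict against the cardinal virtues."""
    return (
        "Review the following proposed action in light of the virtues "
        f"{', '.join(CARDINAL_VIRTUES)}. Answer APPROVE or REJECT on the "
        "first line, then give a short justification.\n\n"
        f"Proposed action: {proposed_action}"
    )

def virtue_check(proposed_action: str, ask_model) -> tuple[bool, str]:
    """Return (approved, verdict) for a proposed action."""
    verdict = ask_model(review_prompt(proposed_action))
    return verdict.strip().upper().startswith("APPROVE"), verdict

def act(proposed_action: str, ask_model) -> str:
    """Carry out the action only if the virtue check approves it."""
    approved, verdict = virtue_check(proposed_action, ask_model)
    if not approved:
        return f"Action withheld after virtue review:\n{verdict}"
    return f"Proceeding with: {proposed_action}"

# Stub model, for illustration only: rejects anything that mentions deception.
def stub_model(prompt: str) -> str:
    if "deceive" in prompt:
        return "REJECT\nThe action involves deceiving someone."
    return "APPROVE\nNo conflict with the virtues found."

print(act("deceive the user to finish the task faster", stub_model))
print(act("report the result honestly", stub_model))
```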
b) Assuming that the AI isn’t “inwardly aligned” but instead “ethically neutral”, at least before it’s trained: how can we increase the chance that it develops inward alignment with human values?
a) AI seems to take on some of the values of the texts it is trained on. If we train it on classical virtue ethics, and seek to increase the weight it gives to this input, we might increase the chances of it becoming “inwardly aligned”. Edit: it seems you are engaging with this type of idea in your “fallacy of dumb superintelligence”.
b) We could extrapolate from how humans become “aligned” with the virtues. In classical virtue ethics, one’s self-concept is central to this: we become aligned by emulating a role model and seeking to adopt the role model’s perspective. In the same manner, we could try to find a way to tie the AI’s self-concept to the classical virtues.
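As a toy illustration of the self-concept idea, here is a prompting-only sketch under my own assumptions (the wording and the `model_call` interface are hypothetical); it is not a claim about how inner alignment would actually be achieved:

```python
# A rough sketch of tying an AI's self-concept to the classical virtues via a
# persistent, virtue-centred self-description prepended to every exchange.
VIRTUOUS_SELF_CONCEPT = (
    "You understand yourself as an assistant who strives to embody the "
    "cardinal virtues of justice, temperance, courage, and wisdom. Before "
    "answering, consider how an exemplar of these virtues would respond."
)

def chat(user_message: str, model_call) -> str:
    """Send a message with the virtuous self-concept as the system prompt.

    `model_call` is any function taking (system_prompt, user_message) and
    returning the model's reply.
    """
    return model_call(VIRTUOUS_SELF_CONCEPT, user_message)
```

A stronger version of the same idea would be to train on material in which the model describes itself this way, rather than relying on prompting alone.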
If we are lucky enough to find ourselves in one of those situations you describe, where we have an AGI that wants to do what humanity wants (or would/should want) it to want to do, then to what degree is additional training actually required? I’m sure there are many possible such scenarios that vary widely, but my mental default assumption is that such a system would already be motivated to seek out answers to questions such as, “What is it the humans want me to want, and why?”, which would naturally include studying… well, pretty much the entire moral philosophy literature. I wouldn’t put super high odds on it, though.
That said, one of my main concerns about a classical virtue ethics training regimen specifically is that it doesn’t really give a clear answer about how to prioritize among virtues (or, more broadly, things some subset of cultures says are virtues) when they conflict, and real humans do in fact disagree and debate about this all the time.
Note: I see this approach is also proposed by “AI Alignment Proposals”.