I agree that combinations of pure consequentialism and deontology don’t describe all possible goals for AGI.
“Do what this person means by what they say” seems like a perfectly coherent goal. It’s neither consequentialist nor deontological (in the traditional sense of fixed deontological rules). I think this is subtly different from IRL or other schemes for maximizing an unknown utility function of the user’s (or humanity’s) preferences: this goal limits the agent to reasoning about the meaning of only one utterance at a time, not the broader space of true preferences.
This scheme gets much safer if you can include a second (probably primary) goal of “don’t do anything major without verifying that my person actually wants me to do it”. Of course defining “major” is a challenge, but I don’t think it’s an unsolvable one, particularly if you’re aligning an AGI with some understanding of natural language. I’ve explored this line of thought a little in Corrigibility or DWIM is an attractive primary goal for AGI, and I’m working on another post to explore it more thoroughly.
In a multi-goal scheme, making “don’t do anything major without approval” the strongest goal might provide some additional safety. If it turns out that alignment isn’t stable and reflection causes the goal structure to collapse, the AGI probably winds up not doing anything at all. Of course there are still lots of challenges and things to work out in that scheme.
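To make the priority ordering concrete, here’s a toy sketch (not from the comment above, just an illustration) of what “the verification goal dominates the do-what-I-mean goal” could look like. Everything in it is hypothetical: the names (Action, is_major, verify_with_user), the impact score, and the threshold are placeholders, and actually defining “major” is exactly the open problem mentioned above.

```python
# Toy sketch of a goal ordering where "don't do anything major without
# approval" gates the do-what-this-instruction-means goal.
# All names and the impact heuristic are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class Action:
    description: str
    estimated_impact: float  # stand-in for however "major" gets detected


MAJOR_IMPACT_THRESHOLD = 0.5  # placeholder; defining "major" is the hard part


def is_major(action: Action) -> bool:
    return action.estimated_impact >= MAJOR_IMPACT_THRESHOLD


def verify_with_user(action: Action) -> bool:
    # Placeholder for actually checking with the principal.
    answer = input(f"Approve '{action.description}'? [y/N] ")
    return answer.strip().lower() == "y"


def act(action: Action) -> None:
    # Primary goal: never take a major action without explicit approval.
    if is_major(action) and not verify_with_user(action):
        print(f"Withheld pending approval: {action.description}")
        return
    # Secondary goal: carry out what the person meant by this one instruction.
    print(f"Executing: {action.description}")


act(Action("tidy up my downloads folder", estimated_impact=0.2))
act(Action("liquidate my savings account", estimated_impact=0.9))
```

The point of the ordering is the failure mode: if the goal structure degrades under reflection, an agent whose strongest remaining goal is “check first” defaults toward inaction rather than unapproved action.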