I read the whole thing because of its similarity to my proposals about metacognition as an aid to both capabilities and alignment in language model agents.
In both this paper and my work, metacognition is a way to keep an AI from doing the wrong thing (from the AI’s perspective). The authors explicitly do not address the broader alignment problem of AIs wanting the wrong things (from humans’ perspective).
They note that “wiser” humans are more prone to serve the common good by taking more perspectives into account. They wisely do not propose wisdom as a solution to the problem of defining human values or beneficial action from an AI: wisdom here is an aid to fulfilling your values, not a definition of those values. Their presentation is a bit muddled on this point, but I think their final sections on the broader alignment problem make this scoping clear.
My proposal of a metacognitive “internal review” or “System 2 alignment check” shares this weakness. It doesn’t say what the right thing to point an AGI at would be; it merely guards against a couple of possible routes to goal mis-specification.
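For concreteness, here is a minimal sketch of the kind of metacognitive check I have in mind for a language model agent. It is illustrative only: the function names and prompt wording are mine, not from the paper or my earlier posts, and `call_llm` stands in for whatever completion function the agent already uses.

```python
# Minimal sketch of a metacognitive "System 2 alignment check" for an LLM agent.
# `call_llm` is a stand-in for any prompt -> text completion function; the
# prompt wording and APPROVE/REJECT convention are illustrative assumptions.
from typing import Callable

def system2_alignment_check(call_llm: Callable[[str], str],
                            instructions: str,
                            proposed_action: str) -> bool:
    """Have the model explicitly review a proposed action against the
    user's instructions before the agent is allowed to execute it."""
    review = call_llm(
        "You are reviewing an action proposed by an AI agent.\n"
        f"User instructions: {instructions}\n"
        f"Proposed action: {proposed_action}\n"
        "Does the action follow the instructions and avoid unintended side "
        "effects? Answer APPROVE or REJECT, then explain briefly."
    )
    return review.strip().upper().startswith("APPROVE")

def agent_step(call_llm: Callable[[str], str], instructions: str) -> str:
    """One agent step: propose an action (System 1), then review it (System 2)."""
    proposed = call_llm(f"Instructions: {instructions}\nPropose the next action.")
    if system2_alignment_check(call_llm, instructions, proposed):
        return proposed              # vetted action: safe to execute
    return "ESCALATE_TO_HUMAN"       # blocked: ask for clarification instead
```

Note that nothing in a check like this says what the instructions or values should be; it only catches actions that contradict them. That value-specification problem is the one at issue below.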
This article explicitly refuses to grapple with this problem:
3.4.1. Rethinking AI alignment
With respect to the broader goal of AI alignment, we are sympathetic to the goal but question this definition of the problem. Ultimately safe AI may be at least as much about constraining the power of AI systems within human institutions, rather than aligning their goals.
I think limiting the power of AI systems within human institutions is only sensible if you’re thinking of tool AI or weak AGI; thinking you’ll constrain superhuman AIs seems like an obvious fool’s errand. I think this proposal is meant to apply to AI, not ever-improving AGI, which is fine if we have a long time between transformative AI and real AGI.
I think it would be wildly foolish to assume we have that gap between transformative AI and real AGI. A highly competent assistant may soon be your new boss.
I have a different way to duck the problem of specifying complex and possibly fragile human values: make the AGI’s central goal simply to follow instructions. Something smarter than you that wants nothing more than to follow your instructions is counterintuitive, but I think it’s both consistent and, in retrospect, obvious. Not only is this alignment target safer, it’s also far more likely to be the one our first AGIs are given. People are going to want the first semi-sapient AGIs to follow instructions, just as LLMs do, not to make their own judgments about values or ethics. And once we’ve started down that path, there will be no immediate reason to tackle the full value alignment problem.
(In the longer term, we’ll probably want to use instruction-following as a stepping-stone to full value alignment, since instruction-following superintelligence would eventually fall into the wrong hands and receive some really awful instructions. But surpassing human intelligence and agency doesn’t necessitate shooting for full value alignment right away.)
A final note on the authors’ attitudes toward alignment: I also read the paper because I noticed Yoshua Bengio and Melanie Mitchell among the authors. This stance is what I’d expect from Mitchell, who has steadfastly refused to address the alignment problem, partly because she has long timelines and partly because she believes in a “fallacy of dumb superintelligence” (I point out where she goes wrong in “The (partial) fallacy of dumb superintelligence”).
I’m disappointed to see Bengio lend his name to this refusal to grapple with the larger alignment problem. I hope it doesn’t signal a lasting commitment to this stance; I had hoped for more from him.