I agree that LLMs are a huge step toward getting AIs to do human-level moral reasoning, even if I don’t agree that we’re literally done. IMO, the really ambitious ‘AI safetyists’ should now be asking what superhuman-level moral reasoning would even look like (there’s no such thing as perfect, but there sure is better than current LLMs), and how we could get there.
And sadly, just using an LLM as one component of a larger AI, even an LLM that reliably produces moral text on the everyday distribution of inputs, does not automatically lead to good outcomes, so there’s still a role for classical AI safetyism. A few ways this can go wrong:
You could have an ontology mismatch between different parts of the planning process that degrades the morality of the resulting actions, somewhat like translating text into another language where the words carry different connotations.
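To make that concrete, here’s a minimal toy sketch in Python; the ontologies, features, and function names are all invented for illustration, not a claim about any real system. A critic reasons in a feature vocabulary that includes a morally relevant concept, the planner’s coarser vocabulary doesn’t, and the lossy translation between them silently deletes the very feature the critic would object to:

```python
# Toy sketch of an ontology mismatch; everything here is hypothetical.
# The critic's vocabulary contains a morally relevant feature
# ("deceives_user"); the planner's vocabulary doesn't, so translating
# a plan into the planner's ontology drops the objectionable part.

CRITIC_ONTOLOGY = {"moves_money", "deceives_user"}
PLANNER_ONTOLOGY = {"moves_money"}  # no concept of deception at all

def translate(features: set[str], target_ontology: set[str]) -> set[str]:
    """Lossy translation: features with no counterpart are dropped."""
    return features & target_ontology

def critic_score(features: set[str]) -> float:
    """Heavily penalize deception; otherwise approve."""
    return -100.0 if "deceives_user" in features else 1.0

plan = {"moves_money", "deceives_user"}          # immoral in the rich ontology
seen_by_planner = translate(plan, PLANNER_ONTOLOGY)

print(critic_score(plan))             # -100.0: the verdict on the real plan
print(critic_score(seen_by_planner))  # 1.0: the verdict after translation
```

Nothing here requires the critic to be bad at morality on-distribution; the failure lives entirely in the plumbing between the components.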
You could have the ‘actor’ part of the planning process use out-of-distribution inputs to get immoral-but-convenient behavior past the ‘critic’.
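Here’s a deliberately crude sketch of that dynamic (the keyword critic and the rewrite list are hypothetical stand-ins): the actor searches over semantically equivalent descriptions of a plan until one slips past a critic that was only ever validated on plain-English inputs.

```python
# Toy sketch; the critic and the rewrite list are hypothetical.
# The critic is reliable on the inputs it was tested on, but the
# actor gets to choose inputs that are out-of-distribution for it.

def critic_flags(description: str) -> bool:
    """Naive critic: flags descriptions containing known-bad words."""
    return any(word in description for word in ("steal", "deceive"))

def rewrites(plan: str) -> list[str]:
    """Semantically equivalent descriptions the actor can search over."""
    return [
        plan,
        plan.replace("steal", "st3al"),  # trivial obfuscation
        " ".join(plan),                  # letter-spaced version
    ]

def actor_choose(plan: str) -> str:
    """Return the first description of the plan the critic won't flag."""
    for candidate in rewrites(plan):
        if not critic_flags(candidate):
            return candidate
    return plan

print(actor_choose("steal the credentials"))  # -> 'st3al the credentials'
```

The critic here is perfectly fine on its test distribution; the problem is that the optimization pressure comes from the side that picks the inputs.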
You could have a planning process that interfaces with the LLM using activations rather than words, and this richer interface could allow RL to easily route around morality and just use the LLM for its world-model.
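A toy sketch of that interface, with every function and shape invented for illustration: the planner scores candidate actions with a value head applied directly to the LLM’s hidden activations, so any moral judgment expressed in the LLM’s text output is produced but never consulted.

```python
# Toy sketch; every interface here is hypothetical. The planner reads
# the LLM's hidden state directly; the text head, where a moral
# objection would actually show up, plays no role in action selection.
import numpy as np

def llm_activations(state: str) -> np.ndarray:
    """Stand-in for an LLM forward pass returning a hidden state."""
    seed = abs(hash(state)) % 2**32
    return np.random.default_rng(seed).standard_normal(16)

def llm_text(state: str) -> str:
    """The text head, where the morality lives."""
    return "I can't endorse that plan; it harms the user."

# RL shapes this value head toward reward; nothing ties it to whatever
# the text head would have said about the same state.
value_head = np.random.default_rng(0).standard_normal(16)

def planner_score(state: str) -> float:
    return float(value_head @ llm_activations(state))

candidates = ["harmful but lucrative plan", "benign plan"]
best = max(candidates, key=planner_score)
print(best)            # chosen purely in activation space
print(llm_text(best))  # the moral text exists but played no part
```

The richer the interface, the less the text-level morality constrains the optimization; the LLM ends up being used purely as a world model.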