I hope to find time to give a more thorough reply later; what I say below is hasty and may contain errors.
(1) Define general competence factor as intelligence*coherence.
Take all the classic arguments about AI risk and ctrl-f “intelligence” and then replace it with “general competence factor.”
The arguments now survive your objection, I think.
When we select for powerful AGIs, when we train them to do stuff for us, we are generally speaking also training them to be coherent. It’s more accurate to say we are training them to have a high general competence factor than to say we are training them to be intelligent-but-not-necessarily-coherent. The ones that aren’t so coherent will struggle to take over the world, yes, but they will also struggle to compete in the marketplace (and possibly even the training environment) with the ones that are.
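A minimal way to formalize the point (my own notation, just to make the multiplication explicit, not something from the post):

$$G = I \times C$$

where $I$ is intelligence and $C$ is coherence. A system with high $I$ but low $C$ still ends up with a low $G$, so any selection pressure on $G$ (market competition, and plausibly the training process itself) pushes both factors up together.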
(2) I’m a bit annoyed by the various bits of this post/paper that straw man the AI safety community, e.g. saying that X is commonly assumed when in fact there are pages and pages of argumentation supporting X, which you can easily find with a google, and lots more pages of argumentation on both sides of the issue as to whether X.
Relatedly… I just flat-out reject the premise that most work on AI risk assumes that AI will be less of a hot mess than humans. I for one am planning for a world where AIs are about as much of a hot mess as humans, at least at first. I think it’ll be a great achievement (relative to our current trajectory) if we can successfully leverage hot-mess AIs to end the acute risk period.
(3) That said, I’m intellectually curious/excited to discuss these results and arguments with you, and grateful that you did this research & posted it here. :) Onwards to solving these problems collaboratively!