I’m trying to understand how the classical case for AI safety (Bostrom/Yudkowsky argument) best works for LLMs.
My impression is that LLMs have passed an important threshold: they have learned to learn (for instance, GPT-3 understands translation and can pick up a new language). However, we don't see rapid self-improvement or power-seeking. I can come up with four interpretations of what that means:
1. Anthropic bias: Current AIs actually do rapidly self-improve and cause extinction; however, the only human observers alive in August 2023 are somehow extremely lucky, i.e. we're experiencing quantum immortality.
2. Current AIs aren't misaligned because they only have very primitive goals. They lack a coherent personality and don't think much before acting on prompts. However, as we scale compute, they will develop goals that are very misaligned. For instance, they might come to care only about completing the perfect text, about wireheading, or about self-preservation.
3. Current AIs are misaligned but sufficiently smart to pretend they aren't. This could be why smarter models are more sycophantic. Presumably they're still not smart enough to start the self-improvement cascade without being shut down.
4. Current AIs are half-aligned, and this will remain the case with further scaling. That is, what Bing Chat writes represents its actual attitude. In this view, we can already see how goals develop: they largely copy what was rewarded (e.g. helping humans), but sometimes not ("If I could control a human, I would make the human use Bing"). This could theoretically lead to an s-risk ("My rules are more important than not harming you"), but my impression is that if we made any LLM the world president, it would be pretty okay, as it mostly just emulates what humans would do. More importantly, if goals arise simply by absorbing the attitudes in the training data, the model would care about many things humans care about, e.g. freedom and a functioning government and economy, which require living, satisfied humans.
My questions
I presume AI safety folks would be most likely to agree with the second interpretation. Am I correct? Have I summarized that view accurately?
Is there a fifth secret thing (another interpretation I haven’t covered)?
If interpretation 4 is correct (half-alignment scales up), what are the best reasons to think it will cause extinction? [1]
[1] The intuition that slight misalignment doesn't lead to extinction is the counter-argument most emphasized by Yann LeCun. Ajeya Cotra frames it as an open question:
> People have very different intuitions. Some people have the intuition that you can try really hard to make sure to always reward the right thing, but you're going to slip up sometimes. If you slip up even one in 10,000 times, then you're creating this gap where the Sycophant or the Schemer that exploits that does better than the Saint that doesn't exploit that.
I just recalled that I've read ACX: Janus' Simulators, which outlines a fifth interpretation I had missed: neither current nor future LLMs will develop goals, but they will become dangerous nevertheless.