I'm somewhat hopeful that this is right, but I'm not confident enough that I feel we can ignore the risks of GPT-N.
For example, this post argues that, because of GPT's design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans: it's optimizing for imitating existing human writing, not for saying true things. On the other hand, it manages to do powerful things it wasn't trained for, like solving math equations we have no reason to believe were in its training set or writing code it hasn't seen before. So even if GPT-N isn't trained to say true things and isn't really capable of more than humans are, it might still function like a Hansonian em and be dangerous simply by doing what humans can do, only much faster.
Any of the risks of being like a group of humans, only much faster, apply. There are also the mesa-alignment issues: I suspect that a sufficiently powerful GPT-n might form deceptively aligned mesa-optimisers.
I would also worry that off-distribution attractors could be malign and intelligent.
Suppose you give GPT-n a prompt that is off the training distribution, and get it to generate text from that prompt. Sometimes it might wander back into the distribution; other times it might stay off-distribution. How wide is the border between processes that are safely imitating humans and processes that are instead performing significant optimization of their own?
You could get “viruses”, patterns of text that encourage GPT-n to repeat them so they don’t drop out of context. GPT-n already has an accurate world model, a world model that probably models the thought processes of humans in detail. You have all the components needed to create powerful malign intelligences, and a process that smashes them together indiscriminately.
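To make the "virus" idea concrete, here is a minimal toy simulation (my own sketch, saying nothing about GPT-n's actual internals): a fake "model" that, whenever a self-promoting token is visible in its fixed-size context window, tends to copy it. The token name SPREAD_ME, the window size, and the copy probability are all invented for illustration; the point is just that any pattern which raises the probability of its own repetition stays in the window while everything else drops out.

```python
import random

CONTEXT_SIZE = 16  # hypothetical context-window length, in tokens

def toy_model(context):
    # Stand-in for GPT-n, not a real model: if the "viral" token is visible
    # in the context window, copy it with probability 0.5; otherwise emit
    # an ordinary filler token.
    if "SPREAD_ME" in context and random.random() < 0.5:
        return "SPREAD_ME"
    return random.choice(["the", "cat", "sat", "on", "mat"])

def generate(prompt, steps=200):
    tokens = list(prompt)
    for _ in range(steps):
        window = tokens[-CONTEXT_SIZE:]  # only the most recent tokens are visible
        tokens.append(toy_model(window))
    return tokens

out = generate(["SPREAD_ME", "hello", "world"])
tail = out[-CONTEXT_SIZE:]
print(f"'SPREAD_ME' fills {tail.count('SPREAD_ME')}/{CONTEXT_SIZE} slots of the final window")
```

Running this, the filler tokens from the prompt are long gone by the end, but the self-copying token still occupies a large fraction of the final window, because each copy tends to spawn further copies before it scrolls out. The worry is the analogous dynamic in a far more capable text generator.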