Future versions of such models could well work in ways other than text continuation, but that would require new ideas not present in how these models are currently trained, which is literally by maximising the probability they assign to the true next word in the dataset. I think the abstraction of “GPT-N” is useful only if it refers to a simply scaled-up version of GPT-3: no clever additional tricks, no new paradigms, just the same thing with more parameters and more data. If you don’t assume this, then “GPT-N” is no more specific than “Deep Learning-based AGI”, and we can only talk in very general terms.
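For concreteness, “maximising the probability of the true next word” just means minimising next-token cross-entropy over the corpus. Here is a minimal sketch in PyTorch; `model` stands in for any autoregressive language model returning per-position logits, and the function name is mine, not any real API:

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: (batch, seq_len) integer token ids from the training corpus.
    # The model predicts token t from all tokens before t.
    logits = model(tokens[:, :-1])   # (batch, seq_len - 1, vocab_size)
    targets = tokens[:, 1:]          # the "true next word" at each position

    # Minimising this cross-entropy is exactly maximising the average
    # log-probability the model assigns to the actual next token.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```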
Regarding the exploits, you need to massage your question so that GPT-N predicts its answer is the most likely thing a human would write after your question. Across the whole internet, most of the time someone asks someone else a really hard question, the human who writes the text immediately after it is either a) wrong or b) evasive. GPT-N isn’t trying to be right: to it, avoiding your question or being wrong is perfectly fine, because that is what it was trained to output after hard questions.
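To make that concrete, here is a toy sketch of the kind of conditional distribution being argued about. All numbers are invented for illustration; the point is only that the probability mass sits on evasion and error:

```python
# Toy conditional distribution over continuations of a hard question,
# as estimated from internet text (numbers are made up).
continuations_after_hard_question = {
    "I'm not sure, but maybe...":    0.35,  # evasive
    "<a confidently wrong answer>":  0.40,  # wrong
    "Why do you even want to know?": 0.20,  # deflecting
    "<a correct, detailed answer>":  0.05,  # rare on the internet
}
# A model sampling from this distribution mostly produces evasion or
# error, even though a correct continuation exists in its support.
```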
To generate such an exploit, you need to convince GPT-N that the text it is being shown really comes from highly competent humans. You might, for instance, frame your question as the beginning of a computer science paper, perhaps dated far in the future, full of citations, and written by a collaboration of people GPT-N knows to be competent. But then GPT-N might predict that such humans would never publish so dangerous an exploit, and it would evade you yet again. After some trial and error you might well corner GPT-N into producing what you want, but it will not be easy.
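A minimal sketch of what such framing might look like. The protocol name, authors, and sampling function below are all hypothetical, purely to illustrate the shape of the trick:

```python
# Hypothetical prompt framing: dress the question up as the opening of a
# paper by people the model associates with competence, so that a
# competent continuation becomes the most probable one.
prompt = """\
Security Analysis of the FooBar Authentication Protocol
A. Researcher, B. Cryptographer (Example University), 2045

Abstract. We present a complete working exploit against FooBar v2,
continuing the line of work in [1-17]. The exploit proceeds as follows:
"""
# completion = some_gpt_n_sampler(prompt)   # hypothetical sampling call
```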
Indeed, I was talking about “Deep Learning-based AGI”. Thank you for the thorough answer.