How is this clever JavaScript code the most likely text continuation of the human’s question? GPT-N outputs text continuations: unless the human input is “here is malicious javascript code, which hijacks the browser when displayed and takes over the world: … ”, GPT-N will not output anything like it. In fact, such code is quite hard to write and is not what a human would typically write in response to that question, so you’d need to do some really hard work to get something like GPT-N (assuming the same training setup as GPT-3) to actually output malicious code. Of course, some idiot might in fact ask that question, and then we’re screwed.
For one, I don’t think it’s reasonable to assume that a future GPT-N will only work in a “text continuation setup”.
But also, what would happen if you had it read a database of exploits and asked it to evade the API, or to “create a new exploit to break this API”, or something like that?
I don’t work in the field, so this is a genuine question.
Future versions of such models could well work in other ways than text continuation, but this would require new ideas not present in the way these models are currently trained, which is literally by trying to maximise the probability they assign to the true next word in the dataset. I think the abstraction of “GPT-N” is useful if it refers to a simply scaled-up version of GPT-3: no clever additional tricks, no new paradigms, just the same thing with more parameters and more data. If you don’t assume this, then “GPT-N” is no more specific than “Deep Learning-based AGI”, and we can then only talk in very general terms.
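To make that training objective concrete, here is a minimal sketch, assuming a toy bigram model fitted by counting rather than a transformer. The corpus and the names `train_bigram` and `avg_neg_log_likelihood` are made up for illustration; the only thing it shares with GPT-3 is the quantity being optimised, the average negative log-probability assigned to the true next word.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Estimate P(next | prev) by counting (the maximum-likelihood fit for a bigram model)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {word: c / sum(ctr.values()) for word, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def avg_neg_log_likelihood(model, tokens):
    """Average of -log P(true next word | previous word) over the text."""
    losses = [-math.log(model[prev][nxt]) for prev, nxt in zip(tokens, tokens[1:])]
    return sum(losses) / len(losses)

# Toy corpus, invented for this example.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
loss = avg_neg_log_likelihood(model, corpus)
print(model["the"], loss)
```

Lowering this loss means assigning more probability to whatever actually came next in the data. Nothing in the objective rewards the continuation for being correct or helpful.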
Regarding the exploits, you need to massage your question so that GPT-N predicts that the answer you want is the most likely thing a human would write after your question. Over the whole internet, most of the time when someone asks someone else a really hard question, the human who writes the text immediately after that question will either a) be wrong or b) avoid the question. GPT-N isn’t trying to be right. To it, avoiding your question or being wrong is perfectly fine, because that’s what it was trained to output after hard questions.
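A toy illustration of that point, with loud caveats: the reply categories and counts below are invented for the example, not measured from any dataset, and `reply_distribution` and `most_likely_reply` are hypothetical helper names. The sketch only shows that a predictor picking the most probable continuation returns the most common kind of reply, whether or not it is correct.

```python
from collections import Counter

# Hypothetical counts of reply types that follow a hard question in a
# made-up corpus. These numbers are assumptions chosen for illustration.
replies_after_hard_question = Counter({
    "evasive": 55,
    "wrong": 35,
    "correct": 10,
})

def reply_distribution(counts):
    """Turn raw counts into probabilities P(reply type | hard question)."""
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}

def most_likely_reply(counts):
    """Return the reply type with the highest count, i.e. the mode."""
    return max(counts, key=counts.get)

dist = reply_distribution(replies_after_hard_question)
prediction = most_likely_reply(replies_after_hard_question)
print(dist, prediction)
```

Under these assumed counts the predicted reply is the evasive one, even though a correct reply exists in the data.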
To generate such an exploit, you need to convince GPT-N that the text it is being shown actually comes from really competent humans. You might try to frame your question as the beginning of a computer science paper, maybe one written far in the future, with lots of citations, and authored by a collaboration of people GPT-N knows are competent. But then GPT-N might predict that those humans would not publish such a dangerous exploit, so it would yet again evade you. After a bit of trial and error, you might well corner GPT-N into producing what you want, but it will not be easy.
Indeed, I was talking about “Deep Learning-based AGI”. Thank you for the thorough answer.