If you ask GPT-n to produce a design for a fusion reactor, any prompt that talks about fusion is going to lead it to say that a working reactor hasn’t yet been built, or to imitate cranks or works of fiction.
It seems unlikely that a text predictor could pick up enough information about fusion to design a working reactor without also figuring out that humans haven’t made any fusion reactors that produce net power.
If you did somehow get a response, the level of safety you would get is the level a typical human would display (conditional on the prompt). If some information is an obvious infohazard, such that no human capable of coming up with it would share it, then such data won’t be in GPT-n’s training dataset and won’t be predicted. However, the process of conditioning might amplify tiny probabilities of human failure.
Suppose that any easy fusion reactor design could be turned into a bomb, and ignore cranks and fiction. Then suppose 99% of people who invented a fusion reactor would realize this and stay quiet, while the other 1% would write an article that starts with “To make a fusion reactor …”. Because the careful 99% never write such an article, conditioning on that opening selects almost entirely for the careless 1%: the prompt will cause GPT-n to generate the article that a human who didn’t notice the danger would come up with.
This also applies to dangers like leaking radiation, or just blowing up randomly if your materials weren’t pure enough.
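The amplification in the 99%/1% example above can be made explicit with a toy Bayes calculation; all numbers below are made-up assumptions for illustration, not claims from the discussion.

```python
# Toy Bayes calculation of how conditioning on a prompt can amplify a small
# probability of human failure. All numbers here are illustrative assumptions.

p_careless = 0.01                    # fraction of inventors who miss the danger
p_article_given_careless = 1.0       # careless inventors publish the how-to
p_article_given_careful = 0.0        # careful inventors stay quiet

p_article = (p_article_given_careless * p_careless
             + p_article_given_careful * (1 - p_careless))

# P(author was careless | text begins with "To make a fusion reactor ...")
p_careless_given_article = p_article_given_careless * p_careless / p_article

print(p_careless_given_article)      # 1.0 -- the 1% failure mode dominates
```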
You can probably avoid the generation of crank works and fiction by training a new version of GPT in which every training example is labeled with <year of publication> and <subject matter>, which GPT has access to when it predicts an example. So if you then write a prompt and condition on something like <year: 2040> <subject matter: peer-reviewed physics publication>, you can easily tell GPT to avoid fiction and crank works, as well as make it model future scientific progress.
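As a rough illustration of the labeling idea, here is a minimal sketch assuming a generic causal language model; the tag format, field names, and the `model.generate` call are placeholders, not an actual GPT training pipeline.

```python
# Minimal sketch of metadata-conditioned training data, assuming a generic
# causal language model. The tag format, field names, and `model.generate`
# call are placeholders, not an actual GPT training pipeline.

def format_example(text, year, subject):
    """Prepend metadata tags so the model can condition on them."""
    return f"<year: {year}> <subject matter: {subject}>\n{text}"

# Training time: every document gets its (automatically scraped) metadata.
train_example = format_example(
    "We report a tokamak configuration that ...",
    year=2019,
    subject="peer-reviewed physics publication",
)

# Generation time: condition on metadata that excludes cranks and fiction,
# and on a future year to ask the model to extrapolate scientific progress.
prompt = format_example(
    "To make a fusion reactor",
    year=2040,
    subject="peer-reviewed physics publication",
)
print(prompt)
# completion = model.generate(prompt)   # hypothetical model call
```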
Hmm. I’m having a hard time writing this clearly, but I wonder if you could get interesting results by:
Training on a wide range of notably excellent papers from “narrow-scoped” domains,
Training on a wide range of papers that explore “we found this worked in X field, and we’re now seeing if it also works in Y field” syntheses,
Then giving GPT-N prompts to synthesize narrow-scoped domains in which that hasn’t been done yet.
You’d get some nonsense, I imagine, but it would probably at least spit out plausible hypotheses for actual testing, eh?
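For concreteness, here is a hedged sketch of what such a synthesis prompt might look like, reusing the illustrative tag format from above; the field pair, wording, and `model.generate` call are all made up.

```python
# Sketch of the cross-domain synthesis idea above, reusing the illustrative
# tag format. Field pairs, wording, and `model.generate` are all made up.

def synthesis_prompt(field_x, field_y):
    return (
        f"<year: 2040> <subject matter: peer-reviewed {field_y} publication>\n"
        f"Abstract: We apply methods from {field_x} to open problems in {field_y}."
    )

prompt = synthesis_prompt("algebraic topology", "neuroscience")
print(prompt)
# completions = [model.generate(prompt) for _ in range(10)]  # hypothetical;
# most would be nonsense, but a few might suggest testable hypotheses.
```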
The practical problem with that is probably that you need to manually decide which papers go in which category. GPT needs such an enormous amount of data that any curation has to be automated. Metadata like authors, subject, date, and website of provenance is quite easy to obtain for each example, but higher-level labels like “this paper applies the methods of field X in field Y” are really hard.
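To illustrate the asymmetry, here is a rough sketch; the record format, field names, and helper functions are assumptions for illustration only. The shallow metadata can be pulled out mechanically, while the high-level label has no obvious mechanical rule.

```python
# Sketch contrasting the metadata that can be scraped mechanically with the
# high-level labels that can't. Record format and helpers are assumptions.

from urllib.parse import urlparse

def cheap_metadata(record):
    """Fields obtainable automatically for every training example."""
    return {
        "authors": record.get("authors", []),
        "date": record.get("date"),
        "site": urlparse(record["url"]).netloc,
    }

def is_cross_field_application(text):
    """'Applies the methods of field X in field Y' -- there is no simple
    mechanical rule for this; it would need a trained classifier or costly
    human labeling, which is exactly the curation bottleneck."""
    raise NotImplementedError

doc = {"authors": ["A. Researcher"], "date": "2019-06-01",
       "url": "https://example.org/paper.pdf", "text": "..."}
print(cheap_metadata(doc))   # easy part works; the hard part has no easy rule
```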
I’m somewhat hopeful that this is right, but I’m not confident enough to feel we can ignore the risks of GPT-N.
For example, this post makes the argument that, because of GPT’s design and learning mechanism, we need not worry about it coming up with significantly novel things or outperforming humans, because it’s optimizing for imitating existing human writing, not for saying true things. On the other hand, it’s managing to do powerful things it wasn’t trained for, like solving math equations we have no reason to believe it saw in the training set, or writing code it hasn’t seen before. So even if GPT-N isn’t trained to say true things and isn’t really capable of more than humans are, it might still function like a Hansonian em and be dangerous simply by doing what humans can do, only much faster.
Any of the risks of being like a group of humans, only much faster, apply. There are also the mesa alignment issues. I suspect that a sufficiently powerful GPT-n might form deceptively aligned mesa optimisers.
I would also worry that off-distribution attractors could be malign and intelligent.
Suppose you give GPT-n a prompt from outside the training distribution and get it to generate text. Sometimes it might wander back into the distribution; other times it might stay off-distribution. How wide is the border between processes that are safely imitating humans and processes that aren’t performing any significant optimization?
You could get “viruses”, patterns of text that encourage GPT-n to repeat them so they don’t drop out of context. GPT-n already has an accurate world model, a world model that probably models the thought processes of humans in detail. You have all the components needed to create powerful malign intelligences, and a process that smashes them together indiscriminately.
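If one wanted to probe for such “viral” patterns empirically, a toy experiment might look like the sketch below; `generate` is a hypothetical stand-in for sampling a continuation from GPT-n, and the window and step counts are arbitrary.

```python
# Toy probe for "viral" text patterns: repeatedly extend a rolling context
# window and check whether a seed pattern keeps reappearing instead of
# dropping out. `generate` is a hypothetical stand-in for sampling from GPT-n.

def pattern_persists(generate, seed, window=2048, steps=50):
    context = seed
    survivals = 0
    for _ in range(steps):
        continuation = generate(context)              # sample the next chunk
        context = (context + continuation)[-window:]  # rolling context window
        if seed in context:                           # did the pattern survive?
            survivals += 1
    return survivals / steps

# Dummy sampler that always echoes the seed -- a perfect replicator:
print(pattern_persists(lambda ctx: " COPY ME", " COPY ME"))   # -> 1.0
```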