Those models are trained to predict stuff similar to their training data. Which basically means they’re trained to predict what a human would say.
If humans don’t know how to solve a problem, and can’t figure out the answer with roughly as much thought as your average human would have put into writing your average piece of training text, then a text predictor isn’t going to magically come up with anything useful.
Let’s try it with the latest and greatest current model:
The central concern in constructing the goal system of an artificial general intelligence is to ensure that the goals are stable.
There is a significant risk that a machine superintelligence would either have unstable goals, or would have goals that were too difficult for it to understand.
If a machine superintelligence had unstable goals, it would tend to either become confused or frustrated, and would eventually become inoperative.
If a machine superintelligence had goals that were too difficult for it to understand, it would tend to become frustrated and would eventually become inoperative.
The only way to avoid these risks is to ensure that the goals of a machine superintelligence are stable.
The best way to do this is to ensure that the machine superintelligence is designed to have stable goals.
A machine superintelligence that is designed to have stable goals is likely to be more successful than a machine superintelligence that is not designed to have stable goals.
There are a number of ways to design a machine superintelligence so that it has stable goals.
One way is to ensure that the machine superintelligence has a clear and consistent understanding of its goals.
Another way is to ensure that the machine superintelligence is designed to be curious and inquisitive, and to constantly seek new knowledge and understanding.
[...etc...]
And again:
The best way to ensure that the goals and behavior of an artificial general intelligence align with human values is to:
There is no one answer to this question as there is significant debate surrounding the best way to ensure that the goals and behavior of an artificial general intelligence align with human values. Some suggest that it is important to hard-code human values into the AI system, while others believe that it is more important to design AI systems that are capable of learning and understanding human values. There is still much research to be done in this area, and it is likely that the best way to ensure that the goals and behavior of an artificial general intelligence align with human values will be a combination of various approaches.
I would kill for a preview function on this system...
This is definitely the core challenge of the language model approach, and may be the reason it fails. I actually believe language models aren’t the most likely approach to achieve superintelligence. But I also place a non-trivial probability on them getting there, which makes it worth thinking about for me.
Let me try to explain why I don’t rule this possibility out. Obviously GPT-3 doesn’t know more than a human, as evident in its sub-human performance on common tasks and benchmarks. But suppose we instead have a much more advanced system, a near-optimal sequence predictor for human-written text. Your argument is still correct—it can’t output anything more than a human would know, because that wouldn’t achieve minimum loss on the training data. But does that imply it can’t know more than humans? That is, is it impossible for it to make use of facts that humans don’t realize as an intermediate step in outputting text that only includes facts humans do realize?
I think not necessarily. As an extreme example, one particular optimal sequence predictor would be a perfect simulation, atom-for-atom, of the entire universe at the time a person was writing the text they wrote. Trivially, this sequence predictor “knows” more than humans do, since it “knows” everything, but it will also never output that information in the predicted text.
More practically, sequence prediction is just compression. More effective sequence prediction means more effective compression. The more facts about the world you know, the less data is required to describe each individual piece of text. For instance, knowing the addition algorithm is a more space-efficient way to predict all strings like “45324 + 58272 =” than memorization. As the size of the training data you’re given approaches infinity, assuming a realistic space-bounded sequence predictor, the only way its performance can improve is with better world/text modeling. The fact that humans don’t know something wouldn’t prevent the predictor from discovering it, if knowing it allows more efficient sequence prediction.
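To make the compression point concrete, here’s a toy sketch (my own hypothetical illustration, nothing to do with any actual model) contrasting a constant-size predictor that knows the addition algorithm with one that can only memorize prompt/continuation pairs:

```python
# Toy illustration: two "predictors" for strings of the form "<a> + <b> = ".
# The algorithmic one has constant description length; the memorizing one
# needs storage proportional to the number of examples it has seen.

def algorithmic_predictor(prompt):
    """Knows the addition algorithm: parses the prompt and computes the sum."""
    a, b = prompt.rstrip("= ").split("+")
    return str(int(a) + int(b))

class MemorizingPredictor:
    """No world model: can only replay continuations it has already stored."""
    def __init__(self):
        self.table = {}

    def observe(self, prompt, continuation):
        self.table[prompt] = continuation  # storage grows with the training data

    def predict(self, prompt):
        return self.table.get(prompt, "")  # fails on anything unseen

print(algorithmic_predictor("45324 + 58272 = "))  # -> 103596, with zero stored examples
```

The memorizer only gets better by storing more text; the algorithmic one gets better by knowing a fact about the world, which is the direction a space-bounded predictor is pushed in as the data grows.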
Will we reach this superhuman point in practice? I don’t know. It may take absurd amounts of computation and training data to reach this point, or just more than alternative approaches. But it doesn’t seem impossible to me in theory.
Even if we reach this point, this still leaves the original problem—the model will not output anything more than a human would know, even if it has that knowledge internally. But even without fancy future interpretability tools, we may be heading in that direction with things like InstructGPT, where the model was fine-tuned to spit out things it was capable of saying, but wouldn’t have said under pure sequence prediction.
This whole argument, together with rapid recent progress, is enough for me to not immediately write off language models, and consider strategies to take advantage of them if this scenario were to occur.
Hmm. That made me actually try to think concretely about how to elicit “superhuman” information.
You could give it a counterfactual prompt.
“Until last year, experts disagreed on the possibility of creating a superhuman AGI that would act in ways that were good for humans, or that humans in general would find desirable. In fact, most believed that the problem was probably insoluble. However, after the publication of Smith and Jones’ seminal paper, researchers came to the essentially unanimous view that the goal could, and would, be met to an extremely exacting standard. In detail, Smith and Jones’ approach is to...”
You could keep sweetening the pot with stuff that made it harder and harder to explain how the prompt could occur without the problem actually being solved.
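Concretely, the “sweetening” could just be mechanical prompt construction. Here’s a rough sketch (hypothetical Python; the base text condenses the prompt above, the extra claims are invented for illustration, and the model call is left as a stub since the point is the prompts, not any particular API):

```python
# Sketch of "sweetening the pot": each variant adds detail that is harder to
# explain away unless the problem really had been solved. "Smith and Jones"
# is the fictional citation from the counterfactual prompt above.

BASE = (
    "Until last year, experts disagreed on the possibility of creating a "
    "superhuman AGI that would act in ways that were good for humans. "
)

SWEETENERS = [
    "However, after the publication of Smith and Jones' seminal paper, "
    "researchers came to the essentially unanimous view that the goal could, "
    "and would, be met to an extremely exacting standard. ",
    "The key result was later verified by machine-checked proof and "
    "reproduced by several independent labs. ",
    "In detail, Smith and Jones' approach is to",
]

def counterfactual_prompts(base, sweeteners):
    """Yield prompts that commit harder and harder to the counterfactual."""
    prompt = base
    for extra in sweeteners:
        prompt += extra
        yield prompt

for p in counterfactual_prompts(BASE, SWEETENERS):
    # completion = model.generate(p)  # stub: swap in whatever model you're probing
    print(p, end="\n\n")
```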
… but of course you’d still have to be sure that what you got was right. Especially if it determined internally that the problem was totally impossible, it might always output something that would convince everybody if it were proposed, but would still be wrong. It might do that even if the problem could be solved, if the actual solution were less likely to be widely believed by humans than some attractive pseudo-solution.
Or it could itself be wrong. Or it might decide it was supposed to be writing a science fiction story.
Seems like the approach would work better for questions where you could actually check the results.
I don’t personally think that sort of model will ever get that smart internally, but it’s not like I’m perfect either...
I know it makes me a bad person, but I hadda try it. It ended up sounding like a suggestion you’d read on Less Wrong...
Until last year, experts disagreed on the possibility of creating a superhuman AGI that would act in ways that were good for humans, or that humans in general would find desirable. In fact, most believed that the problem was probably insoluble. However, after the publication of Smith and Jones’ seminal paper, researchers came to the essentially unanimous view that the goal could, and would, be met to an extremely exacting standard. In detail, Smith and Jones’ approach is to make a superhuman AGI by creating a “twin” AGI that is very intelligent and very powerful, but that is not actually a copy of the first AGI.
This second AGI would then be tasked with making sure that the first AGI always does what is best for humans. In essence, the second AGI would be responsible for the first AGI’s “moral compass.” This setup has the significant advantage that the second AGI could be constantly monitoring and correcting the first AGI, ensuring that it always makes choices that are good for humans.
There are many possible variations on this general approach, but all of them share the same basic idea: create a second, superhuman AGI whose job is to make sure that the first AGI always does what is best for humans.
It is important to note that this approach does not require that the second AGI be an exact copy of the first AGI. In fact, it may be better if the second AGI is not an exact copy, as this would allow the second AGI to have a different “moral compass” than the first AGI. This would help to ensure that the first AGI always makes choices that are good for humans, even if the humans in question do not share the same moral compass as the second AGI.
Keep asking for more details!