So why do people have more trouble thinking that someone could understand the world through pure text than through pure vision? I think people's different treatment of these two cases, vision and language, may come from a kind of poverty-of-stimulus intuition: overgeneralizing from cases in which we have only a small amount of text. It's true that if I just tell you that all qubos are shrimbos, and all shrimbos are tubis, you'll be left in the dark about all of these terms. But that intuition doesn't necessarily scale up to a situation in which you are learning across billions of instances of words and come to understand their vastly complex patterns of co-occurrence with such precision that you can predict the next word with great accuracy.
GPT cannot “predict the next word with great accuracy” for arbitrary text, the way that a physics model can predict the path of a falling or orbiting object for arbitrary objects. For example, neither you nor any language model (including future language models, unless they have training data pertaining to this LessWrong comment) can predict that the next word, or the following sequence of words making up the rest of this paragraph, will be:
first, a sentence about what beer I drank yesterday and what I am doing right now—followed by some sentences explicitly making my point. The beer I had was Yuengling and right now I am waiting for my laundry to be done as I write this comment. It was not predictable that those would be the next words because the next sequence of words in any text is inherently highly underdetermined—if the only information you have is the prompt that starts the text. There is no ground truth, independent of what the person writing the text intends to communicate, about what the correct completion of a text prompt is supposed to be.
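To make the underdetermination point a bit more concrete, here is a minimal sketch (assuming the Hugging Face transformers and torch packages and the publicly available gpt2 checkpoint; the prompt is just an illustrative stand-in I chose) that looks at the entropy of a language model's own next-token distribution after an open-ended prompt:

```python
# Minimal sketch: how spread out is the model's next-token distribution for an
# open-ended prompt? Assumes the `transformers` and `torch` packages and the
# public "gpt2" checkpoint; this is an illustration, not a claim about any
# particular model discussed above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The next thing I want to tell you about is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    next_token_logits = model(input_ids).logits[0, -1]  # scores over the vocabulary

dist = torch.distributions.Categorical(logits=next_token_logits)
print(f"next-token entropy: {dist.entropy().item():.2f} nats")

# Show the few most likely next tokens and their probabilities.
top = torch.topk(dist.probs, k=5)
for p, idx in zip(top.values, top.indices):
    token = tokenizer.decode([idx.item()])
    print(f"{token!r:>12}  p={p.item():.3f}")
```

For open-ended prompts the probability mass is typically spread across many plausible tokens rather than concentrated on one correct continuation, which is the sense in which the completion is underdetermined by the prompt alone.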
Consider a kind of naive empiricist view of learning, in which one starts with patches of color in a visual field and slowly infers an underlying universe of objects through their patterns of relations and co-occurrence. Why is this necessarily any different or more grounded than learning by exposure to a vast language corpus, wherein one also learns by gaining insight into the relations of words and their co-occurrences?
Well, one thing to note is that actual learning (in humans at least) involves not only getting data from vision, but also interacting with the world and getting information from multiple senses.
But the real reason I think the two are importantly different is that visual data about the world is closely tied to the way the world actually is—in a simple, straightforward way that does not require any prior knowledge about human minds (or any minds or other information processing systems) to interpret. For example, if I see what looks like a rock, and then walk a few steps and look back and see what looks like the other side of the rock, and then walk closer and it still looks like a rock, the most likely explanation for what I am seeing is that there is an actual rock there. And if I still have doubts, I can pick it up and see if it feels like a rock or drop it and see if it makes the sound a rock would make. The totality of the data pushes me towards a coherent “rock” concept and a world model that has rocks in it—as this is the simplest and most natural interpretation of the data.
By contrast, there is no reason to think that the existence of humans, with the kinds of minds we have, living in our actual world and using written language for the range of purposes we use it for, is the simplest, most likely, or most easily converged-to explanation for why a large corpus of text exists.
From our point of view, we already know that humans exist and use language to communicate and as part of each human’s internal thought process, and that large numbers of humans over many years wrote the documents that became GPT’s training data.
But suppose you were something that didn't start out knowing (or having any evolved, instinctive expectation) that humans exist, or that minds or computer programs or other data-generating processes exist, and you just received GPT's training data as a bunch of meaningless-at-first-glance tokens. There is no reason to think that building a model of humans and the world humans inhabit (as opposed to something like a Markov model, a stochastic physical process, or some other model far less complicated than humans) would be the simplest way to make sense of the patterns in that data.
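As a concrete illustration of what such a less-complicated-than-humans model can look like, here is a minimal sketch of a bigram Markov model (the tiny corpus string is a stand-in for real training text): it picks up co-occurrence patterns well enough to assign next-token probabilities, while containing nothing that corresponds to humans, minds, or a world.

```python
# Minimal sketch of a bigram Markov model over a token stream. The tiny
# `corpus` string is a stand-in for real training text. The model tracks only
# which token tends to follow which; it represents nothing about the people
# or the world that produced the text.
from collections import Counter, defaultdict

corpus = "the rock fell . the rock is heavy . the bird fell ."
tokens = corpus.split()

counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Conditional distribution P(next | prev) estimated from raw counts."""
    following = counts[prev]
    total = sum(following.values())
    return {tok: c / total for tok, c in following.items()}

print(next_token_probs("the"))   # e.g. {'rock': 0.666..., 'bird': 0.333...}
print(next_token_probs("rock"))  # e.g. {'fell': 0.5, 'is': 0.5}
```

The point of the sketch is only that surface statistics of this kind are real and exploitable; nothing in the counting procedure forces the learner toward a model of the people who wrote the text.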