I’d especially like to hear your thoughts on the above proposal of loss-minimizing a language model all the way to AGI. I hope you won’t mind me quoting your earlier self as I strongly agree with your previous take on the matter:

If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer’s, it won’t tell you a cure, it will tell you what humans have said about curing Alzheimer’s … It would just tell you a plausible story about a situation related to the prompt about curing Alzheimer’s, based on its training data. Rather than a logical Oracle, this image-captioning-esque scheme would be an intuitive Oracle, telling you things that make sense based on associations already present within the training set.
What am I driving at here, by pointing out that curing Alzheimer’s is hard? It’s that the designs above are missing something, and what they’re missing is search. I’m not saying that getting a neural net to directly output your cure for Alzheimer’s is impossible. But it seems like it requires there to already be a “cure for Alzheimer’s” dimension in your learned model. The more realistic way to find the cure for Alzheimer’s, if you don’t already know it, is going to involve lots of logical steps one after another, slowly moving through a logical space, narrowing down the possibilities more and more, and eventually finding something that fits the bill. In other words, solving a search problem.
So if your AI can tell you how to cure Alzheimer’s, I think either it’s explicitly doing a search for how to cure Alzheimer’s (or worlds that match your verbal prompt the best, or whatever), or it has some internal state that implicitly performs a search.
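As a deliberately tiny illustration of the “narrowing down the possibilities” framing above, here is a toy search in plain Python; the hidden target, the feedback function, and every name in it are invented purely for this example. Each probe eliminates the candidates inconsistent with the observed feedback, which is the step-by-step narrowing the quote contrasts with one-shot intuitive recall.

```python
from itertools import product

TARGET = (3, 1, 4)  # stands in for the unknown answer the searcher is after

def feedback(guess):
    """How many positions of the guess agree with the hidden target."""
    return sum(g == t for g, t in zip(guess, TARGET))

def search_by_narrowing(digits=range(10), length=3):
    """Repeatedly probe, then discard every candidate inconsistent with the feedback."""
    candidates = list(product(digits, repeat=length))
    while len(candidates) > 1:
        probe = candidates[0]
        score = feedback(probe)
        if score == length:
            return probe
        # Keep only candidates that would have produced the same feedback.
        candidates = [c for c in candidates
                      if sum(p == x for p, x in zip(probe, c)) == score]
    return candidates[0]

print(search_by_narrowing())  # finds (3, 1, 4) after a handful of narrowing steps
```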
Charlie’s quote is an excellent description of an important crux/challenge of getting useful difficult intellectual work out of GPTs.
Despite this, I think it’s possible in principle to train a GPT-like model to AGI or to solve problems at least as hard as humans can solve, for a combination of reasons:
1. I think it’s likely that GPTs implicitly perform search internally, to some extent, and will be able to perform more sophisticated search with scale.
2. It seems possible that a sufficiently powerful GPT trained on a massive corpus of human (medical + other) knowledge will learn better/more general abstractions than humans, so that in its ontology “a cure for Alzheimer’s” is an “intuitive” inference away, even if for humans it would require many logical steps and empirical research. I tend to think human knowledge implies a lot of low-hanging fruit that we have not accessed because of insufficient exploration and because we haven’t compiled our data into the right abstractions. I don’t know how difficult a cure for Alzheimer’s is, or how close it is to being “implied” by the sum of human knowledge. Nor the solution to alignment. And eliciting this latent knowledge is another problem.
3. Of course, the models can do explicit search in simulated chains of thought. And if natural language in the wild doesn’t capture/imply the search process (at the right granularity, with the right directed flow of evidence) that would be useful for attacking a given problem, it is still possible to record or construct data that does. A toy sketch of such an explicit search loop appears below.
But it’s possible that the technical difficulties involved make SSL uncompetitive compared to other methods.
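As a minimal sketch of what “explicit search in simulated chains of thought” could look like mechanically: propose several candidate next steps, score the partial chains, keep the most promising few, and repeat. The functions propose_steps and score_chain below are toy stand-ins invented for this example, not calls to any real model API.

```python
import random

def propose_steps(chain, k=3):
    # Stand-in for sampling k candidate continuations of the reasoning so far.
    return [chain + [f"step{len(chain)}-{i}"] for i in range(k)]

def score_chain(chain):
    # Stand-in for a verifier or heuristic judging how promising a chain looks.
    return random.random() + 0.1 * len(chain)

def search_over_thoughts(depth=4, beam_width=2):
    """Beam search over simulated chains of thought."""
    beams = [[]]  # start from an empty chain
    for _ in range(depth):
        candidates = [c for chain in beams for c in propose_steps(chain)]
        candidates.sort(key=score_chain, reverse=True)
        beams = candidates[:beam_width]  # keep only the most promising chains
    return beams[0]

print(search_over_thoughts())
```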
I also responded to Capybasilisk below, but I want to chime in here and use your own post against you, contra point 2 :P
It’s not so easy to get “latent knowledge” out of a simulator—it’s the simulands who have the knowledge, and they have to be somehow specified before you can step forward the simulation of them. When you get a text model to output a cure for Alzheimer’s in one step, without playing out the text of some chain of thought, it’s still simulating something to produce that output, and that something might be an optimization process that is going to find lots of unexpected and dangerous solutions to questions you might ask it.
Figuring out the alignment properties of simulated entities running in the “text laws of physics” seems like a challenge. Not an insurmountable challenge, maybe, and I’m curious about your current and future thoughts, but the sort of thing I want to see progress in before I put too much trust in attempts to use simulators to do superhuman abstraction-building.
If I were trying to have a human researcher cure Alzheimer’s, I’d give them a laboratory, lab assistants, a notebook, and likely also a computer. Similarly, if I wanted a simulacrum of a human researcher (or a great many simulacra of human researchers) to have a good chance of solving Alzheimer’s, I’d give them access to functionally equivalent resources, facilities, and tools, crucially including the ability to design, carry out, and analyze the results of experiments in the real world.
Ah, the good old days post-GPT-2 when “GPT-3” was the future example :P
I think back then I still thoroughly underestimated how useful natural-language “simulation” of human reasoning would be. I agree with janus that we have plenty of information telling us that yes, you can ride this same training procedure to very general problem solving (though I think more modalities, active learning, etc. will be incorporated before anyone really pushes brute-force “GPT-N go brrr” to the extreme).
This is somewhat of a concern for alignment. I more or less stand by that comment you linked and its children; in particular, I said
The search thing is a little subtle. It’s not that search or optimization is automatically dangerous—it’s that I think the danger is that search can turn up adversarial examples / surprising solutions.
I mentioned how I think the particular kind of idiot-proofness that natural language processing might have is “won’t tell an idiot a plan to blow up the world if they ask for something else.” Well, I think that as soon as the AI is doing a deep search through outcomes to figure out how to make Alzheimer’s go away, you lose a lot of that protection and I think the AI is back in the category of Oracles that might tell an idiot a plan to blow up the world.
Simulating a reasoner who quickly finds a cure for Alzheimer’s is not by default safe (even though simulating a human writing in their diary is safe). Optimization processes that quickly find cures for Alzheimer’s are not humans; they must be doing some inhuman reasoning, and they’re capable of having lots of clever ideas with tight coupling to the real world.
I want to have confidence in the alignment properties of any powerful optimizers we unleash, and I imagine we can gain that confidence by knowing how they’re constructed, and trying them out in toy problems while inspecting their inner workings, and having them ask humans for feedback about how they should weigh moral options, etc. These are all things it’s hard to do for emergent simulands inside predictive simulators. I’m not saying it’s impossible for things to go well, I’m about evenly split on how much I think this is actually harder, versus how much I think this is just a new paradigm for thinking about alignment that doesn’t have much work in it yet.
I think talking of “loss minimizing” is conflating two different things here. Minimizing training loss aligns the model with the alignment target given by the training dataset. But the Alzheimer’s example is not about that; it’s about some sort of reflective equilibrium loss, harmony between the model and hypothetical queries it could in principle encounter but didn’t encounter in the training dataset. The latter is also a measure of robustness.
Prompt-conditioned behaviors of a model (in particular, behaviors conditioned by the presence of a word, or the name of a character) could themselves be thought of as models, represented in the outer unconditioned model. These specialized models (trying to channel particular concepts) are not necessarily adequately trained, especially if they specialize in phenomena that were not explored in the episodes of the training dataset. The implied loss for an individual concept (specialized prompt-conditioned model) compares the episodes generated in its scope by all the other concepts of the outer model, to the sensibilities of the concept. Reflection reduces this internal alignment loss by rectifying the episodes (bargaining with the other concepts), by changing the concept to anticipate the episodes’ persisting deformities, or by shifting the concept’s scope to pay attention to different episodes. With enough reflection, a concept is only invoked in contexts to which it’s robust, where its intuitive model-channeled guidance is coherent across the episodes of its reflectively settled scope, providing acausal coordination among these episodes in its role as an adjudicator, expressing its preferences.
So this makes a distinction between search and reflection in responding to a novel query, where reflection might involve some sort of search (as part of amplification), but its results won’t be robustly aligned before reflective equilibrium for the relevant concepts is established.
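One very rough way to write down the “implied loss for an individual concept” sketched above; all notation here is mine and introduced only for illustration. Let $M$ be the outer unconditioned model, $M_c$ the prompt-conditioned behavior channeling concept $c$, and $S_c$ the distribution of contexts in $c$’s scope. Then the internal alignment loss might look something like

$$\mathcal{L}(c) \;=\; \mathbb{E}_{x \sim S_c}\Big[\, D_{\mathrm{KL}}\big(\, M(\cdot \mid x) \,\big\|\, M_c(\cdot \mid x) \,\big) \Big],$$

where $M(\cdot \mid x)$ is the distribution over episodes the rest of the model’s concepts generate in $c$’s scope and $M_c(\cdot \mid x)$ encodes $c$’s “sensibilities”. On this reading, reflection can reduce $\mathcal{L}(c)$ by changing the generated episodes, by changing $M_c$ itself, or by changing the scope $S_c$, matching the three mechanisms listed above.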