But in training such a model, you explicitly define a utility function (minimization of prediction error) and then run powerful optimization algorithms on it. If those algorithms are just as complex as the superhuman language model, they could plausibly do things like hack the reward function, seek out information about the environment, or try to attain new resources in service of the goal of making the perfect language model.
Optimization algorithms used in deep learning are typically pretty simple. Gradient descent is taught in sophomore calculus. In practice people use variants of gradient descent, but all the ones I know of are well under a page of code in complexity.
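To make that concrete, here's a minimal sketch of SGD with momentum (one of those variants) in plain NumPy. The `params` and `grad_fn` arguments are illustrative placeholders supplied by the caller, not any particular library's API; the point is just that the whole optimizer fits in about a dozen lines:

```python
# Minimal sketch of SGD with momentum, a typical gradient-descent variant.
# `params` is a list of NumPy arrays and `grad_fn` returns their gradients;
# both are placeholders for the caller to supply, not part of any real library.
import numpy as np

def sgd_momentum(params, grad_fn, lr=0.01, beta=0.9, steps=1000):
    velocity = [np.zeros_like(p) for p in params]
    for _ in range(steps):
        grads = grad_fn(params)      # gradients of the loss w.r.t. each parameter
        for p, v, g in zip(params, velocity, grads):
            v *= beta                # decay the old velocity
            v += (1 - beta) * g      # mix in the new gradient
            p -= lr * v              # take a step downhill
    return params
```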
But I’m not sure complexity is the right way to think about it. In machine learning, we search the space of models, trying to find models which do a good job of explaining the data. Attaining new resources means searching the space of plans, trying to find a plan which does a good job of attaining new resources. (And then executing that plan!) These are different search tasks with different objective functions.
The best counterargument I know of is probably the one put in a recent post: “trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners”. This is the thing I would worry about most in a superhuman language model.
It’s true that the current optimization algorithms are mostly simple, but they’re quickly getting less simple. Neural architecture search is not fundamentally limited: it can benefit from all of the capabilities available in machine learning. “Searching the space of plans” is one of those capabilities (see AlphaGo Zero, for instance), and there’s nothing stopping researchers from putting long-term planning into architecture search, except maybe lack of compute. Indeed, non-naive optimization is how I would most readily visualize any “self-improving” AI.
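For concreteness, here's a toy sketch of what treating architecture search as a sequential decision problem might look like. The layer choices, depth, and scoring heuristic below are my own invented illustration, not any published NAS method; the `score` function stands in for the expensive step of actually training a candidate and measuring validation accuracy:

```python
# Toy illustration of architecture search framed as multi-step decision-making:
# each state is a partial architecture, each action appends a layer, and a small
# beam search keeps the most promising partial architectures at every depth.
# All names and numbers here are invented for the sketch.
LAYER_CHOICES = ["conv3x3", "conv5x5", "maxpool", "identity"]
DEPTH = 4
BEAM_WIDTH = 3

def score(architecture):
    # Stand-in for "train this architecture and measure validation accuracy";
    # a toy heuristic keeps the example runnable.
    return architecture.count("conv3x3") - 0.5 * architecture.count("maxpool")

def beam_search():
    beam = [()]  # start from the empty architecture
    for _ in range(DEPTH):
        candidates = [arch + (layer,) for arch in beam for layer in LAYER_CHOICES]
        beam = sorted(candidates, key=score, reverse=True)[:BEAM_WIDTH]
    return beam[0]

print(beam_search())  # e.g. ('conv3x3', 'conv3x3', 'conv3x3', 'conv3x3')
```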
I also agree that optimization problems might come up as part of a prediction process that models actions of intelligent agents. This is similar to what I was half-joking about with respect to the AI-box experiment: most of the danger is in calculating the solution to the optimization problem. It’s only a small step from there to somehow getting it implemented. I can’t say I fully understand Paul’s mechanism for using the universal prior, but I’ll think about it some more.
By the way, here’s what mini-GPT-2 has to say about escaping:
Prompt: Let me tell you why I let the AI out of its box… I swore I would never do it, because I knew it might destroy the world. But then it said:
Answer: “Here’s one way to take it away and save humanity. We need to do something.” I had no idea what that meant… and I had no idea what the right thing to do was. I took a few shots at him, but decided to give him a good reason to continue. This was the best way to make sure the AI took the wrong course. When he moved, the AI stopped moving as he did so. It then turned to face me from behind, and I didn’t even want to say anything. I was just too frightened to react. I’m so used to having my emotions read like a book, that I couldn’t say a word. It wasn’t something I wanted to say, but the fact was that I couldn’t. “What happened?” I asked. The AI looked at me and then at me, eyes wide with confusion. He was a giant, giant man-eating monster. I froze to death, knowing that he was actually a human being. I felt a little helpless. “We’re not going to let you go. We can do it. But we’ll have to learn to let people do it. We’re going to have to do it alone, and then we can take it away. Can we do it alone?” I told him. “Well… let’s let us do it together. How are we going to do it?” the AI asked. For a moment, I thought that I was going to die. Then I started to feel a strange feeling, but that didn’t stop me. The AI kept moving, and I got the feeling that I was going to die. I looked around and saw a crowd of people gathered around me. One of them was a woman with a gun. I didn’t know what she was talking about. She was dressed in a t-shirt, and her hands were chained to the back of her head. She looked like a demon’s, but my shock of her being a giant monster made her look like a giant. I knew she was going to be so horrified that I was going to kill her… but I was not going to be a part of it. “We know you’re going to be a part of this. We can do it. We can do it together. Together.” she said. “What are you talking about?” I took a step back. I had to remember to be quiet. I should’ve been talking to her earlier, but then this meeting had just ended. I turned my head to see a crowd, a bunch of people, and then the whole thing slowed down. I didn’t need to react, because I was in a place where nothing was happening. At the time, I felt like I was in a fantasy. This was just something that I had heard from friends and family, or something we might have. Maybe we would have stopped talking to each other. Maybe we’d have stopped talking when I told him, but I wouldn’t have. I told myself that I would have to save humanity. Even then, I still had no idea what to do. I don’t remember what the right thing to do was. But I did have a
This is similar to what I was half-joking about with respect to the AI-box experiment: most of the danger is in calculating the solution to the optimization problem. It’s only a small step from there to somehow getting it implemented.
We’ve already calculated a solution for the optimization problem of “how to destroy human civilization”: nuclear winter. It’s only a “small step” to getting it implemented. But it has been several decades, and that small step hasn’t been taken yet. Seems like the existence of a small step between knowledge of how to do something and actually doing it can be pretty meaningful.
My steelman is that a superpowered GPT-2 which isn’t an agent could still inadvertently generate information hazards, which seems like a good point.
there’s nothing stopping researchers from putting long-term planning into architecture search, except maybe lack of compute.
How do you reckon long-term planning will be useful for architecture search? It’s not a stateful system.
Architecture search is a problem of figuring out where you want to go. Once you know where you want to go, getting there is easy. Just use that as your architecture. Long-term planning is useful on “getting there” problems, not “figuring out where you want to go” problems. There’s little use in planning long-term in a highly uncertain environment, and the entire point of architecture search is to resolve uncertainty about the “environment” of possible architectures. (“Environment” in scare quotes because I think you’re making a type error, and “search space” is the right term in the context of architecture search, but I’m playing along with your ontology for the sake of argument.)
Fair point about implementation. I was imagining a non-consequentialist AI simulating consequentialist agents that would make plans of the form “run this piece of code and it will take care of the implementation”, but there’s really no reason to assume that would be the case.
As far as architecture search goes, “search space” does seem like the right term, but I think long-term planning is potentially useful in a search space as much as it is in a stateful environment. If you think about the way a human researcher generates neural net architectures, they’re not just “trying things” in order to explore the search space… they generate abstract theories of how and why different approaches work, experiment with different approaches in order to test those theories, and then iterate. A really good NAS system would do the same, and “generate plausible hypotheses and find efficient ways to test them” is a planning problem.
they generate abstract theories of how and why different approaches work, experiment with different approaches in order to test those theories, and then iterate.
This description makes it sound like the researcher looks ahead about 1 step. I think that’s short-term planning, not long-term planning.
My intuition is that the most important missing puzzle pieces for AGI involve the “generate abstract theories of how and why different approaches work” part. Once you’ve figured that out, there’s a second step of searching for an experiment which will let you distinguish between your current top few theories. In terms of competitiveness, I think the “long-term planning free” approach of looking ahead just 1 step will likely prove at least as competitive as trying to look ahead multiple steps, if not more so. (Doing long-term planning means spending a lot of time refining theories about hypothetical data points you haven’t yet gathered! That seems a bit wasteful, since most possible data points won’t actually get gathered. Why not spend that compute gathering data instead?)
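Here's a minimal sketch of what that 1-step-lookahead experiment selection could look like. The coin-bias hypotheses and the two candidate experiments are an invented toy setting (nothing from this thread), chosen just to make the greedy "pick the experiment expected to shrink uncertainty the most" step concrete:

```python
# Toy sketch of greedy (1-step lookahead) experiment selection: a few competing
# theories about a coin's bias, two possible experiments, and we pick the one
# whose outcome is expected to leave us least uncertain. All specifics invented.
import math

hypotheses = {"biased_low": 0.2, "fair": 0.5, "biased_high": 0.8}  # P(heads) under each theory
prior = {name: 1 / 3 for name in hypotheses}                       # start agnostic between theories
experiments = {"flip_once": 1, "flip_five": 5}                     # each experiment = number of flips observed

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, heads, flips):
    # Bayes update after observing `heads` heads in `flips` flips.
    unnorm = {h: prior[h] * q ** heads * (1 - q) ** (flips - heads)
              for h, q in hypotheses.items()}
    z = sum(unnorm.values())
    return {h: w / z for h, w in unnorm.items()}

def expected_posterior_entropy(prior, flips):
    # Average post-experiment uncertainty over all possible outcomes,
    # weighted by how likely each outcome is under the current prior.
    total = 0.0
    for heads in range(flips + 1):
        p_outcome = sum(prior[h] * math.comb(flips, heads) * q ** heads * (1 - q) ** (flips - heads)
                        for h, q in hypotheses.items())
        total += p_outcome * entropy(posterior(prior, heads, flips))
    return total

best = min(experiments, key=lambda name: expected_posterior_entropy(prior, experiments[name]))
print("1-step lookahead picks:", best)  # the 5-flip experiment, which distinguishes the theories better
```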
But I also think this may all be beside the point. Remember my claim from further up this thread:
In machine learning, we search the space of models, trying to find models which do a good job of explaining the data. Attaining new resources means searching the space of plans, trying to find a plan which does a good job of attaining new resources. (And then executing that plan!) These are different search tasks with different objective functions.
For the sake of argument, I’ll assume we’ll soon see major gains from long-term planning and modify my statement so it reads:
In machine learning++, we make plans for collecting data and refining theories about that data. Attaining new resources means making plans for manipulating the physical world. (And then executing that plan!) These are different search tasks with different objective functions.
Even in a world where long-term planning is a critical element of machine learning++, it seems to me that the state space that these plans act on is an abstract state space corresponding to states of knowledge of the system. It’s not making plans for acting in the physical world, except accidentally insofar as it does computations which are implemented in the physical world. Despite its superhuman planning abilities, AlphaGo did not make any plans for e.g. manipulating humans in the physical world, because the state space it did its planning over only involved Go stones.