Most object-level questions about ML. It’s critical that I use potential application to ML as a guiding constraint in my research. But beyond that, I don’t think it makes sense for me to actually spin up any neural networks, or (at this stage) try to prove theorems about concrete transformer architectures. Certainly someone should be doing that, but there are far more people doing that than doing agent foundations.
I’m confused about this part. My understanding is that you’re trying to build a model of what actual AI agents will be like. But you don’t think that learning more about what current neural networks are like, and how they’re trained, is helpful for that goal? Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?
My impression is that Alex is trying to figure out what things like “optimization” are actually like, and that this analysis will apply to a wider variety of systems than just ML. Which makes sense to me—imo, anchoring too much on current systems seems unlikely to produce general, robust solutions to alignment.
Yeah, this is why we need a better explainer for agent foundations. I won’t do it justice in this comment but I’ll try to say some helpful words. (Have you read the Rocket Alignment Problem?)
> Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?
I can give an easy “no” to this question. I do not necessarily expect future AIs to work in a whole new paradigm.
> My understanding is that you’re trying to build a model of what actual AI agents will be like.
This doesn’t really describe what I’m doing. I’m trying to help figure out what AIs we should build, so I’m hoping to affect what actual AI agents will be like.
But more of what I’m doing is trying to understand what the space of possible agents looks like at all. I can see how that could sound like someone saying, “it seems like we don’t know how to build a safe bridge, so I’m going to start by trying to understand what the space of possible configurations of matter looks like at all,” but I do think it’s different from that.
Let me try putting it this way. The arguments that AI could be an existential risk were formed before neural networks were obviously useful for anything. So the inherent danger of AIs does not come from anything particular to current systems. These arguments use specific properties about the general nature of intelligence and agency. But they are ultimately intuitive arguments. The intuition is good enough for us to know that the arguments are correct, but not good enough to help us understand how to build safe AIs. I’m trying to find the formalization behind those intuitions, so that we have any chance of building a safe thing. Once we get some formal results about how powerful AIs could be safe even in principle, we can start thinking about how to build versions of existing systems that have those properties. (And yes, that’s a really long feedback loop, so I try to regularly check that my lines of thinking could still, in principle, apply to ML systems.)
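(To make “formalization behind those intuitions” slightly more concrete, here is a toy sketch, purely illustrative and not the actual research agenda being discussed: one architecture-agnostic move is to call a process an “optimizer” if, from a broad set of starting states, it reliably steers the world into a small target set. The function names below are hypothetical.)

```python
import random

def is_optimizer(step, start_states, target_set, horizon=100):
    """Fraction of start states that the process drives into the target set within the horizon."""
    hits = 0
    for state in start_states:
        for _ in range(horizon):
            state = step(state)
        if state in target_set:
            hits += 1
    return hits / len(start_states)

def nudge_toward_zero(state):
    # A trivial thermostat-like process that moves an integer state toward 0.
    if state > 0:
        return state - 1
    if state < 0:
        return state + 1
    return state

starts = [random.randint(-50, 50) for _ in range(1000)]
print(is_optimizer(nudge_toward_zero, starts, target_set={0}))  # ~1.0
```

Note that nothing in this toy definition mentions neural networks, gradient descent, or any particular architecture, which is the sense in which the analysis is meant to apply to a wider space of possible agents than just current ML systems.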