Yeah, this is why we need a better explainer for agent foundations. I won’t do it justice in this comment, but I’ll try to say some helpful words. (Have you read The Rocket Alignment Problem?)
Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?
I can give an easy “no” to this question. I do not necessarily expect future AIs to work in a whole new paradigm.
My understanding is that you’re trying to build a model of what actual AI agents will be like.
This doesn’t really describe what I’m doing. I’m trying to help figure out what AIs we should build, so I’m hoping to affect what actual AI agents will be like.
But more of what I’m doing is trying to understand what the space of possible agents looks like at all. I can see how that could sound like someone saying, “it seems like we don’t know how to build a safe bridge, so I’m going to start by trying to understand what the space of possible configurations of matter looks like at all,” but I do think it’s different from that.
Let me try putting it this way. The arguments that AI could be an existential risk were formed before neural networks were obviously useful for anything. So the inherent danger of AIs does not come from anything particular to current systems. These arguments rely on specific properties of the general nature of intelligence and agency. But they are ultimately intuitive arguments. The intuition is good enough for us to know that the arguments are correct, but not good enough to help us understand how to build safe AIs. I’m trying to find the formalization behind those intuitions, so that we have any chance at building a safe thing. Once we get some formal results about how powerful AIs could be safe even in principle, then we can start thinking about how to build versions of existing systems that have those properties. (And yes, that’s a really long feedback loop, so I try to check regularly that my lines of thinking could still in principle apply to ML systems.)