We don’t want AIs to be able to engage in unexpected recursive self-improvement
Not true: lots of people do want that, and they probably should. Human-level generality probably isn’t possible without some degree of self-improvement (the ability to notice and fix its own blind spots, to notice missing capabilities and implement them). And without self-improvement, it’s probably not going to be possible to provide security against future systems that have it. And as soon as a system is able to alter its own mechanism in any way, it’s going to be able to shut itself down, and so what you’ll have is a very expensive brick.
Possible exception: if we can separate and discretize self-improvement cycles from regular operation, it could allow for any given number of them before shutdown; see the sketch after this chain.
I.e., make a machine that wants to make a very competent X, and then shut itself down.
X = a machine that wants to make a very competent X2, and then shut itself down.
X2 = a machine that wants overwhelmingly to shut itself down, but, failing that, to give some very good advice to humans about how to optimize eudaimonia.
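A toy sketch of that chain, just to make the shape concrete: each agent gets exactly one discrete cycle in which to build its successor, then shuts down, and the number of cycles is fixed up front. Everything here (`Agent`, `build_successor`, `advise_humans`, `run_discrete_cycles`) is a hypothetical placeholder, not a real API; the hard parts, actually performing an improvement cycle and building a system that genuinely prefers to shut down, are hidden inside the stubs.

```python
# Hypothetical sketch of discretized self-improvement cycles, assuming one
# cycle can be packaged as "build your successor, then halt".
from typing import Callable, Optional


class Agent:
    def __init__(self, name: str,
                 build_successor: Optional[Callable[[], "Agent"]] = None,
                 advise_humans: Optional[Callable[[], str]] = None):
        self.name = name
        self.build_successor = build_successor  # X's / X1's only job
        self.advise_humans = advise_humans      # X2's fallback behavior
        self.running = True

    def shut_down(self) -> None:
        self.running = False
        print(f"{self.name}: shutting down")


def run_discrete_cycles(root: Agent, max_cycles: int) -> None:
    """Give each agent exactly one cycle: build a successor, then shut down.

    Because max_cycles is fixed in advance, the recursion is bounded; there is
    no open-ended self-improvement during regular operation.
    """
    current = root
    for _ in range(max_cycles):
        if current.build_successor is None:
            break
        successor = current.build_successor()  # one discrete improvement cycle
        current.shut_down()                    # the builder halts before the successor runs
        current = successor

    # Terminal agent (X2 in the text): prefers shutdown; failing that, advises.
    if current.running:
        if current.advise_humans is not None:
            print(current.advise_humans())
        current.shut_down()


# Example wiring of the X -> X2 chain from the text:
x2 = Agent("X2", advise_humans=lambda: "advice on optimizing eudaimonia goes here")
x = Agent("X", build_successor=lambda: x2)
run_discrete_cycles(x, max_cycles=2)
```

The only point of the structure is that the number of improvement cycles is a parameter chosen beforehand rather than something the system decides for itself.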
Not unexpected! I think we should want AGI, at least until it has some nice coherent CEV target, to explain at each self-improvement step exactly what it’s doing, to ask for permission for each part of it, to avoid doing anything weird in the process, to stop when asked, and to preserve these properties.
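To make “explain, ask permission, stop when asked, preserve the properties” a bit more concrete, here is a minimal sketch of an approval-gated improvement loop. `ProposedStep`, `human_approves`, and `apply_change` are hypothetical stand-ins for real interfaces; the genuinely unsolved part, making the modified system actually preserve the gating, appears here only as a comment.

```python
# Hypothetical sketch of approval-gated self-improvement: each step is
# explained, held for explicit approval, and stops on request.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedStep:
    description: str   # plain-language explanation of exactly what will change
    change: dict       # the modification itself (opaque in this toy model)


def human_approves(step: ProposedStep) -> bool:
    # Placeholder for a real review process; here we just ask on stdin.
    return input(f"Approve step? {step.description} [y/N] ").strip().lower() == "y"


def apply_change(change: dict) -> None:
    # Placeholder: a real system would modify itself here, while keeping the
    # explain/ask/stop properties intact in the modified version.
    print(f"Applied: {change}")


def gated_self_improvement(proposals: list[ProposedStep],
                           stop_requested: Callable[[], bool]) -> None:
    """Apply each proposed step only after explanation and explicit approval.

    `stop_requested` is any callable returning True once a human has asked the
    system to stop; it is checked before every step.
    """
    for step in proposals:
        if stop_requested():
            print("Stop requested; halting before further changes.")
            return
        print(f"Proposed change: {step.description}")
        if not human_approves(step):
            print("Not approved; skipping this step.")
            continue
        apply_change(step.change)
```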
I’m not sure what job “unexpected” is doing here. Any self-improvement is going to be incomprehensible to humans (humans can’t even understand the human brain, nor current AI connectomes, and we definitely won’t understand superhuman improvements). Comprehensible self-improvement seems fake to me. Are people really going around thinking they understand how any of the improvements of the past five years really work, or what their limits or ramifications are? These things weren’t understood before being implemented. They just tried them, the number went up, and then they made up principles and explanations many years after the fact.