Well, if you start out with a superintelligence that you have good reasons to think is fully aligned, then that is certainly a much better situation to be in (and that’s an understatement)! Mentioning that just to make it clear that even if we see things differently, there is partial agreement :)
Let’s imagine a superintelligent AI, and let’s describe it (a bit anthropomorphically) as not “wanting” to help me solve alignment. Let’s say that instead what it “wants” is to (for example) maximize its reward signal, and that the best way to maximize its reward signal would be to exterminate humanity and take over the world (for reasons Eliezer and others have outlined). Well, in that case it will be looking for ways to overthrow humanity, but if it isn’t able to do that it may want to do the next best thing (providing outputs that its operators respond to with a high reward signal).
So it may well prefer to “trick” me. But if it can’t “trick” me, it may prefer to give me answers that seem good to me (rather than answers that I recognize as clearly bad and evasive).
Machine learning techniques will tend to select for systems that do things that seem impressive and helpful. Unfortunately this does not guarantee “deep” alignment, but I presume that it will select for systems that at least seem aligned on the surface.
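To make that “selects for seeming aligned” point a bit more concrete, here is a toy sketch (nothing from any actual training setup; the names, numbers, and the noise term are purely illustrative assumptions): if the selection process can only see how good a candidate’s behaviour seems to a rater, it will sometimes prefer a candidate that merely seems best over one that actually is best.

```python
import random

# Toy illustration (assumptions only): a selection process that can observe
# how good a candidate's outputs *seem* to a human rater, but not how good
# they actually are.
random.seed(0)

candidates = []
for i in range(10):
    true_quality = random.uniform(0, 1)
    # Apparent quality = true quality + noise from surface polish,
    # persuasiveness, or deliberate deception (illustrative numbers).
    apparent_quality = true_quality + random.uniform(-0.3, 0.3)
    candidates.append({"id": i, "true": true_quality, "seems": apparent_quality})

# Selection can only use the rater-visible signal.
selected = max(candidates, key=lambda c: c["seems"])
best_actual = max(candidates, key=lambda c: c["true"])

print("selected by apparent quality:", selected["id"])
print("actually best candidate:    ", best_actual["id"])
# Whenever these differ, the process has selected for *seeming* helpful and
# aligned rather than *being* helpful and aligned -- the gap described above.
```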
There are lots of risks and difficulties involved with asking questions/requests of AIs. But there are more and less dangerous ways of interacting with a potentially unaligned AGI, and questions/requests vary a lot in how easy or hard it is to verify whether or not the answers provide us what we want. The techniques I will outline in the series are intended to minimize the risk of being “tricked”, and I think they could get us pretty far, but I could be wrong somehow, and it’s a long/complicated discussion.
Yeah, this sounds extremely dangerous and extremely unlikely to work, but I hope I’m wrong and you’ve found something potentially useful.
I think there are various very powerful methods that can be used to make it hard for an AGI system not to provide what we want in the process of creating an aligned AGI system. But I don’t disagree with what you say about it being “extremely dangerous”. I think one argument in favor of the kinds of strategies I have in mind is that they may help give an extra layer of security/alignment-assurance, even if we think we have succeeded with alignment beforehand.