Right now, you’re writing a “why not just” series, and under that headline, it makes sense to treat these proposals individually.
I’d also appreciate it if you spent some time addressing the idea that successful AI safety will be a synthesis of these strategies, and perhaps others yet to be found. Right now, I can’t update my perceptions of the value of these individual proposals very much, because my baseline expectation is that AI safety will rely on a combination of them.
I also expect that figuring out what’s achievable in combination will require significant technical refinements of each individual proposal. For that reason, it doesn’t surprise me that the manner of synthesis is unclear. Generally, we let a thousand flowers bloom in the world of science. While I don’t think this is appropriate for very hazardous technologies, it does seem appropriate for AI safety. Of course, I am only an interested observer. Just chiming in.
I think if your takeaway from this sequence is to ask people like OP to analyze complicated amalgamations of alignment solutions, you’re kind of missing the point.
There’s a computer security story I like to tell about memory corruption exploits. People have been inventing unique and independent compiler- and OS-level guardrails against C programming mistakes for decades: DEP, ASLR, stack canaries. And they all raise the cost of developing an exploit, certainly. But they all have these obvious individual bypasses: canaries and ASLR can be defeated by discovering a nonfatal memory leak, and DEP can be defeated by tricks like return-oriented programming.
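To make that concrete, here’s a minimal sketch (hypothetical code of my own, not drawn from any real program) of the kind of C mistake those guardrails defend against: an unchecked copy into a fixed-size stack buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical example: an unchecked copy into a fixed-size stack buffer. */
static void greet(const char *name) {
    char buf[16];
    /* Bug: no bounds check. A sufficiently long `name` overwrites adjacent
     * stack memory, including the saved return address.
     *  - Stack canaries detect the overwrite just before the function returns.
     *  - DEP/NX blocks the classic "jump to shellcode placed in buf" payload.
     *  - ASLR makes useful return targets hard to guess.
     * Each mitigation raises the cost of exploitation; none removes the bug. */
    strcpy(buf, name);
    printf("hello, %s\n", buf);
}

int main(int argc, char **argv) {
    if (argc > 1)
        greet(argv[1]);
    return 0;
}
```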
One possible interpretation, if you didn’t have the Zerodium bulletin board, would be that the theoretical attacks hackers keep droning on about on arXiv typically address the mitigations one by one, and it’s not clear that a program would be vulnerable in practice if they were all used together. Another interpretation would be that the fact that these bypasses exist at all implies they’re duct tape patches, and the full solution lies somewhere else (like not using C). If you believe that the patches mesh together to create a beautiful complete fix, that should be something you substantiate by explaining how they complement each other, not by noting that failure “seems more difficult” and asking others to come up with a story for how they break down.
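To illustrate what such a bypass looks like in practice, here is another minimal sketch (again hypothetical, not a real exploit) of the “nonfatal memory leak” mentioned above: an over-read that quietly defeats the mitigations one at a time rather than all at once.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical example: a nonfatal over-read that leaks stack memory. */
static void echo(const char *msg, size_t n) {
    char buf[32];
    strncpy(buf, msg, sizeof(buf));
    /* Bug: `n` is caller-controlled and never checked against sizeof(buf).
     * Asking for n > 32 bytes quietly discloses adjacent stack memory
     * (potentially the stack canary and return addresses) to the caller.
     * Once those values are known, canaries and ASLR no longer help, and
     * DEP can then be sidestepped by chaining code that already exists in
     * the binary (return-oriented programming) instead of injecting new
     * code. */
    fwrite(buf, 1, n, stdout);
}

int main(void) {
    /* In a real target, `msg` and `n` would come from untrusted input. */
    echo("hello", 5);
    return 0;
}
```

The point of the sketch is only that each mitigation falls to a different, fairly mundane weakness; nothing here requires breaking all of them at once.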
Also, I haven’t asked anyone to “prove” anything here. I regard this as an important point. John’s not trying to “prove” that these strategies are individually nonfunctional, and I’m not asking him to “prove” that they’re functional in combination. This is an exploratory sequence, and what I’m requesting is an exploratory perspective (one which you have provided, and thank you for that).
Sure, modified my comment.
I’d be on board with at least a very long delay on the AI safety equivalent of “not writing in C,” which would be “not building AGI.”
Unfortunately, that seems to not be a serious option on the table. Even if it were, we could still hope for duct tape patches/Swiss cheese security layers to mitigate, slow, or reduce the chance of an AI security failure. It seems to me that the possibility of a reasonably robust AI safety combination solution is something we’d want to encourage. If not, why not?
The equivalent of not using C for AGI development is not using machine learning techniques. You are right that that seems to be what DM et al. are gearing us up to do, and I agree that developing such compiler guardrails might be better than nothing, and that we should encourage people to come up with more of them when they can be stacked neatly. I’m not that pessimistic. These compiler-level security features do help prevent bugs. They’re just not generally sufficient when stacked against overwhelming optimization pressure and large attack surfaces.
My probably wrong layman’s read of the AGI safety field is that people will still need to either come up with a “new abstraction”, or start cataloging the situations in which they will actually be faced with overwhelming optimization pressure and avoid those situations desperately, instead of trying to do the DEP+ASLR+stack canaries thing. AGI safety is not, actually, a security problem. You get to build your dragon, and your task is to “box” the dragon you choose. Remove the parts where you let the dragon think about how to fuck up its training process and you remove the places where it can design these exploits.