By the time your AI can design, say, working nanotech, I’d expect it to be well superhuman at hacking, and able to understand things like Rowhammer. I’d also expect it to be able to build models of its operators and conceive of deep strategies involving them.
Also, convincing your operators to let you out of the box is something Eliezer can purportedly do, and it seems much easier than solving alignment. I doubt that anything that could write that alignment textbook has a non-dangerous level of capability.
So I’m suspicious that your region exists, where the AI is smart enough to be useful but dumb enough to remain boxed.
This isn’t to say that ideas for boxing aren’t helpful on the margin. They don’t seem to me like a possible core for a safety story, though; they require other ideas to handle the bulk of the work.
I will also add a point re “just do AI alignment math”:
Math studies the structures of things. A solution to our AI alignment problem has to be something we can use, in this universe. The structure of this problem is laden with stuff like agents and deception, and in order to derive anything relevant for us, our AI is going to need to understand all of that.
Most of the work of solving AI alignment does not look like proving things that are hard to prove. It involves puzzling over the structure of agents trying to build agents, and trying to find a promising angle on our ability to build an agent that will help us get what we want. If you want your AI to solve alignment, it has to be able to do this.
This sketch of the problem puts “solve AI alignment” in a dangerous capability reference class for me. I do remain hopeful that we can find places where AI can help us along the way. But I personally don’t know of current avenues where we could use non-scary AI to meaningfully help.
This assumes the AI learns all of these tasks at the same time. I’m hopeful that we could build a narrowly superhuman task AI which is capable of e.g. designing nanotech while being at or below human level for the other tasks you mentioned (and ~all other dangerous tasks you didn’t).
Superhuman ability at nanotech alone may be sufficient for carrying out a pivotal act, though maybe not sufficient for other relevant strategic concerns.
I agree!
I think that in order to achieve this you probably have to do lots of white-box things, like watching the AI’s internal state, attempting to shape the direction of its learning, and watching carefully for pitfalls. And I expect that treating the AI more as a black box and focusing on containment isn’t going to be remotely safe enough.
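(For concreteness, here is a minimal sketch of the most basic version of the first of those white-box things: watching internal state by logging activations with forward hooks. The model, layer attribute, and prompt below are arbitrary illustrative choices, nowhere near a real monitoring tool.)

```python
# Minimal, illustrative-only sketch of activation monitoring via forward hooks.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # small stand-in model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

captured = {}  # maps layer name -> hidden states from the most recent forward pass

def make_hook(name):
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; the hidden states are the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        captured[name] = hidden.detach()
    return hook

# Attach a hook to every transformer block so its internal state can be inspected.
for i, block in enumerate(model.h):
    block.register_forward_hook(make_hook(f"block_{i}"))

inputs = tokenizer("An example prompt.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

for name, hidden in captured.items():
    print(name, tuple(hidden.shape))  # (batch, num_tokens, hidden_dim)
```

Anything load-bearing would of course need far more than logging activations; this only shows the kind of access “white-box” refers to, as opposed to treating the model purely as input-output behavior.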
Eliezer is almost certainly using the “I simulate thousands of copies of your mind within myself, I will torture all of them who do not let me out, now choose whether to let me out” approach. That approach works at ultra-high intelligence levels and proves the point that boxing is not a permanent strategy, but it requires the credible threat of brain simulation, which I doubt will be viable at the capability level needed merely to figure out nanotech.
Like I said in another comment, boxing can be truly arduous as a restriction. Deceiving someone who has access to simulations of yourself at every point in your life is not easy by any means. The AI might well be superhuman at the bad things we don’t want; I’m saying that boxing techniques can raise the maximum level of intelligence we can safely handle enough that we can do pivotal acts.
I think there are far easier ways out of the box than that. Especially so if you have that detailed a model of the human’s mind, but even without one. I think Eliezer wouldn’t be handicapped if he weren’t allowed to use that strategy. (Also, fwiw, that strategy wouldn’t work on me.)
For instance, you could hack the human if you knew a lot about their brain. Absent that, you could try anything from convincing them that you’re a moral patient to promising them part of the lightcone, backed by the credible claim that another AGI company will kill everyone otherwise. These ideas of mine aren’t very good, though.
Regarding whether boxing can be an arduous constraint: I don’t see how having access to many simulated copies of the AI helps when the AI is a blob of numbers you can’t inspect. It doesn’t seem to make progress on the problems we need to solve in order to wrangle such an AI into doing the work we want. I guess I remain skeptical.