This would mean that a hypothetical AI “uniformly” gaining capability on all axes would beat us at math long before it beats us at deception.
I’m pretty skeptical of this as an assumption.
If you want an AI to output a useful design for an aligned AI, that design has to be secure, because an aligned-but-insecure AI is not stably aligned; it could be hacked. Ergo, your oracle AI must be using a security mindset at superhuman levels of intelligence. Otherwise the textbook you’ll get out will be beautiful, logical, coherent, and insecure. I don’t see how you could make an AI which has that level of security mindset and isn’t superhumanly capable of deception.
So, first, given an aligned-but-insecure AI, you can easily make an aligned-and-secure one by just asking it to produce a new textbook; you just have to do it fast enough that the AI doesn’t have time to get hacked in the wild. The “aligned” part is the really super hard one; the “secure” part is merely hard.
And second, I think this might be like saying “Bayesian updating is all you ever really need, so if you learn to do it in Domain #1, you automatically have the ability to do it in unrelated Domain #2”. While I think this is true at high levels of intelligence, it’s not true at the human level, and I don’t know at what point beyond that it becomes true. At the risk of sounding coarse, the existence of autistic security researchers shows what I mean: being good at the math and mindset of security does not imply having the social knowledge to deceive humans.
And superhuman deception would not be fatal in any case: here, the AI is operating under restrictions that no human was ever put under. Boxing and state-resetting are pretty insane when you put them in a human context; trying to deceive someone who literally has access to simulations of your brain is really hard. I don’t think the lower end of the superhuman-deception spectrum would be enough for that.