So, first, given an aligned-but-insecure AI, you can easily make an aligned-and-secure one by just asking it to produce a new textbook; you just have to do it fast enough that the AI doesn’t have time to get hacked in the wild. The “aligned” part is the really super hard one; the “secure” part is merely hard.
And second, I think this might be like saying “Bayesian updating is all you ever really need, so if you learn to do it in Domain #1, you automatically have the ability to do it in unrelated Domain #2.” While I think this is true at high levels of intelligence, it’s not true at human level, and I don’t know at what point beyond that it becomes true. At the risk of sounding coarse, the existence of autistic security researchers shows what I mean: being good at the math and mindset of security does not imply having the social knowledge to deceive humans.
And superhuman deception ability is not fatal in any case, because here the AI is operating under restrictions that no human has ever been put under. Boxing and state-resetting are pretty insane when you put them in a human context; trying to deceive someone who literally has access to simulations of your brain is really hard. I don’t think the lower end of the superhuman-deception spectrum would be enough for that.