You can just have a model with the capabilities of the smartest human hacker that exfiltrates itself, hacks the 1-10% of the world's computing power with the weakest protection, distributes itself Rosetta@home-style, and bootstraps whatever takeover plan it has using sheer brute force. That said, I see no reason for capabilities to land exactly on the point "smartest human hacker", because there is nothing special about that point; it could just as well be 2x, 5x, or 10x that, without any need to become 1,000,000x within a second.
I’m pretty optimistic about our white box alignment methods generalizing fine.
And I still don't get why! I would like to see the theory of generalization in DL that justifies that level of optimism, and "gradient descent is powerful" simply doesn't capture it.