This comment seems to be assuming some kind of hard takeoff scenario, which I discount as absurdly unlikely. That said, even in that scenario, I’m pretty optimistic about our white box alignment methods generalizing fine.
You can just have a model with the capabilities of the smartest human hacker that exfiltrates itself, hacks the 1-10% of the world's computing power with the shittiest protection, distributes itself Rosetta@home-style, and bootstraps whatever takeover plan using sheer brute force. That said, I see no reason for capabilities to land exactly on the point "smartest human hacker", because there is nothing special about that point; they can be 2x, 5x, 10x that, without any need to become 1000000x within a second.
I’m pretty optimistic about our white box alignment methods generalizing fine.
And I still don’t get why! I would like to see your theory of generalization in DL that allows for such a level of optimism, and “gradient descent is powerful” simply doesn’t capture that.