The whole strawberry thing is really confusing to me; it just doesn’t map to any natural problem that humans actually care about.
It maps to the pivotal acts I think are most promising; the order of difficulty seems similar to me, and so does the kind of problem.
And when EY says “and nothing else,” it’s not clear what the actual boundaries to impact are.
I think EY mainly means ‘without the AI steering or directly impacting the macro-state of the world’. The task obviously isn’t possible if you literally can’t affect any atoms in the universe outside the strawberries themselves. But IMO it would work fine for this challenge if the humans put a bunch of resources into a room (or into a football stadium, if needed), and got the AI to execute the task without having a direct large impact on the macro-state of the world (nor an indirect impact that involves the AI deliberately steering the world toward a new macro-state in some way).
I think the test should be the Wozniak test: a simple robot enters a random home and uses the tools available to make coffee. [...] EY is right, it should be just as difficult to do consistently, given the difficulty of alignment.
This seems much easier to do right, because (a) the robot can get by with being dramatically less smart, and (b) the task itself is extremely easy for humans to understand, oversee, and verify. (Indeed, the task is so simple that in real life you could just have a human hook up a camera to the robot and steer the robot by remote control.)
For the Wozniak test, the capabilities of the system can be dramatically weaker, and the alignment is dramatically easier. This doesn’t obviously capture the things I see as hard about alignment.