Yeah, I basically agree with everything you’re saying. This is very much a “lol we’re fucked what now” solution, not an “alignment” solution per se. The only reason we might vaguely hope that we don’t need 1 − 0.1^10 accuracy, but rather 1 − 0.1^5 accuracy, is that not losing control in the face of a more powerful actor is a pretty basic preference that doesn’t take genius LLM moves to extract. Whether this just breaks immediately because the ASI finds a loophole kind of depends on “how hard is it to break, vs. to just do the thing they probably actually want me to do”.
This is functionally impossible in regimes like developing nanotechnology. Is it impossible for dumb shit, like “write me a groundbreaking alignment paper and also obey my preferences as defined from fine-tuning this LLM”? I don’t know. I don’t love the odds, but I don’t have a great argument that they’re less than 1%?