I think you make a good point. I kind of cheated in order to resolve the story quickly. You still have the problem that a sufficiently powerful black box can potentially tell the difference between training and reality, and it also needs a perfectly innocuous objective function, or you can get negative consequences. For instance, a GPT-n that optimizes for “continuable outputs” sounds pretty benign, but could lead to exactly this kind of problem.