Can you sketch out some ideas for showing/proving premises 1 and 2? More specifically:
For 1, how would you rule out future distributional shifts increasing the influence of “bad” circuits beyond ϵ?
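For concreteness, here is a minimal numerical sketch of the concern above, under illustrative assumptions: suppose the model's logits decompose additively into a "good" and a "bad" circuit (hypothetical), and "influence" is operationalized as the variance of the logit change when the bad circuit is ablated. A bound of ϵ that holds on the training distribution can then fail under a shift, because the bad circuit's contribution may be quiet exactly where training data concentrates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decomposition (an assumption for illustration):
# total logits = good_circuit(x) + bad_circuit(x).
def good_circuit(x):
    return 2.0 * x

def bad_circuit(x):
    # Cubic contribution: negligible near x = 0 (on-distribution),
    # large in the tails (off-distribution).
    return 0.005 * x**3

def influence(xs):
    # One possible operationalization of "influence": variance of the
    # logit change induced by ablating the bad circuit.
    full = good_circuit(xs) + bad_circuit(xs)
    ablated = good_circuit(xs)
    return np.var(full - ablated)

eps = 1e-3
train = rng.normal(0.0, 1.0, 10_000)    # training distribution
shifted = rng.normal(0.0, 3.0, 10_000)  # wider, shifted distribution

print(influence(train) <= eps)    # True: the epsilon bound holds on-distribution
print(influence(shifted) <= eps)  # False: the same bound fails after the shift
```

Nothing about verifying the bound on the training distribution certifies it off-distribution; any proof of premise 1 would seem to need an argument over the whole deployment distribution, not just held-out training data.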
For 2, it seems you need to exhibit a specific K, not merely show that some K>0 exists; otherwise, how could you show that x-risk is low for a given curriculum? But that seems impossible: the “bad” subset of circuits could constitute a malign superintelligence strategically manipulating the overall AI’s output while staying within a logit-variance budget of ϵ (i.e., your other premises do not rule this out), and how could you predict what such a malign SI might be able to accomplish?