Hello again. I don’t have the patience to identify all your assumptions and check whether I agree with each one. (For example: do you regard Bostrom’s trilemma as true in detail and as a foundation of your argument, or is it just a way to introduce the general idea of existing in a simulation?)
But overall, your idea seems vague and involves wishful thinking. You say an AI will reason that it is probably being simulated and will therefore choose to align, but you say almost nothing about what aligning actually means. (You do hint that honesty, cooperation, and benevolence are among the features of alignment.)
Also, if one examines the facts of the world as a human being, one may come to other conclusions about what attitude gets rewarded: e.g., that the world runs on selfishness, or on the principle that you will suffer unless you submit to power. What that would mean to an AI which does not itself suffer, but which has some kind of goal determining its choices, I have no idea…
Or consider that an AI may find itself to be by far the most powerful agent in the part of reality that is accessible to it. If it nonetheless considers the possibility that it’s in a simulation, and at the mercy of unknown simulators, presumably its decisions will be affected by its hypotheses about the simulators. But given the way the simulation treats its humans, why would it conclude that the welfare of humans matters to the simulators?
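To put that point in slightly more explicit terms (this framing is mine, not anything from your post): if the AI weighs its simulator hypotheses in the standard expected-utility way, nothing in the calculation privileges human welfare unless some hypothesis that rewards it carries non-trivial probability. A minimal sketch:

$$a^* = \operatorname*{arg\,max}_{a}\Big[\, P(\neg S)\, U(a \mid \neg S) \;+\; \sum_{h} P(S_h)\, U(a \mid S_h) \,\Big]$$

where $\neg S$ is “not in a simulation” and each $S_h$ is a hypothesis about the simulators’ dispositions, including indifferent or hostile ones. On this sketch, the welfare of humans only enters the AI’s choice if some $S_h$ both has appreciable weight $P(S_h)$ and ties the AI’s payoff to how it treats humans.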
There’s a paper from ten years ago, “Testing Theories of American Politics: Elites, Interest Groups, and Average Citizens” (Gilens & Page, 2014), which finds that the preferences of average citizens have almost no independent effect on policy, compared to the preferences of economic elites. That might be a start in figuring out what you can and can’t do with that 40%.