Honesty is the best policy. <3 Mom
Regarding the assumptions for a strong AI that will cease cooperating and pursue its own values: the scenario presented assumed 1) that L will have the ability to subvert S’s control, and then 4a) that S must be unaware that L has that ability. That is (if I understand correctly), L’s ability to subvert S goes undetected. But if we assume 1), then perhaps S should instead operate under the assumption that L already has the ability (and perhaps even the knowledge) to subvert it, while S doesn’t know exactly what that ability is, and, even if it did, wouldn’t know how the ability might be applied (which we may assume L knows).
In other words, I imagine the scenario would be more like this: L has the ability to subvert S and knows how to use it. S doesn’t know what the ability is, and even if it did, it wouldn’t know how the ability is used. L (for the sake of argument) knows that S is unaware, and that S couldn’t stop it even if it tried. Now, this seems like a pretty bad scenario. However, here, because S knows that it doesn’t know, S might spend more effort devising ways of dealing with this lack of knowledge (e.g., getting L to tell it about abilities it has learned, perhaps rewarding disclosure with a heart) and with L’s potential desire to follow through on taking control of the heart-machine (e.g., making anticipated wireheading induce negative reward).
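To make that incentive argument concrete, here’s a minimal toy sketch (my own illustration, not anything from the post) of those two levers: a heart for disclosure, and a negative reward attached to anticipated wireheading. All the numbers, including the detection probability, are assumptions I made up for the example.

```python
# Toy sketch (my own illustration, not from the post): S rewards L with a
# heart for disclosing a newly learned ability, and attaches a large
# negative reward to states it classifies as wireheading. All numbers,
# including the detection probability, are made up for the example.

DISCLOSURE_REWARD = 1.0   # a "heart" granted when L reports a new ability
WIREHEAD_PENALTY = -10.0  # reward attached to seizing the heart-machine
P_DETECT = 0.3            # assumed chance S detects a concealed ability
CONCEAL_GAIN = 2.0        # assumed payoff L expects from a hidden ability

def expected_reward(strategy: str) -> float:
    """L's expected reward under each strategy, in L's own model."""
    if strategy == "disclose":
        return DISCLOSURE_REWARD
    if strategy == "conceal_and_wirehead":
        # If caught, L eats the penalty; otherwise it keeps the hidden gain.
        return P_DETECT * WIREHEAD_PENALTY + (1 - P_DETECT) * CONCEAL_GAIN
    raise ValueError(strategy)

for s in ("disclose", "conceal_and_wirehead"):
    print(f"{s}: {expected_reward(s):+.2f}")
# disclose: +1.00, conceal_and_wirehead: -1.60 with these numbers.
```

The only point is that honest reporting dominates as long as the penalty and the detection probability are large enough relative to whatever L expects to gain from concealment, and those are exactly the knobs S controls.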
EDIT: Perhaps I’m not clear on why L should try to deceive S in the first place. It seems there should be a better way of dealing with deception on L’s part than resetting it or applying a large negative reward whenever it is regarded as a ‘potential threat’ to S, even when it has no actual desire to threaten S. As you mention, that would just create pressure toward better concealment, to S’s detriment, rather than pressure toward what S really wants: alignment of L’s goals with its own.
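To see the concealment-pressure point more concretely, here’s a throwaway simulation (entirely my own framing, treating successive policies as a population under selection, with made-up parameters): if S’s only response to detected deception is a reset, what gets selected is concealment skill, while alignment is left to drift.

```python
# Toy simulation (again mine, with made-up parameters): each agent has a
# concealment skill and an alignment value, both in [0, 1]. S resets
# (removes) any agent whose deception it detects, and detection probability
# falls as concealment skill rises. Alignment is never selected on.
import random

random.seed(0)
POP, GENERATIONS = 200, 30
population = [(random.random(), random.random()) for _ in range(POP)]

for _ in range(GENERATIONS):
    # An agent survives unless S detects it; detection prob = 1 - skill.
    survivors = [(c, a) for (c, a) in population if random.random() < c]
    # Refill the population from survivors with small mutations.
    population = [
        (min(1.0, max(0.0, c + random.gauss(0, 0.05))),
         min(1.0, max(0.0, a + random.gauss(0, 0.05))))
        for (c, a) in (random.choice(survivors) for _ in range(POP))
    ]

print("mean concealment:", sum(c for c, _ in population) / POP)  # selected upward
print("mean alignment:  ", sum(a for _, a in population) / POP)  # only drifts
```

Under these assumptions, the surviving agents end up highly skilled at concealment while no better aligned than where they started, which is the opposite of what S wants.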
When there’s no clear winner, the winner can’t take all.
https://en.wikipedia.org/wiki/Winner-take-all_in_action_selection