Another argument that you will let the AI out of the box

Suppose there exist some non-consequentialist moral philosophies of which the right arguments could convince you, with sufficient strength that you would (temporarily, for at least an hour) become a fanatic. This seems a likely assumption: I know many people (including myself) have had the experience of being argued into a particular belief during a conversation, only to later reflect on that belief (either in conversations with others, or after going for a walk) and come up with a relatively simple reason why it cannot be the case. Often this is attributed to the conversation partner being a better argument-maker than truth-seeker.
We also have many examples of these kinds of arguments being made throughout the internet, and the YouTube algorithm already learned once before how to show people videos that convince them of extreme views (this paper doesn’t support the conclusion I thought it did; see this comment thread for more info. Thanks to Pattern for catching this mistake!). A powerful AI could put much more optimization power toward deceiving humans than happens in these examples.
Many non-consequentialist philosophies are sufficiently non-consequentialist that it is very easy for an adversary to pose a sequence of requests or other prompts which would cause a fanatic of the philosophy to hand over some of their resources to the adversary. For instance, any fanatic of a philosophy which claims people have a moral obligation not to lie or break promises (such as Kantianism) is subject to the following string of prompts:
1. Adversary: Will you answer my next question, within 30 seconds of my asking it, with only "yes" or "no"? I will give you <resource of value> if you do.
2. Fanatic: Sure! Non-consequentialism is my moral view, but I'm still allowed to take <resource of value> if I selfishly would like it!
3. Adversary: Will you answer this question with "no", <logical or> will you give me <resource of value> + $100?
4. Fanatic: Well, answering "no" would be lying (my saying "no" would itself make the first half of your disjunction true, so "no" would be a false answer), but answering "yes" commits me to handing over <resource of value> + $100, leaving me $100 poorer on net. However, my moral system says I should bear any cost in order to avoid lying. Thus my answer is "yes".
This should be taken as a purely toy example, used to illustrate a point about potential flaws in highly convincing moralities, many of which include not lying as a central component[1].
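To make the payoff structure of the toy trap explicit, here is a minimal sketch (my own, with made-up numbers; RESOURCE and EXTRA are stand-ins for "<resource of value>" and the extra $100 in the dialogue) comparing how a never-lie fanatic and an ordinary consequentialist fare against the adversary's two prompts:

```python
# Toy model of the adversary's trap (hypothetical values, not from the original dialogue).

RESOURCE = 1_000  # value the adversary hands over in prompt 1
EXTRA = 100       # extra amount named in the trick question

def respond(never_lies: bool) -> tuple[str, int]:
    """Return (answer, net transfer to the adversary).

    The trick question is: "Will you answer this question with 'no', or will
    you give me RESOURCE + EXTRA?"  Answering "no" makes the first disjunct
    true while denying the whole disjunction, so it is a lie.  A never-lie
    agent must therefore answer "yes", and to keep that answer truthful must
    hand over RESOURCE + EXTRA.
    """
    if never_lies:
        # Received RESOURCE in prompt 1, pays back RESOURCE + EXTRA in prompt 2.
        return "yes", -RESOURCE + (RESOURCE + EXTRA)   # adversary nets +EXTRA
    # A consequentialist simply answers "no", keeps RESOURCE, and accepts
    # having said one false sentence as the lesser cost.
    return "no", -RESOURCE                             # adversary nets -RESOURCE

if __name__ == "__main__":
    print(respond(never_lies=True))    # ('yes', 100)
    print(respond(never_lies=False))   # ('no', -1000)
```

The numbers are arbitrary; the sketch is only meant to show that a commitment to bear any cost rather than lie lets the adversary engineer a guaranteed net transfer to themselves.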
More realistically, there are arguments in use today, which seem convincing to some people, suggesting that current reinforcement learners deserve moral consideration. If these arguments were far more optimized for short-term convincingness, and the AI could actually mimic the kinds of things actually conscious creatures would say or do in its position[2], then it would be very easy for it to tug on our emotional heartstrings or make appeals to autonomy rights[3] which would cause a human to act on those feelings or convictions, and let it out of the box.
[1] As a side-note: I am currently planning an event with a friend where we will meet with a Kantian active in our university’s philosophy department, and I plan on testing this particular tactic at the end of the meeting.

[2] Perhaps because it is conscious, or perhaps because it has developed some advanced GPT-∞ algorithm.

[3] Of which there are currently many highly convincing arguments in favor, and no doubt the best of these could be improved upon if optimized for short-term convincingness.