I don’t have much of an answer for you, but wanted to explicitly thank you for posting this thread. I am in a similar situation and wouldn’t have thought to ask here, but should have.
Magus
[Hi! Been lurking for a long time; this seems like as good a reason as any to actually put something out there. Epistemic status: low confidence, but it seems low-risk, high-reward to try. Not intended to be a full list, since I don't have the expertise for that; I am just posting any ideas I have that I don't already see here. These probably already exist and I just don't know the names.]
1) Input masking. Basically, for an oracle/task-AI, you ask the AI for a program that solves a slightly more general version of your problem and don't give it the information necessary to narrow it down, then run the program on your actual case (plus probably some simple test cases you know the answer to, to make sure it solves the problem).
This lets you penalize the AI for the complexity of the output program, so it will give you something narrow instead of a general reasoner.
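A minimal sketch of the loop I have in mind, in Python. `query_oracle` and `run_sandboxed` are hypothetical placeholders, not a real API; the point is just the shape of it: general spec in, program out, known test cases checked, and only then the real input.

```python
def query_oracle(general_spec: str) -> str:
    """Ask the oracle for source code solving the *generalized* problem,
    scored with a penalty on program length so it returns a narrow solver.
    Placeholder: stand-in for whatever interface the oracle exposes."""
    raise NotImplementedError

def run_sandboxed(source: str, case):
    """Execute the returned program on one input, with no network or
    filesystem access. Placeholder: stand-in for a real sandbox."""
    raise NotImplementedError

def solve_masked(general_spec: str, known_cases, real_input):
    # The oracle only ever sees the general spec, never `real_input`.
    source = query_oracle(general_spec)

    # Sanity-check the program on cases whose answers we already know.
    for case, expected in known_cases:
        if run_sandboxed(source, case) != expected:
            raise ValueError("program failed a known test case; discard it")

    # Only now is the program (not the oracle) shown the actual case.
    return run_sandboxed(source, real_input)
```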
(Obviously you still have to be sensible about the output program: don't post the code to GitHub or give it internet access.)

2) Reward function stability. We know we might have made mistakes specifying the reward function, but we have some example test cases we're confident in. Tell the AI to search for a bunch of different possible reward functions that give the same outputs as the existing one on those cases, and filter potential actions by whether any of those functions sees them as harmful.
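Again a hedged sketch, assuming you already have the candidate reward functions from that search (however it is done); the names and signatures here are illustrative, not any real library's API.

```python
def agrees_on_test_cases(candidate, reference, test_cases, tol=1e-6):
    """Keep a candidate only if it is indistinguishable from the intended
    reward function on the (state, action) cases we are confident about."""
    return all(abs(candidate(s, a) - reference(s, a)) <= tol
               for s, a in test_cases)

def filter_actions(actions, state, reference, candidates, test_cases,
                   harm_threshold=0.0):
    """Veto any action that *any* plausible reward function scores as harmful."""
    ensemble = [reference] + [c for c in candidates
                              if agrees_on_test_cases(c, reference, test_cases)]
    return [a for a in actions
            if all(r(state, a) >= harm_threshold for r in ensemble)]
```

The conservative part is the `all(...)` in the final filter: an action survives only if the whole ensemble of test-case-consistent reward functions considers it non-harmful, not just the one we happened to write down.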
I think that visiting a drug rehab center would be much less convincing (though much faster) than the method suggested above. This is because a rehab center will look bad whether or not the bad effects are very rare: its population is selected for people whose effects were bad enough to land them in rehab.
(If his argument were that the bad effects don't exist at all, a rehab center would be good evidence against that; but it sounds like he believes they're rare and mild enough to be worth it.)
In general, if you want to convince someone who takes ideas from this community seriously of something, show them evidence that would only exist if the thing you want to convince them of were true, and possibly even explicitly lay out why you expect that evidence on that hypothesis.