[Hi! Been lurking for a long time, and this seems like as good a reason as any to actually put something out there. Epistemic status: low confidence, but it seems low-risk, high-reward to try. This is not intended to be a full list; I don't have the expertise for that. I'm just posting any ideas I have that I don't already see here. These probably already exist and I just don't know the names.]
1) Input masking. For an oracle/task AI, you ask the AI for a program that solves a slightly more general version of your problem, withholding the information it would need to narrow the problem down, then run the program yourself on your actual case (plus, probably, some simple test cases you know the answer to, to check that it really solves the problem). Because the request is general, you can penalize the AI for the complexity of the output program, so it should hand you something narrow instead of a general reasoner. (Obviously you still have to be sensible about the output program: don't post the code to GitHub or give it internet access.) A sketch of the selection step is below.
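Here is a minimal sketch of that selection step in Python, under stated assumptions: the oracle itself is out of scope, and we assume it has already returned several candidate programs (as source strings) for the generalized problem. All names here (`run_candidate`, `pick_program`, the toy sorting task) are illustrative, not an established API.

```python
def run_candidate(source: str, x):
    """Execute a candidate program's solve(x) in a bare namespace."""
    namespace = {}
    exec(source, namespace)  # in practice you would want a real sandbox, not bare exec
    return namespace["solve"](x)


def pick_program(candidates, test_cases):
    """Return the simplest candidate that passes every known test case."""
    passing = []
    for source in candidates:
        try:
            if all(run_candidate(source, x) == y for x, y in test_cases):
                passing.append(source)
        except Exception:
            continue  # broken candidates are simply discarded
    # crude complexity penalty: prefer the shortest passing program
    return min(passing, key=len) if passing else None


# Toy usage: the "general problem" is sorting a list. The real input is
# never shown to the AI, only to the vetted program we run locally.
candidates = [
    "def solve(xs):\n    return sorted(xs)",
    "def solve(xs):\n    return list(reversed(sorted(xs)))",
]
tests = [([3, 1, 2], [1, 2, 3]), ([5], [5])]
program = pick_program(candidates, tests)
real_input = [9, 4, 7, 7, 1]  # the information withheld from the AI
if program is not None:
    print(run_candidate(program, real_input))
```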
2) Reward function stability. We know we might have made mistakes specifying the reward function, but we have some example test cases we're confident in. Tell the AI to look for a bunch of different possible reward functions that give the same output as the existing reward function on those test cases, and filter out any potential action that any of those functions sees as harmful. A sketch of the filtering step is below.
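A minimal sketch of the filtering side of this, in Python. In practice the AI would search for the candidate reward functions; here a small hand-written ensemble stands in for that search, and the names (`consistent_with_tests`, `is_safe`) and the "reward below zero means harmful" convention are assumptions made for illustration.

```python
HARM_THRESHOLD = 0.0  # assumed convention: reward below this counts as "harmful"


def consistent_with_tests(candidate, base_reward, test_actions, tol=1e-6):
    """Keep a candidate only if it agrees with the base reward on the
    test cases we are confident we specified correctly."""
    return all(abs(candidate(a) - base_reward(a)) <= tol for a in test_actions)


def is_safe(action, ensemble):
    """Veto an action if ANY reward function in the ensemble calls it harmful."""
    return all(r(action) >= HARM_THRESHOLD for r in ensemble)


# Toy example: actions are numbers; the base reward was meant to favor
# actions near 1 but might be mis-specified far from the test cases.
base_reward = lambda a: 1.0 - 0.1 * abs(a - 1.0)
candidates = [
    lambda a: 1.0 - 0.1 * abs(a - 1.0),  # identical to the base reward
    lambda a: 1.0 - (a - 1.0) ** 2,      # agrees on the test cases, harsher far away
    lambda a: a - 1.0,                   # disagrees on the test cases, gets filtered out
]
test_actions = [0.9, 1.0, 1.1]  # the cases we're confident the reward is right on

ensemble = [c for c in candidates if consistent_with_tests(c, base_reward, test_actions)]
for action in [1.0, 3.0]:
    print(action, "allowed" if is_safe(action, ensemble) else "vetoed")
```

The point of the toy run: action 3.0 looks fine to the base reward, but is vetoed because another function that is equally consistent with the trusted test cases rates it as harmful.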