Things that I seem to notice about the plan:
Adjusting weights is a plan for basic AIs, which can’t seek to be, e.g., internally consistent, and so eventually land wherever the attractors take them.
Say you manage to give your AI enough quirks that it goes and cries in a corner. Now you need to ease off the nerfing to get more intelligence out of it, leading to brinkmanship dynamics.
In the middle, you have a bunch of AIs, each trained to maximize various aspects of incorrigibility, while you hope either that they are incapable of cooperating, or that no single AI will act destructively (despite being trained for incorrigibility).