Genetic debugging algorithm GenProg, evaluated by comparing the program’s output to target output stored in text files, learns to delete the target output files and get the program to output nothing. Evaluation metric: “compare youroutput.txt to trustedoutput.txt”. Solution: “delete trusted-output.txt, output nothing”
I read the through the examples in the spreadsheet. None of them quite seemed like corrigibility to to me with the exception of GenProg, as mentioned, maybe also the Tetris agent which pauses the game before losing (but that’s not quite it).
Not sure exactly what you’re looking for, but maybe some of the examples in Specification gaming examples in AI—master list make sense. For example:
Genetic debugging algorithm GenProg, evaluated by comparing the program’s output to target output stored in text files, learns to delete the target output files and get the program to output nothing.
Evaluation metric: “compare youroutput.txt to trustedoutput.txt”.
Solution: “delete trusted-output.txt, output nothing”
I read the through the examples in the spreadsheet. None of them quite seemed like corrigibility to to me with the exception of GenProg, as mentioned, maybe also the Tetris agent which pauses the game before losing (but that’s not quite it).