I don’t know the answer to this, but strong-upvoted because I think this question, and variants like “is anyone working on ensuring AI labs don’t sign-flip parts of the reward function” and other equally silly-sounding things, are important.
Alas, formal methods can’t really help with that part. If you have the correct spec, formal methods can help you know, with as much certainty as we know how to get, that your program implements the spec without failing in undefined ways on weird edge cases. But even experienced, motivated formal methods practitioners sometimes get the spec wrong. I suspect getting the sign of the reward function right is part of the spec itself, where theorem provers don’t provide much leverage beyond what a marker and a whiteboard (or a program and unit tests) give you.
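To make that concrete, here’s a toy illustration (everything below is made up for the example, not anyone’s real spec): a prover will happily verify an implementation against a sign-flipped spec, because nothing inside the proof system knows which sign you *meant*.

```lean
-- Toy Lean 4 example: the "spec" itself has the sign flipped.
-- The prover checks the implementation against it without complaint,
-- because nothing in the logic knows which sign we intended.
def reward (atGoal : Bool) : Int :=
  match atGoal with
  | true  => -1   -- bug: the intended behaviour is +1 at the goal
  | false => 0

-- This "spec" encodes the flipped sign, so the proof goes through.
theorem reward_meets_flipped_spec : reward true = -1 := rfl
```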
I was thinking of unit tests generated from some spec to help with that part. If someone could build such a spec/tool and share it, said spec/tool could be extensively analysed and iterated upon.
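As a minimal sketch of the kind of test I have in mind (Python with the Hypothesis library; `reward`, `GOAL`, and `PENALTY_STATES` are hypothetical stand-ins for whatever a real spec would pin down), sign properties are easy to state mechanically once someone writes them down:

```python
# Sketch only: property-based tests that would catch a sign-flipped reward.
# The reward function and the state names here are hypothetical examples.
from hypothesis import given, strategies as st

GOAL = "goal"
PENALTY_STATES = {"lava", "timeout"}

def reward(state: str, base: float) -> float:
    # Toy reward under test: positive at the goal, negative in penalty states.
    if state == GOAL:
        return abs(base)
    if state in PENALTY_STATES:
        return -abs(base)
    return 0.0

@given(st.floats(min_value=0.1, max_value=1e6, allow_nan=False))
def test_goal_reward_is_positive(base):
    # A sign flip anywhere upstream of `reward` fails this immediately.
    assert reward(GOAL, base) > 0

@given(st.floats(min_value=0.1, max_value=1e6, allow_nan=False))
def test_penalty_reward_is_negative(base):
    for state in PENALTY_STATES:
        assert reward(state, base) < 0
```

Run with pytest and Hypothesis will try many generated magnitudes, so a flipped sign fails on the very first example rather than hiding in an untested corner.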