The strategy assumes we’ll develop a good set of safety properties that we’re demanding proof of.
I think this is very important. From skimming the paper, it unfortunately seems that the authors do not discuss it much. I imagine that formally specifying safety properties is actually a rather difficult step.
To take the example of not helping terrorists spread a harmful virus: how would you even go about formulating this mathematically? This seems highly non-trivial to me. Would you need to mathematically define what exactly counts as a harmful virus?
The same holds for Asimov’s three laws of robotics: turning these into actual math or code seems quite challenging, as the sketch below tries to illustrate.
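To make the difficulty concrete, here is a minimal Lean sketch (my own illustration, not anything from the paper) of what such a specification might look like. All the names here, `Query`, `Harmful`, `Assists`, and so on, are placeholders I made up. The point is that the logical shape of the property is trivial to write down, while all of the actual content hides in predicates I can only leave as axioms:

```lean
-- Hypothetical placeholder types for a model's inputs and outputs.
axiom Query : Type
axiom Response : Type

-- These predicates carry all the real difficulty: what exactly counts
-- as "harmful", and what counts as "assisting"? They are left as axioms
-- precisely because it's unclear how to define them.
axiom Harmful : Query → Prop
axiom Assists : Query → Response → Prop

-- Once the predicates are assumed, the safety property itself is a one-liner.
def NeverHelpsSpreadHarmfulVirus (model : Query → Response) : Prop :=
  ∀ q : Query, Harmful q → ¬ Assists q (model q)
```

Proving that a model satisfies `NeverHelpsSpreadHarmfulVirus` only becomes meaningful once those axioms are replaced by real definitions, and that replacement is exactly the step that seems so hard.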
There’s likely some room for automated systems to help figure out what kind of safety humans actually want, and to turn that into rigorous specifications.
Probably obvious to many, but I’d like to point out that these automated systems would themselves need to be sufficiently aligned with humans, while also accomplishing tasks that are difficult for humans to do and that probably involve a lot of moral considerations.
Thank you for writing this review.