In some sense, the Agent Foundations program at MIRI sees the problem as: human values are currently an informal object. We can only get meaningful guarantees for formal systems. So, we need to work on formalizing concepts like human values. Only then will we be able to get formal safety guarantees.
unless i’m misunderstanding you or MIRI, that’s not their primary concern at all:
Another way of putting this view is that nearly all of the effort should be going into solving the technical problem, “How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellular level, without causing anything weird or disruptive to happen?”
Where obviously it’s important that the system not do anything severely unethical in the process of building its strawberries; but if your strawberry-building system requires its developers to have a full understanding of meta-ethics or value aggregation in order to be safe and effective, then you’ve made some kind of catastrophic design mistake and should start over with a different approach.
Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.
One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into.
unless i’m misunderstanding you or MIRI, that’s not their primary concern at all:
Good citation. Yeah, I should have flagged harder that my description there was a caricature and not what anyone said at any point. I still need to think more about how to revise the post to be less misleading in this respect.
One thing I can say is that the reason that quote flags that particular failure mode is because, according to the MIRI way of thinking about the problem, that is an easy failure mode to fall into.