OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns of self-improves.
This is a subcase that can be expressed in by your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimisim with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are.
I might phrase this question as: If the humans start to splinter their idea of what a certain kind morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?
The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like ‘ask the supervisor’ or ‘slow down’ or ‘avoid going into that part of the state space’.
The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I’d like to ask you: what is your sense of optimism or pessimism in this area?
But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation
I’m more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it’s not a Boolean yes/no, but a matter of degree.
OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns of self-improves.
This is a subcase that can be expressed in by your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimisim with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are.
I might phrase this question as: If the humans start to splinter their idea of what a certain kind morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?
The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like ‘ask the supervisor’ or ‘slow down’ or ‘avoid going into that part of the state space’.
The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I’d like to ask you: what is your sense of optimism or pessimism in this area?
I’m more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it’s not a Boolean yes/no, but a matter of degree.