So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.
Why do you think it’s important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.
The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent’s environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely.
I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
I’m most interested in mitigation options the agent can take itself, when it suspects it’s out-of-distribution (and without being turned off, ideally).
OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves.
This is a subcase that can be expressed in your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are.
I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?
The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like ‘ask the supervisor’ or ‘slow down’ or ‘avoid going into that part of the state space’.
The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I’d like to ask you: what is your sense of optimism or pessimism in this area?
But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation
I’m more thinking of how we could automate the navigation of these situations. Detection will be part of this process, and it’s not a Boolean yes/no, but a matter of degree.
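To illustrate what a graded (rather than Boolean) out-of-distribution signal could look like, here is a minimal sketch that uses ensemble disagreement as the uncertainty metric and maps it onto the mitigation options mentioned above. Everything here is an illustrative assumption: the metric choice, the thresholds, and the toy `mitigation` policy are hypothetical, not anything proposed in the post.

```python
import numpy as np

def ensemble_uncertainty(models, x):
    """Disagreement across an ensemble of predictors, as a graded OOD signal.

    `models` is any sequence of callables mapping an input to a prediction;
    higher spread between their outputs suggests the input is further from
    the training distribution.
    """
    preds = np.array([m(x) for m in models])
    return float(preds.std(axis=0).mean())

def mitigation(uncertainty, low=0.1, high=0.5):
    # A graded response rather than a yes/no OOD flag. The thresholds
    # are arbitrary here; in practice they would be calibrated.
    if uncertainty < low:
        return "proceed"
    elif uncertainty < high:
        return "slow down"
    else:
        return "ask the supervisor"
```

For example, with three toy models that agree closely on small inputs but diverge on large ones, the policy degrades gracefully from "proceed" to "ask the supervisor" as the disagreement grows.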
Thanks! Lots of useful insights in there.