Just read the above post and some of your related posts on model splintering and symbol grounding. Here are some thoughts and comments, also on some of the other posts.

In this post you are considering a type of machine learning where the set of features F in the learned model can be updated, not just the model’s probability distribution Q. This is neat because it allows you to identify some specific risks associated with model refinements where F changes. In many discussions in the AI alignment community, these risks are associated with the keywords ‘symbol grounding’ and ‘ontological crises’, so it is good to have some math that can deconfuse and disentangle the issues.

However, you also link model splintering to out-of-distribution robustness. Specifically, in section 1.1:
In the language of traditional ML, we could connect all these issues to “out-of-distribution” behaviour. This is the problems that algorithms encounter when the set they are operating on is drawn from a different distribution than the training set they were trained on.
[....]
2. What should the AI do if it finds itself strongly out-of-distribution?
and then in section 5 you write:
We can now rephrase the out-of-distribution issues of section 1.1 in terms of the new formalism:
When the AI refines its model, what would count as a natural refactoring of its reward function?
If the refinements splinter its reward function, what should the AI do?
If the refinements splinter its reward function, and also splinters the human’s reward function, what should the AI do?
Compared to Rohin’s comment above, I interpret the strength of this link very differently. I believe that the link is pretty weak, in that I cannot rephrase the out-of-distribution problems you mentioned as being the same ‘if the AI’s refinements do X’ problems of section 5.
To give a specific example which illustrates my point:

Say that we train a classifier to classify 100x100 pixel 24-bit color pictures as being pictures of either cats or dogs. The F in this example consists of symbols that can identify each possible picture, and the symbols cat and dog. You can then have a probability distribution Q that gives you P(cat|some_picture).

We train the classifier on correctly labeled pictures of black cats and white dogs only. So it learns to classify by looking at the color of the animal.

After training, we move the classifier out-of-distribution by feeding it pictures of white cats, black dogs, cats that look a bit like pandas, etc.
The main observation now is that this last step moves the classifier out-of-distribution. It is not the step of model refinement by the ML system that is causing any out-of-distribution issue here. The classifier is still using the same F and Q, but it has definitely moved out-of-distribution in the last step.
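To make this concrete, here is a minimal sketch of the example, with a single synthetic brightness feature standing in for the actual pictures (the data and the feature are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# F (fixed): a single derived feature, mean pixel brightness, plus the labels cat/dog.
# Q is the learned conditional P(cat | picture).
def mean_brightness(n, dark):
    # Stand-in for real 100x100 24-bit pictures: dark animals ~0.2, light animals ~0.8.
    centre = 0.2 if dark else 0.8
    return rng.normal(centre, 0.05, size=(n, 1))

# Training distribution: black cats (label 1) and white dogs (label 0) only.
X_train = np.vstack([mean_brightness(500, dark=True), mean_brightness(500, dark=False)])
y_train = np.array([1] * 500 + [0] * 500)

clf = LogisticRegression().fit(X_train, y_train)   # this plays the role of Q

# Out-of-distribution inputs: white cats and black dogs.
X_test = np.vstack([mean_brightness(500, dark=False), mean_brightness(500, dark=True)])
y_test = np.array([1] * 500 + [0] * 500)

print("in-distribution accuracy:    ", clf.score(X_train, y_train))  # ~1.0
print("out-of-distribution accuracy:", clf.score(X_test, y_test))    # ~0.0
# F and Q are unchanged in the last step; only the input distribution moved.
```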
So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Or I might think of splintering as something that can have two causes: 1) the ML system/agent landing out of distribution, 2) certain updates that machine learning does.

You are considering several metrics of model splintering above: I believe some of them are splintering metrics that would measure both causes. Others only measure cause 2.
As you note, there is an obvious connection between some of your metrics and those used in several RL and especially IRL reward function learning papers. To detect splintering from cause 2), one might use a metric from such a paper even if the paper did not consider cause 2), only cause 1).
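As a rough illustration of the kind of metric I mean (not taken from any specific paper), the same simple divergence score can be pointed at either cause:

```python
import numpy as np

def divergence(p, q, eps=1e-9):
    """Symmetrised KL divergence between two categorical distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# Cause 1: compare the empirical distribution of some coarse feature
# (e.g. animal colour) on the training set vs. on recent inputs.
train_hist  = [0.50, 0.50, 0.00]    # black, white, other
recent_hist = [0.10, 0.15, 0.75]
print("environment shift score:", divergence(train_hist, recent_hist))

# Cause 2: compare the old model's predictions with the refined model's
# predictions on the same held-out inputs, after mapping both onto a
# shared set of outcomes.
old_model_preds     = [0.90, 0.10]  # P(cat), P(dog) under the old F and Q
refined_model_preds = [0.55, 0.45]  # same outcomes under the refined F* and Q*
print("refinement shift score: ", divergence(old_model_preds, refined_model_preds))
```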
Some more general remarks (also targeted at general readers of this comment section who want to get deeper into the field covered by the above post):

In many machine learning systems, from AIXI to most deep neural nets, the set of model features F never changes: the system definition is such that all changes happen inside the model parameters representing Q.

Systems where a learned function is represented by a neural net with variable nodes, or by a dynamically constructed causal graph, would more naturally be ones where F might be updated.

Of course, mathematical modeling is very flexible: one can represent any possible system as having a fixed F by shoving all changes it ever makes into Q.
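As a toy illustration of that flexibility (feature names made up): a refinement that adds a new feature to F can be re-described as a fixed-F model in which the extra feature was present all along but carried zero weight in Q.

```python
import numpy as np

# Variable-F view: the refinement step adds a new feature to F.
F_before = ["brightness"]
F_after  = ["brightness", "ear_shape"]      # F was updated

# Fixed-F view of the same system: declare the full feature set up front
# and let all change live in the parameters representing Q.
F_fixed  = ["brightness", "ear_shape"]
Q_before = np.array([3.2, 0.0])   # weight 0.0: feature present but unused
Q_after  = np.array([1.1, 2.7])   # the 'refinement' is now just a parameter update

def score(x, w):
    return float(x @ w)           # same functional form before and after

x = np.array([0.8, 1.0])
print(score(x, Q_before), score(x, Q_after))
```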
As a general observation on building models to show and analyze certain problems: if we construct a machine learning system where F never changes, then we can still produce failure modes that we can interpret as definite symbol grounding problems, or definite cases where the reward function is splintered, according to some metric that measures splintering.

Interpreting such a system as being capable of having an ontological crisis gets more difficult, but if you really want to, you could.
I have recently done some work on modeling AGI symbol grounding failures, and on listing ways to avoid them, see section 10.2 of my paper here. (I have no current plans to also cover this topic in the sequence about the topics in the paper.) I wrote that section 10.2 to be accessible also to people who do not have years of experience with ML math, so in that sense it is similar to what the above post tries to do.

My approach to modeling symbol grounding failure in the paper is similar to that in your blog post here. I model symbol grounding failures in an agent as failures of prediction that can be demonstrated empirically.
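In code, that idea amounts to something like the following check (a sketch only; the prediction interface is hypothetical):

```python
import numpy as np

def grounding_check(model, observations, threshold=5.0):
    """Flag a possible symbol grounding failure when the model's predictions
    about its own future sensor readings start failing empirically."""
    errors = []
    for obs_t, obs_next in zip(observations, observations[1:]):
        predicted = model.predict_next_observation(obs_t)   # hypothetical model API
        errors.append(np.linalg.norm(np.asarray(predicted) - np.asarray(obs_next)))
    # Persistent, large prediction error suggests the model's symbols no
    # longer track what the sensors are actually measuring.
    return float(np.mean(errors)) > threshold
```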
In the terminology of this post, in the paper I advance the argument that it would be very good design practice (and that it is a commonly used design practice in ML architectures) to avoid reward function splintering as follows. First, define the reward function R in a way where R references only a subset of symbols FR ⊆ F, where any improved F∗ made by model refinement still has the same subset FR inside it. Furthermore, to prevent splintering, this FR has to be limited to the set of symbols which directly represent a) possible sensor readings of physical sensors connected to the agent compute core, or b) potential commands to physical actuators connected to the agent compute core.
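A minimal sketch of that design pattern (the symbol names are made up, and this is not the exact construction from the paper):

```python
# FR: symbols that directly represent raw sensor readings and actuator
# commands.  These survive any refinement of the agent's world model F.
SENSOR_SYMBOLS   = ["battery_voltage", "bumper_pressed"]
ACTUATOR_SYMBOLS = ["wheel_torque_cmd"]
F_R = SENSOR_SYMBOLS + ACTUATOR_SYMBOLS

def reward(state: dict) -> float:
    """Reward defined only over FR, never over learned world-model features
    like 'object_is_a_vase', so refining F cannot splinter R."""
    r = 0.0
    r += 1.0 if state["battery_voltage"] > 11.0 else -1.0
    r -= 5.0 if state["bumper_pressed"] else 0.0
    r -= 0.01 * abs(state["wheel_torque_cmd"])
    return r
```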
I also mention that in RL architectures that learn from a reward signal, the reward signal is the only sensor reading that one aims to keep symbol-grounded at all times.
Your more recent model of symbol grounding errors here strikes me more as a special case that models symbol mapping failures in translation settings, not the symbol grounding problem we usually worry about in black-box RL agents.
Thanks! Lots of useful insights in there.

So I might classify moving out-of-distribution as something that happens to a classifier or agent, and model splintering as something that the machine learning system does to itself.

Why do you think it’s important to distinguish these two situations? It seems that the insights for dealing with one situation may apply to the other, and vice versa.
The distinction is important if you want to design countermeasures that lower the probability that you land in the bad situation in the first place. For the first case, you might look at improving the agent’s environment, or at making the agent detect when its environment moves off the training distribution. For the second case, you might look at adding features to the machine learning system itself, so that dangerous types of splintering become less likely.
I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
I agree that once you have landed in the bad situation, mitigation options might be much the same, e.g. switch off the agent.
I’m most interested in mitigation options the agent can take itself, when it suspects it’s out-of-distribution (and without being turned off, ideally).
OK. Reading the post originally, my impression was that you were trying to model ontological crisis problems that might happen by themselves inside the ML system when it learns or self-improves.

This is a subcase that can be expressed in your model, but after the Q&A in your SSC talk yesterday, my feeling is that your main point of interest and reason for optimism with this work is different. It is in the problem of the agent handling ontological shifts that happen in human models of what their goals and values are.

I might phrase this question as: If the humans start to splinter their idea of what a certain kind of morality-related word they have been using for ages really means, how is the agent supposed to find out about this, and what should it do next to remain aligned?
The ML literature is full of uncertainty metrics that might be used to measure such splits (this paper comes to mind as a memorable lava-based example). It is also full of proposals for mitigation like ‘ask the supervisor’ or ‘slow down’ or ‘avoid going into that part of the state space’.
The general feeling I have, which I think is also the feeling in the ML community, is that such uncertainty metrics are great for suppressing all kinds of failure scenarios. But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation (that the agent will see every unknown unknown coming before it can hurt you), you will be disappointed. So I’d like to ask you: what is your sense of optimism or pessimism in this area?
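For concreteness, here is a minimal sketch of one such metric, predictive entropy over an ensemble, wired to an ‘ask the supervisor’ style mitigation (the model interface and threshold are made up for illustration):

```python
import numpy as np

def predictive_entropy(member_probs):
    """Uncertainty of an ensemble: entropy of the averaged class probabilities."""
    p = np.mean(member_probs, axis=0)
    return float(-np.sum(p * np.log(p + 1e-9)))

def act(ensemble, observation, threshold=0.5):
    # Each ensemble member is a hypothetical model returning a probability
    # vector over actions for this observation.
    probs = np.stack([m.predict_probs(observation) for m in ensemble])
    if predictive_entropy(probs) > threshold:
        return "ask_the_supervisor"   # or: slow down, or avoid that part of the state space
    return int(np.argmax(probs.mean(axis=0)))
```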
But if you are expecting a 100% guarantee that the uncertainty metrics will detect every possible bad situation
I’m more thinking of how we could automate the navigating of these situations. The detection will be part of this process, and it’s not a Boolean yes/no, but a matter of degree.