Misunderstanding: I expect we can’t construct a counterfactual planner because we can’t pick out the compute core in the black-box learned model.
And my Eliezer’s problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.
we can’t pick out the compute core in the black-box learned model.
Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core.
But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI.
I don’t understand your second paragraph ‘And my Eliezer’s problem...’. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.
Oh, I wasn’t expecting you to have addressed the issue! 10.2.4 says L wouldn’t be S if it were calculated from projected actions instead of given actions. How so? Mightn’t it predict the given actions correctly?
You’re right on all counts in your last paragraph.
10.2.4 says L wouldn’t be S if it were calculated from projected actions instead of given actions. How so? Mightn’t it predict the given actions correctly?
Not sure if a short answer will help, so I will write a long one.
In 10.2.4 I talk about the possibility of an unwanted learned predictive function L−(s′,s,a) that makes predictions without using the argument a. This is possible for example by using s together with a (learned) model πl of the compute core to predict a: so a viable L− could be defined as L−(s′,s,a)=S(s′,s,πl(s)). This L− could make predictions fully compatible with the observational record o, but I claim it would not be a reasonable learned L according to the reasonableness criterion L≈S. How so?
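To make this concrete, here is a minimal toy sketch of my own (not from the paper), assuming a one-dimensional deterministic world where taking action a in integer state s always yields s+a, and a compute core whose policy model always picks a=1. The names S, pi_l and L_minus stand in for the S, πl and L− above.

```python
# Toy sketch (my own construction): a deterministic "true" predictive function S,
# a learned policy model pi_l of the compute core, and the unwanted predictor
# L_minus that ignores its action argument and substitutes pi_l(s) instead.

def S(s_next, s, a):
    """True predictive function: probability that action a in state s leads to s_next."""
    return 1.0 if s_next == s + a else 0.0

def pi_l(s):
    """Learned model of the compute core: it always takes action +1."""
    return 1

def L_minus(s_next, s, a):
    """Unwanted learned predictor: discards the given action a."""
    return S(s_next, s, pi_l(s))

# On observations produced by the real compute core (where a == pi_l(s)),
# L_minus agrees with S everywhere, so such a record o cannot tell them apart.
s = 3
a = pi_l(s)
for s_next in range(0, 8):
    assert L_minus(s_next, s, a) == S(s_next, s, a)
```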
The reasonableness criterion L≈S is similar to that used in supervised machine learning: we evaluate the learned L not primarily by how it matches the training set (how well it predicts the observations in o), but by evaluating it on a separate test set. This test set can be constructed by sampling S to create samples not contained in o. Mathematically, perfect reasonableness is defined as L=S, which implies that L predicts all samples from S fully accurately.
Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it explaining how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the S in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of L, but another version can be used stand-alone to construct a test set.
A sampling action to construct a member of the test set would set up a desired state s and action a, and then observe the resulting s′. Mathematically speaking, this observation gives additional information about the numeric value of S(s′,s,a) and of S(s′′,s,a) for all s′′≠s′.
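In the deterministic toy world from the sketches above (my own illustration, not the paper's setup), a single such sampling action pins these values down completely:

```python
# One sampling action in the toy world: set up s and a, observe s_next,
# and record what that observation says about S.

def sample_S(s, a):
    """Run the experiment: observe the successor state of (s, a)."""
    return s + a            # toy deterministic dynamics (my assumption)

s, a = 3, -1                # the state and action we deliberately set up
s_next = sample_S(s, a)     # observed outcome: 2

# In this deterministic toy, the observation fixes S(s_next, s, a) = 1
# and S(s'', s, a) = 0 for every s'' != s_next.
print((s, a, s_next))       # (3, -1, 2): one new member of the test set
```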
I discuss in the section that, if we take an observational record o sampled from S, then two learned predictive functions L1 and L2 could be found which are both fully compatible with all observations in o. So to determine which one might be a more reasonable approximation of S, we can see how well they would each predict samples not yet in o.
In the case of section 10.2.4, the crucial experimental test showing that L− is an unreasonable approximation of S is one where we create a test set by setting up an st and an at where we know that at is an action that would definitely not be taken by the real compute core software running in the agent when it encounters state st. So we set up a test where we expect that at≠πl(st). L− will (likely) mis-predict the outcome of this test. In philosophical/ontological terms, you can read this test as one that (likely) falsifies the claim that L− is a correct theory of S.
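Continuing the toy sketch from above (again my own construction, not the paper's), this is what that falsifying test looks like; it also makes concrete the earlier point about two predictors that both fit the on-policy record o yet come apart on a new sample.

```python
# Falsification test from 10.2.4, toy version: choose (s_t, a_t) with
# a_t != pi_l(s_t), observe the true outcome, and compare the predictions
# of the true S and the unwanted L_minus.

def S(s_next, s, a):
    return 1.0 if s_next == s + a else 0.0   # true dynamics of the toy world

def pi_l(s):
    return 1                                 # policy model: always action +1

def L_minus(s_next, s, a):
    return S(s_next, s, pi_l(s))             # ignores the given action a

s_t, a_t = 3, -1                             # deliberately chosen so a_t != pi_l(s_t)
s_next_observed = s_t + a_t                  # the experiment actually yields 2

print(S(s_next_observed, s_t, a_t))          # 1.0: S predicts the observation
print(L_minus(s_next_observed, s_t, a_t))    # 0.0: L_minus mis-predicts, so L_minus != S
```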
As discussed in section 10.2.4, there are parallels between the above rejection test and the idea of random exploration: random exploration causes the observational record o, the training set, to already contain observations where at≠πl(st) for any deterministic πl. So this will likely suppress the creation of an unwanted L− via machine learning.
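A hedged sketch of that point (my own toy, assuming simple ε-random exploration rather than any particular scheme from the paper): even a modest exploration rate puts transitions with at≠πl(st) into the training record o, so a learner fitting o is already penalized for collapsing to L−.

```python
# Toy illustration: with epsilon-random exploration, the training record o already
# contains off-policy transitions, i.e. ones where the taken action differs from
# the deterministic policy model pi_l, so fitting o already works against L_minus.
import random

def pi_l(s):
    return 1                                   # deterministic policy model

def explore_action(s, epsilon=0.2):
    """Usually follow pi_l, but with probability epsilon take a random action."""
    return random.choice([-1, 0, 1]) if random.random() < epsilon else pi_l(s)

random.seed(0)
o, s = [], 0
for _ in range(200):
    a = explore_action(s)
    s_next = s + a                             # true toy dynamics
    o.append((s, a, s_next))
    s = s_next

off_policy = [t for t in o if t[1] != pi_l(t[0])]
print(len(off_policy), "of", len(o), "recorded transitions have a != pi_l(s)")
```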
Some background: the symbol grounding issue I discuss in 10.2.4 is closely related to the five-and-ten problem you can find in MIRI's work on embedded agency. In my experience, most people in AI, robotics, statistics, or cyber-physical systems have no problem seeing the solution to this five-and-ten problem, i.e. how to construct an agent that avoids it. But somehow, and I do not know exactly why, MIRI-style(?) Rationalists keep treating it as a major open philosophical problem that is ignored by the mainstream AI/academic community. So you can read section 10.2.4 as my attempt to review and explain the standard solution to the five-and-ten problem, as used in statistics and engineering. The section was partly written with Rationalist readers in mind.
Philosophically speaking, the reasonableness criterion defined in my paper, and by supervised machine learning, has strong ties to Popper's view of science and engineering, which emphasizes falsification via new experiments as the key method for deciding between competing theories about the nature of reality. I believe that MIRI-style rationality de-emphasizes the conceptual tools provided by Popper. Instead it emphasizes a version of Bayesianism that provides a much more limited vocabulary to reason about differences between the map and the territory.
I would be interested to know if the above explanation was helpful to you, and if so which parts.