I like your “Corrigibility with Utility Preservation” paper.
Thanks!
I don’t get why you prefer not using the usual conditional probability
notation.
Well, I wrote in the paper (section 5) that I used p(s_t, a, s_{t+1}) instead of the usual conditional probability notation P(s_{t+1} | s_t, a) because it ‘fits better with the mathematical logic style used in the definitions and proofs below’, i.e. the proofs use the mathematics of second-order logic, not probability theory.
However, this was not my only reason for this preference. The other reason was that I had an intuitive suspicion back in 2019 that the use of conditional probability notation, in the then existing papers and web pages on balancing terms, acted as an impediment to mathematical progress. My suspicion was that it acted as an overly Bayesian framing that made it more difficult to clarify and generalize the mathematics of this technique any further.
In hindsight in 2021, I can be a bit more clear about my 2019 intuition. Armstrong’s original balancing term elements E(v | u→v) and E(u | u→u), where u→v and u→u are low-probability near-future events, can be usefully generalized (and simplified) as the Pearlian E(v | do(v)) and E(u | do(u)), where the do terms are interventions (or ‘edits’) on the current world state.
The notation E(v|u→v) makes it look like the
balancing terms might have some deep connection to Bayesian updating
or Bayesian philosophy, whereas I feel they do not have any such deep
connection.
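To make the conditioning-versus-intervention distinction concrete, here is a minimal toy sketch in Python (my own illustration, not anything from the paper or from Armstrong’s setup; all variables and probabilities are made up). In a three-variable model with a common cause, E(Y | X=1) and E(Y | do(X=1)) come apart, which is exactly the gap the conditional-probability notation tends to blur:

```python
# Toy structural causal model, purely for illustration: Z -> X, Z -> Y, X -> Y.
# Conditioning on X = 1 also tells us something about the common cause Z,
# while do(X = 1) cuts the Z -> X arrow and leaves the distribution of Z alone.
# All variables and probabilities here are made up.

P_Z1 = 0.5                       # P(Z = 1)

def p_z(z):
    return P_Z1 if z == 1 else 1 - P_Z1

def p_x_given_z(x, z):           # P(X = x | Z = z)
    p1 = 0.9 if z == 1 else 0.1
    return p1 if x == 1 else 1 - p1

def p_y1_given_xz(x, z):         # P(Y = 1 | X = x, Z = z)
    return 0.3 + 0.2 * x + 0.4 * z

# E(Y | X = 1): condition, i.e. renormalize over worlds where X = 1 was observed.
num = sum(p_z(z) * p_x_given_z(1, z) * p_y1_given_xz(1, z) for z in (0, 1))
den = sum(p_z(z) * p_x_given_z(1, z) for z in (0, 1))
e_y_cond = num / den

# E(Y | do(X = 1)): intervene, i.e. force X = 1 but keep P(Z) untouched.
e_y_do = sum(p_z(z) * p_y1_given_xz(1, z) for z in (0, 1))

print(f"E(Y | X=1)     = {e_y_cond:.2f}")   # 0.86: observing X=1 raises our belief in Z=1
print(f"E(Y | do(X=1)) = {e_y_do:.2f}")     # 0.70: forcing X=1 leaves P(Z) at 0.5
```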
That being said, in my 2020 paper
I present a simplified version of the math in the 2019 paper using the
traditional P(s_{t+1} | s_t, a) notation again, and without having to
introduce do.
g_c leads to TurnTrout’s attainable utility preservation.
Yes, it is very related: I explore that connection in more detail in section 12 of my 2020 paper.
In general I think that counterfactual expected-utility reward function terms are a Swiss army knife with many interesting uses. I feel that as a community, we have not yet gotten to the bottom of their possibilities (and their possible failure modes).
Why not use V in the definition of π∗?
In the definition of π∗ (section 5.3, equation 4) I am using a V term, so I am not sure if I understand the question.
(I am running out of time now, will get back to the remaining
questions in your comment later)
pi has form afV, V has form mfV, f is a long reused term. Expand recursion to get afmfmf… and mfmfmf.… Define E=fmE and you get pi=aE without writing f twice. Sure, you use V a lot but my intuition is that there should be some a priori knowable argument for putting the definitions your way or your theory is going to end up with the wrong prior.
Thanks for expanding on your question about the use of V. Unfortunately, I still have a hard time understanding your question, so I’ll say a few things and hope that will clarify.
If you expand the V term defined in (5) recursively, you get a tree-like structure. Each node in the tree has as many sub-nodes as there are elements in the set W. The tree is in fact a tree of branching world lines. Hope this helps you visualize what is going on.
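As a purely illustrative sketch (not code from the paper; the world-state set, transition probabilities, reward and discount factor below are made-up placeholders), here is a finite-depth expansion of a V-style recursion that makes the branching-tree structure explicit:

```python
# Illustrative only: expand a V-style recursion into an explicit tree of
# branching world lines, with one child per element of the world-state set W.
W = ["s0", "s1", "s2"]                        # toy world-state set (placeholder)
ACTIONS = ["a0", "a1"]
GAMMA = 0.9

def p(s, a, s_next):                          # toy transition probability p(s_t, a, s_{t+1})
    return 1.0 / len(W)                       # uniform, just to have something runnable

def R(s, s_next):                             # toy reward function
    return 1.0 if s_next == "s2" else 0.0

def V(s, depth):
    """Finite-depth expansion of V(s) = max_a sum_{s'} p(s,a,s') (R(s,s') + gamma V(s'))."""
    if depth == 0:
        return 0.0                            # truncate the infinite recursion
    return max(
        sum(p(s, a, s2) * (R(s, s2) + GAMMA * V(s2, depth - 1)) for s2 in W)
        for a in ACTIONS
    )

# Each call to V(s, d) fans out over every element of W (and every action),
# so the call tree is exactly the tree of branching world lines described above.
print(V("s0", depth=3))
```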
I could shuffle around some symbols and terms in the definitions (4)
and (5) and still create a model of exactly the same agent that will
behave in exactly the same way. So the exact way in which these two
equations are written down and recurse on each other is somewhat
contingent. My equations stay close to what is used when you model an
agent or ‘rational’ decision making process with a Bellman equation. If your
default mental model of an agent is a set of Q-learning equations, the
model I develop will look strange, maybe even unnatural at first
sight.
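To spell out that contrast (these are generic, standard Bellman-style optimality equations written in the paper's p(s_t, a, s_{t+1}) notation, not the paper's exact definitions (4) and (5)): the V-style form is

π^*(s_t) = argmax_a Σ_{s_{t+1}} p(s_t, a, s_{t+1}) (R(s_t, s_{t+1}) + γ V(s_{t+1}))
V(s_t) = max_a Σ_{s_{t+1}} p(s_t, a, s_{t+1}) (R(s_t, s_{t+1}) + γ V(s_{t+1}))

while the equivalent Q-style form, closer to the Q-learning mental model, is

Q(s_t, a) = Σ_{s_{t+1}} p(s_t, a, s_{t+1}) (R(s_t, s_{t+1}) + γ max_{a'} Q(s_{t+1}, a'))
π^*(s_t) = argmax_a Q(s_t, a)

Both pick out the same optimal policy; they just factor the recursion differently.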
or your theory is going to end up with the wrong prior.
OK, maybe this is the main point that inspired your question. The
agency/world models developed in the paper are not a ‘theory’, in the
sense that theories have predictive power. A mathematical model used
as a theory, like F = m·a, predicts how objects will accelerate when
subjected to a force.
The agent model in the paper does not really ‘predict’ how agents will
behave. The model is compatible with almost every possible agent
construction and agent behavior, if we are allowed to pick the agent’s
reward function R freely after observing or reverse-engineering the
agent to be modeled.
On purpose, the agent model is constructed with so many ‘free
parameters’ that it has no real predictive power. What you get here
is an agent model that can describe almost every possible agent and world in which it could operate.
In mathematics, the technique I am using in the paper is sometimes called arguing ‘without loss of generality’: I develop very general proofs by introducing constraining assumptions ‘without loss of generality’.
Another thing to note is that the model of the π^*_{pfcgc} agent in the paper, the model of an agent with the corrigibility-creating safety layer, acts as a specification of how to add this layer to any generic agent design.
This dual possible use, theory or specification, of models can be
tricky if you are not used to it. In observation-based science,
mathematical models are almost always theories only. In engineering
(and in theoretical CS, the kind where you prove programs correct,
which tends to be a niche part of CS nowadays) models often act as
specifications. In statistics, the idea that statistical models act
as theories tends to be de-emphasized. The paper uses models in the
way they are used in theoretical CS.
You may want to take a look at this post in the
sequence,
which copies text from a 2021 paper where I tried to make the
theory/specification use of models more accessible. If you read that
post, it might be easier to fully track what is happening, in a
mathematical sense, in my 2019 paper.
In category theory, one learns that good math is like kabbalah, where nothing is a coincidence. All short terms ought to mean something, and when everything fits together better than expected, that is a sign that one is on the right track, and that there is a pattern to formalize. π^*_X = argmax_a Σ_y p_X(x,a,y) (R_X(x,y) + γ V_X(y)) and V_X = max_a Σ_y p_X(x,a,y) (R_X(x,y) + γ V_X(y)) can be replaced by π^*_X = argmax_a E_X and E_X = Σ_y p_X(x,a,y) (R_X(x,y) + γ max_a E_X). I expect that the latter formulation is better because it is shorter. Its only direct effect would be that you would write max_a E_X instead of V_X, so the previous sentence must cash out as this being a good thing. Indeed, it points out a direction in which to generalize. How does your math interact with quantilization? I plan to expand when I’ve had time to read all links.
In category theory, one learns that good math is like kabbalah, where nothing is a coincidence.
OK, I think I see what inspired your question.
If you want to give the math this kind of kabbalah treatment, you may also look at the math in [EFDH16], which produces agents similar to my definitions (4) and (5), and also some variants that have different types of self-reflection. In the later paper here, Everitt et al. develop some diagrammatic models of this type of agent self-awareness, but the models are not full definitions of the agent.
For me, the main question about the math developed in the paper is how exactly I can map the model and the constraints (C1-3) back to things I can or should build in physical reality.
There is a thing going on here (when developing agent models, especially when treating AGI/superintelligence and embeddedness) that also often happens in post-Newtonian
physics. The equations work, but if we attempt to map these equations
to some prior intuitive mental model we have about how reality or decision
making must necessarily work, we have to conclude that this attempt raises some
strange and troubling questions.
I’m with modern physics here (I used to be an experimental physicist
for a while), where the (mainstream) response to this is that ‘the math
works, your intuitive feelings about how X must necessarily work are
wrong, you will get used to it eventually’.
BTW, I offer some additional interpretation of a
difficult-to-interpret part of the math in section 10 of my 2020
paper here.
How does your math interact with quantilization?
You could insert quantilization in several ways in the model. The most obvious way is to change the basic definition (4). You might also define a transformation that takes any reward function R and returns a quantilized reward function R_q; this gives you a different type of quantilization, but I feel it would be in the same spirit.
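As a minimal sketch of the first route (my own illustrative code, not from the paper; all state, action, reward and value functions below are placeholders), a quantilizing variant of a definition-(4)-style policy would sample from the top-q fraction of actions ranked by one-step expected value, instead of taking the argmax. Taylor-style quantilization is usually defined relative to a base distribution over actions; the uniform choice here is just the simplest placeholder.

```python
# Illustrative only: replace the argmax in a definition-(4)-style policy by a
# q-quantilizer that samples uniformly from the top-q fraction of actions,
# ranked by expected (reward + discounted value).
import random

def expected_value(s, a, W, p, R, V, gamma):
    """Standard one-step lookahead: sum_{s'} p(s,a,s') (R(s,s') + gamma V(s'))."""
    return sum(p(s, a, s2) * (R(s, s2) + gamma * V(s2)) for s2 in W)

def quantilizing_policy(s, actions, W, p, R, V, gamma, q=0.1):
    """Sample uniformly from the top-q fraction of actions by expected value."""
    ranked = sorted(actions,
                    key=lambda a: expected_value(s, a, W, p, R, V, gamma),
                    reverse=True)
    k = max(1, int(len(ranked) * q))          # keep at least one action
    return random.choice(ranked[:k])

# Tiny usage demo with made-up placeholder functions.
if __name__ == "__main__":
    W = [0, 1, 2]
    actions = ["left", "right", "wait"]
    p = lambda s, a, s2: 1.0 / len(W)         # uniform toy transition probabilities
    R = lambda s, s2: float(s2)               # toy reward
    V = lambda s2: 0.0                        # toy value estimate
    print(quantilizing_policy(0, actions, W, p, R, V, gamma=0.9, q=0.4))
```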
In a more general sense, I do not feel that quantilization can produce
the kind of corrigibility I am after in the paper. The effects you
get on the agent by changing f_0 into f_c, by adding a balancing
term to the reward function, are not the same effects produced by
quantilization.