Thoughts on Abram Demski's Partial Agency
When I read Partial Agency, I was struck with a desire to try formalizing this partial agency thing. Defining Myopia seems like it might have a definition of myopia; one day I might look at it. Anyway,
Formalization of Partial Agency: Try One
A myopic agent is optimizing a reward function $R(x_0, y(x_0))$, where $x$ is the vector of parameters it's thinking about and $y$ is the vector of parameters it isn't thinking about. The gradient descent step picks the $\delta x$ in the direction that maximizes $R(x_0 + \delta x, y(x_0))$ (it is myopic, so it can't consider the effects on $y$), and then moves the agent to the point $(x_0 + \delta x, y(x_0 + \delta x))$.
This is dual to a stop-gradient agent, which picks the $\delta x$ in the direction that maximizes $R(x_0 + \delta x, y(x_0 + \delta x))$ but then moves the agent to the point $(x_0 + \delta x, y(x_0))$ (the gradient through $y$ is stopped).
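To make the two update rules concrete, here is a minimal sketch in JAX. The reward $R$, the coupling $y(x)$, and the step size are all made-up toy choices rather than anything from the posts; the point is just where the gradient does and doesn't flow. (Slightly confusingly, `lax.stop_gradient` is what implements the myopic agent's frozen $y$ here, while the "stop-gradient agent" above is the dual update.)

```python
# Toy sketch of the two update rules; R, y, and the step size are arbitrary.
import jax
import jax.numpy as jnp

def y(x):
    # Hypothetical coupling: the parameters the agent isn't thinking about,
    # as a function of the parameters it is thinking about.
    return jnp.tanh(x)

def R(x, y_val):
    # Hypothetical reward depending on both sets of parameters.
    return -(x - 1.0) ** 2 - x * y_val

def myopic_step(x0, lr=0.1):
    # Choose delta-x to increase R(x0 + dx, y(x0)): y is held at y(x0) while
    # differentiating (stop_gradient), but afterwards y moves to y(x0 + dx).
    g = jax.grad(lambda x: R(x, jax.lax.stop_gradient(y(x))))(x0)
    x1 = x0 + lr * g
    return x1, y(x1)

def dual_step(x0, lr=0.1):
    # The "stop-gradient agent": differentiate through y when choosing
    # delta-x, but then leave y at its old value y(x0).
    g = jax.grad(lambda x: R(x, y(x)))(x0)
    x1 = x0 + lr * g
    return x1, y(x0)

x0 = jnp.array(0.5)
print(myopic_step(x0))
print(dual_step(x0))
```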
For example,
Nash equilibria - $x$ are the parameters defining the agent's behavior, $y(x_0)$ are the parameters of the other agents if they go up against the agent parametrized by $x_0$, and $R$ is the reward given for an agent $x$ going up against a set of agents $y$ (see the toy sketch after these examples).
Image recognition with a neural network - $x$ are the parameters defining the network, $y(x_0)$ are the image classifications for every image in the dataset for the network with parameters $x_0$, and $R$ is the loss on those classifications $y$, plus the loss of the network described by $x$ on classifying the current training example.
Episodic agent - $x$ are the parameters describing the agent's behavior, $y(x_0)$ are the performances of the agent $x_0$ in future episodes, and $R$ is the sum of $y$, plus the reward obtained in the current episode.
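For instance, here is a toy instantiation of the first example, a Cournot-style quantity game I made up for illustration (it isn't from the posts): $y(x_0)$ is the other player's best response to our choice $x_0$, and the myopic and full gradients disagree because the full gradient also credits $x$ for how it moves the opponent.

```python
# Toy Cournot-style game: both players pick quantities; price is 1 - x - y.
import jax
import jax.numpy as jnp

def best_response(x):
    # The opponent maximises y * (1 - x - y), giving y(x) = (1 - x) / 2.
    return (1.0 - x) / 2.0

def reward(x, y_val):
    # Our reward for playing x against an opponent playing y_val.
    return x * (1.0 - x - y_val)

x0 = jnp.array(0.4)
# Myopic: the opponent is frozen at their response to x0.
myopic_grad = jax.grad(lambda x: reward(x, best_response(jax.lax.stop_gradient(x))))(x0)
# Non-myopic: also differentiate through the opponent's response.
full_grad = jax.grad(lambda x: reward(x, best_response(x)))(x0)
print(myopic_grad, full_grad)  # -0.1 vs 0.1: they even point in opposite directions here
```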
Partial Agency due to Uncertainty?
Is it possible to cast partial agency in terms of uncertainty over reward functions? One reason I'd be myopic is if I didn't believe that I could, in expectation, improve some part of the reward, perhaps because it's intractable to calculate (the behavior of other agents) or because it's something I'm not programmed to care about (reward in other episodes).
Let $R_1$ be drawn from a probability distribution over reward functions. Then one could perhaps decompose the true, uncertain reward into $R' = R_0(x_0) + R_1(x_0)$, defined in such a way that $E(R_1(x_0 + \delta x) - R_1(x_0)) \approx 0$ for any $\delta x$. Then this would be myopia where the agent either doesn't know or doesn't care about $R_1$, or at least doesn't know or care what its output does to $R_1$. This seems sufficient, but not necessary.
Now I have two things that might describe myopia, so let's use both of them at once! Since you only end up doing gradient descent on $R_0$, it would make sense to say $R'(x) = R(x, y(x))$, $R_0(x) = R(x, y(x_0))$, and hence that $R_1(x) = R(x, y(x)) - R(x, y(x_0))$.
Since $R_1(x_0 + \delta x) = R_1(x_0) + \delta x \frac{\partial R_1}{\partial x}$ for small $\delta x$, this means that $E\left(\frac{\partial R_1}{\partial x}\right) = 0$, so substituting in my expression for $R_1$ gives $E\left(\frac{\partial R}{\partial x} + \frac{\partial R}{\partial y}\frac{\partial y}{\partial x} - \frac{\partial R}{\partial x}\right) = 0$, so $E\left(\frac{\partial R}{\partial y}\frac{\partial y}{\partial x}\right) = 0$. Uncertainty is only over $R$ (so $\frac{\partial y}{\partial x}$ factors out of the expectation), so this is just the claim that the agent will be myopic with respect to $y$ if $E\left(\frac{\partial R}{\partial y}\right) = 0$. So it won't want to include $y$ in its gradient calculation if it thinks the gradients with respect to $y$ are, on average, 0. Well, at least I didn't derive something obviously false!
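As a quick numerical sanity check of that last claim (with a made-up family of reward functions and a made-up coupling $y(x)$, nothing from the posts): if the slope of $R$ in $y$ is zero in expectation, the full gradient and the myopic gradient agree on average.

```python
# Check: sample rewards R(x, y) = -(x - 1)^2 + b*y with E[b] = 0, so E[dR/dy] = 0.
import jax
import jax.numpy as jnp

def y(x):
    return jnp.sin(x)  # arbitrary coupling with nonzero dy/dx

def grads(b, x0):
    R = lambda x, y_val: -(x - 1.0) ** 2 + b * y_val
    full = jax.grad(lambda x: R(x, y(x)))(x0)
    myopic = jax.grad(lambda x: R(x, jax.lax.stop_gradient(y(x))))(x0)
    return full, myopic

bs = jax.random.normal(jax.random.PRNGKey(0), (100_000,))  # zero-mean y-slopes
x0 = jnp.array(0.3)
full_g, myopic_g = jax.vmap(grads, in_axes=(0, None))(bs, x0)
print(full_g.mean(), myopic_g.mean())  # approximately equal
```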
But Wait There’s More
When writing the examples for the gradient-descent-y formalisation, something struck me: it seems there's an $R(x) = r(x) + \sum_i y_i(x)$ structure to a lot of them, where $r$ is the reward on the current episode, and $y_i$ are the rewards obtained on future episodes.
You could maybe even use this to have soft episode boundaries, like say the agent receives a reward $r_t$ on each timestep, so $R(x) = r_0(x) + r_1(x)\alpha + r_2(x)\alpha^2 + \sum_{i \geq 3} r_i(x)\alpha^i$, and say that $\alpha^3 \ll 1$ so that $\frac{\partial R}{\partial r_i} \ll 1$ for $i \geq 3$, which is basically the criterion for myopia up above.
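A small sketch of what that looks like numerically (the rewards and $\alpha$ here are arbitrary): with $R = \sum_i r_i \alpha^i$, the sensitivity $\frac{\partial R}{\partial r_i}$ is just $\alpha^i$, which is negligible beyond the first few timesteps once $\alpha^3 \ll 1$.

```python
# Soft episode boundaries via discounting: dR/dr_i = alpha**i.
import jax
import jax.numpy as jnp

alpha = 0.1  # alpha**3 = 1e-3 << 1

def R(rs):
    discounts = alpha ** jnp.arange(rs.shape[0])
    return jnp.sum(rs * discounts)

rs = jnp.array([1.0, 2.0, 0.5, 3.0, 1.5])  # arbitrary per-timestep rewards
print(jax.grad(R)(rs))  # [1, 0.1, 0.01, 0.001, 0.0001]: effectively myopic for i >= 3
```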
Unrelated Note
On a completely unrelated note, I read the Parable of Predict-O-Matic in the past, but foolishly neglected to read Partial Agency beforehand. The only thing that I took away from PoPOM the first time around was the bit about inner optimisers, coincidentally the only concept introduced that I had been thinking about beforehand. I should have read the manga before I watched the anime.
So the definition of myopia given in Defining Myopia was quite similar to my expansion in the But Wait There's More section; you can roughly match them up by saying $r(x) = \sum_i f_i r_i(x)$ and $y_i(x) = (1 - f_i) r_i(x)$, where $f_i$ is a real number corresponding to the amount that the agent cares about rewards obtained in episode $i$ and $r_i$ is the reward obtained in episode $i$. Putting both of these into the sum gives $R(x) = \sum_i r_i(x)$, the undiscounted, non-myopic reward that the agent eventually obtains.
In terms of the $R = R_0 + R_1$ definition that I give in the uncertainty framing, this is $R_0 = R(x, y_0) = \sum_i f_i r_i(x) + \sum_i (1 - f_i) r_i(x_0)$, and $R_1 = R(x, y) - R(x, y_0) = \sum_i (1 - f_i)(r_i(x) - r_i(x_0))$.
So if you let $r$ be a vector of the reward obtained on each step and $f$ be a vector of how much the agent cares about each step, then the update is $x \to x + \epsilon \sum_i f_i \frac{\partial r_i}{\partial x}$, and thus the change to the overall reward is $R \to R + \epsilon \sum_i \frac{\partial r_i}{\partial x} \sum_j f_j \frac{\partial r_j}{\partial x}$, which can be negative if the two sums have different signs.
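Here is a toy check of that last claim, with two made-up "episodes" whose gradients point in opposite directions: a fully myopic step ($f = (1, 0)$) improves the current episode but lowers the undiscounted total $R = \sum_i r_i$.

```python
# A myopic update x -> x + eps * sum_i f_i dr_i/dx can reduce the total reward.
import jax
import jax.numpy as jnp

def r0(x):
    return -(x - 1.0) ** 2          # current episode: prefers x near 1

def r1(x):
    return -10.0 * (x + 1.0) ** 2   # future episode: prefers x near -1, more strongly

f = jnp.array([1.0, 0.0])  # only cares about the current episode
eps = 0.01
x0 = jnp.array(0.0)

step = eps * (f[0] * jax.grad(r0)(x0) + f[1] * jax.grad(r1)(x0))
x1 = x0 + step

total = lambda x: r0(x) + r1(x)
print(total(x0), total(x1))  # the total reward drops after the myopic step
```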
I was hoping that a point would reveal itself to me about now but I’ll have to get back to you on that one.