[edited to delete and replace an earlier question] Question about the paper: under equation (3) on page 4 I am reading:

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

This confused me initially to no end, and still confuses me. Should this be:
The 0 on the l.h.s. means the imitator is picking the action $a$ itself instead of deferring to the demonstrator or picking one of the other actions???
This would seem more consistent with the definitions that follow, and it would make more sense overall.
A policy outputs a distribution over $\{0,1\} \times \mathcal{A}$, and equations (3) and (4) define what this distribution is for the imitator. If it outputs $(0, a)$, that means $q_t = 0$ and $a_t = a$, and if it outputs $(1, a)$, that means $q_t = 1$ and $a_t = a$. When I say

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
that’s just describing the difference between equations (3) and (4). Look at equation (4) to see that when $q_t = 1$, the distribution over the action is equal to that of the demonstrator. So we describe the behavior that follows $q_t = 1$ as “deferring to the demonstrator”. If we look at the distribution over the action when $q_t = 0$, it’s something else, so we say the imitator is “picking its own action”.
The 0 on the l.h.s. means the imitator is picking the action $a$ itself instead of deferring to the demonstrator or picking one of the other actions???

The 0 means the imitator is picking the action, and the $a$ means it’s not picking another action that’s not $a$.
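To make the sampling story concrete, here is a minimal sketch in Python. Everything in it (the names `query_prob`, `imitator_dist`, `demonstrator_dist`, and the uniform placeholder distributions) is made up for illustration and is not notation from the paper; the point is only the two-stage structure of one output $(q_t, a_t)$:

```python
import random

# A minimal sketch of sampling one output (q_t, a_t) from a policy over
# {0,1} x A. All names and distributions below are illustrative
# placeholders, not definitions from the paper.

ACTIONS = ["a1", "a2", "a3"]  # a small stand-in for the action set A

def query_prob(history):
    # Placeholder: probability of deferring (q_t = 1) given the history.
    return 0.5

def imitator_dist(history):
    # Placeholder: the imitator's own action distribution given the history.
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def demonstrator_dist(history):
    # Placeholder: the demonstrator's action distribution given the history.
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def sample(dist):
    actions = list(dist)
    return random.choices(actions, weights=[dist[a] for a in actions])[0]

def act(history):
    """Sample (q_t, a_t): defer with probability query_prob, else act."""
    if random.random() < query_prob(history):
        # q_t = 1: defer; the action is drawn from the demonstrator.
        return (1, sample(demonstrator_dist(history)))
    # q_t = 0: the imitator picks the action itself.
    return (0, sample(imitator_dist(history)))

print(act([]))  # e.g. (0, 'a2') or (1, 'a3')
```

Sampling the pair at once is equivalent to first sampling $q_t$ and then drawing $a_t$ from whichever distribution $q_t$ selects.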
I agree with your description above about how it all works. But I guess I was not explaining well enough why I got confused and why the edits of inserting the $a$ and the bold text above would have stopped me getting confused. So I’ll try again.
I read the sentence fragment

The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,

below equation (3) as an explanatory claim that the value defined in equation (3) is the probability that the imitator picks the action itself instead of deferring to the demonstrator, given the history $h_{<t}$. However, this is not the value being defined by equation (3); instead, it defines the probability that the imitator picks the action itself instead of deferring to the demonstrator when the history is $h_{<t}$ and the next action taken is $a$.
The actual probability that the imitator picks the action itself given $h_{<t}$ is $\sum_{a \in \mathcal{A}} \pi^{i\alpha}(0, a \mid h_{<t})$, which is only mentioned in passing in the lines between equations (3) and (4).
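To spell out the distinction I mean (the $\Pr(\cdot)$ shorthand here is mine, not the paper's): equation (3) gives the per-action value $\pi^{i\alpha}(0, a \mid h_{<t})$, while the quantity I was after is the marginal

$$\Pr(q_t = 0 \mid h_{<t}) = \sum_{a \in \mathcal{A}} \pi^{i\alpha}(0, a \mid h_{<t}).$$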
So when I was reading the later sections of the paper and wanted to look back at the probability that the imitator would pick the action, my eye landed on equation (3) and the sentence below it. When I read that sentence, it stopped me from looking further to find the equation $\sum_{a \in \mathcal{A}} \pi^{i\alpha}(0, a \mid h_{<t})$, which is the equation I was really looking for. Instead, my mind auto-completed equation (3) by adding an $\mathrm{avg}_a$ term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me wondering how you were dealing with learning nondeterministic policies, if at all, etc.
So overall I think you can improve readability by doing some edits to draw attention more strongly to the conditional nature of $a$, and by foregrounding the definition of $\theta_q$ more clearly as a single-line equation.
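For example, a single line like the following is the kind of foregrounding I have in mind (this is only my sketch, and I am assuming $\theta_q$ is meant to denote the probability of deferring; the paper's own convention should take precedence):

$$\theta_q(h_{<t}) := \Pr(q_t = 1 \mid h_{<t}) = \sum_{a \in \mathcal{A}} \pi^{i\alpha}(1, a \mid h_{<t}).$$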