A policy outputs a distribution over {0,1}×A, and equations 3 and 4 define what this distribution is for the imitator. If it outputs (0, a), that means q_t = 0 and a_t = a, and if it outputs (1, a), that means q_t = 1 and a_t = a. When I say
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
that’s just describing the difference between equations 3 and 4. Look at equation 4 to see that when q_t = 1, the distribution over the action is equal to that of the demonstrator. So we describe the behavior that follows q_t = 1 as “deferring to the demonstrator”. If we look at the distribution over the action when q_t = 0, it’s something else, so we say the imitator is “picking its own action”.
The 0 on the l.h.s. means the imitator is picking the action a itself instead of deferring to the demonstrator or picking one of the other actions???
The 0 means the imitator is picking the action itself, and the a means the action it picks is a rather than some other action.
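To make that structure concrete, here is a minimal sketch of the shape being described; it is purely hypothetical and not the paper’s actual equations (3) and (4). In it, π^d stands for the demonstrator’s policy and θ_q for the probability mass the imitator puts on deferring, both names introduced here only for illustration.

```latex
% Hypothetical sketch of the structure, not the paper's equations (3)-(4).
% \pi^d = demonstrator's policy, \theta_q = probability of deferring (assumed names).
\begin{align*}
\pi^i_\alpha(1, a \mid h_{<t}) &= \theta_q(h_{<t}) \cdot \pi^d(a \mid h_{<t})
  && \text{$q_t = 1$: defer, so the action follows the demonstrator} \\
\pi^i_\alpha(0, a \mid h_{<t}) &= \text{(whatever equation (3) specifies for action $a$)}
  && \text{$q_t = 0$: the imitator picks $a$ itself}
\end{align*}
```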
I agree with your description above of how it all works. But I guess I was not explaining well enough why I got confused, and why the edits above (inserting the a and the bold text) would have stopped me from getting confused. So I’ll try again.
I read the sentence fragment
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
below equation (3) as an explanatory claim that the value defined in equation (3) is the probability that the imitator picks the action itself instead of deferring to the demonstrator, given the history h_{<t}. However, that is not the value being defined by equation (3); instead it defines the probability that the imitator picks the action itself instead of deferring to the demonstrator when the history is h_{<t} and the next action taken is a.
The actual probability that the imitator picks the action itself given h_{<t} is ∑_{a∈A} π^i_α(0, a | h_{<t}), which is only mentioned in passing in the lines between equations (3) and (4).
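As a concrete illustration of this joint-versus-marginal distinction, here is a toy sketch with made-up numbers (nothing here comes from the paper). It represents the imitator’s output as a distribution over (q, a) pairs, reads off an equation (3)-style per-action probability, and then recovers the marginal ∑_{a∈A} π^i_α(0, a | h_{<t}):

```python
# Toy joint distribution over (q, a): q = 0 means "pick the action itself",
# q = 1 means "defer to the demonstrator". Numbers are made up for illustration.
joint = {
    (0, "left"): 0.10,   # pi(0, left | h_<t): imitator picks "left" itself
    (0, "right"): 0.15,  # pi(0, right | h_<t): imitator picks "right" itself
    (1, "left"): 0.45,   # pi(1, left | h_<t): defers, demonstrator chooses "left"
    (1, "right"): 0.30,  # pi(1, right | h_<t): defers, demonstrator chooses "right"
}

# Equation (3)-style quantity: probability of not deferring AND taking one specific action a.
p_pick_left = joint[(0, "left")]  # 0.10

# The quantity being discussed here: probability of not deferring at all,
# i.e. the marginal  sum over a of pi(0, a | h_<t).
p_pick_any = sum(p for (q, _a), p in joint.items() if q == 0)  # 0.25

print(p_pick_left, p_pick_any)
```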
So when I was reading the later sections of the paper and wanted to look back at the probability that the imitator would pick the action itself, my eye landed on equation (3) and the sentence below it. Reading that sentence stopped me from looking further for the equation ∑_{a∈A} π^i_α(0, a | h_{<t}), which is what I was really looking for. Instead my mind auto-completed equation (3) by adding an avg_a term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me wondering how you were dealing with learning nondeterministic policies, if at all, etc.
So overall I think you can improve readability with some edits that draw attention more strongly to the conditional nature of a, and that foreground the definition of θ_q more clearly as a single-line equation.
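For what it’s worth, one hypothetical form such a single-line definition could take, assuming θ_q is meant as the probability of deferring (if it is instead the probability of acting, take the complement), is:

```latex
% Hypothetical one-line definition of the query probability (assumed convention).
\theta_q(h_{<t}) \;:=\; \sum_{a \in \mathcal{A}} \pi^i_\alpha(1, a \mid h_{<t})
```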