I agree with your description above about how it all works. But I guess I was not explaining well enough why I got confused, and why the suggested edits above (inserting the a and adding the bold text) would have prevented that confusion. So I’ll try again.
I read the sentence fragment
The 0 on the l.h.s. means the imitator is picking the action itself instead of deferring to the demonstrator,
below equation (3) as an explanatory claim that the value defined in equation (3) is the probability that the imitator picks the action itself instead of deferring to the demonstrator, given the history h_{<t}. However, this is not the value defined by equation (3); instead, it defines the probability that the imitator picks the action itself instead of deferring to the demonstrator when the history is h_{<t} and the next action taken is a.
The actual probability that the imitator picks the action itself under h_{<t} is given by ∑_{a∈A} π^i_α(0, a | h_{<t}), which is only mentioned in passing in the lines between equations (3) and (4).
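To spell out the distinction I am drawing, here are the two quantities side by side (my reconstruction of the paper's notation, so the exact symbols may not match the source exactly):

```latex
% Quantity defined by equation (3): joint probability that the imitator
% does not defer (the 0) AND takes the particular action a, given h_{<t}:
%   \pi^i_\alpha(0, a \mid h_{<t})
%
% Quantity a reader scanning for "probability of not deferring" actually
% wants: the marginal over actions, given h_{<t}:
%   \sum_{a \in A} \pi^i_\alpha(0, a \mid h_{<t})
\pi^i_\alpha(0, a \mid h_{<t})
\quad \text{vs.} \quad
\sum_{a \in A} \pi^i_\alpha(0, a \mid h_{<t})
```

The first depends on which action a is taken; only the second is the overall probability of the imitator acting on its own under h_{<t}.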
So when I was reading the later sections of the paper and wanted to look back at the probability that the imitator would pick the action, my eye landed on equation (3) and the sentence below it. Reading that sentence stopped me from looking further for the equation ∑_{a∈A} π^i_α(0, a | h_{<t}), which was the one I was really looking for. Instead my mind auto-completed equation (3) by adding an avg_a term to it, which makes for a much more conservative querying policy than the one you defined, and this then got me wondering how you were dealing with learning nondeterministic policies, if at all, etc.
So overall I think you can improve readability by making edits that draw attention more strongly to the conditional nature of a, and by foregrounding the definition of θ_q more clearly as a single-line equation.