That sounds correct, but in the first step, I think what they are optimizing is not (just) features but representations of those features. The natural-language descriptions of the features are very simple ones that require no expertise, but some representations, and some ways of wiring them together, are more conducive to learning and to building the higher-level features that Stockfish uses. But, again, it doesn't sound like they are applying much optimization power at this step.
Also, one thing that seems simply to be omitted is whether the network in the first step is the same network used in the later steps, or whether the later steps introduce more layers to exploit the same features.
Features and representations: agreed. (I wasn’t trying to be precise.)
I assumed the network in the first step is the same one used later, but agree that the paper doesn't make this explicit.
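A minimal PyTorch sketch of the two readings being discussed, purely illustrative: the layer sizes, the freezing choice, and the names `feature_net`/`ExtendedNet` are all assumptions, not anything from the paper. In the first reading, the step-1 network is simply trained further in later steps; in the second, later steps graft new layers on top of the step-1 feature layers.

```python
import torch
import torch.nn as nn

# Reading 1: one network, trained in step 1 and then just fine-tuned later.
# (768/256 are arbitrary illustrative dimensions, not the paper's.)
feature_net = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),  # layers that learn the simple features
    nn.Linear(256, 1),               # evaluation head
)

# Reading 2: later steps add extra layers on top of the step-1 feature layers,
# optionally freezing them, to exploit the same features.
class ExtendedNet(nn.Module):
    def __init__(self, feature_net: nn.Sequential):
        super().__init__()
        self.features = feature_net[:-1]   # reuse step-1 layers, drop its head
        for p in self.features.parameters():
            p.requires_grad = False        # freezing is one possible choice
        self.extra = nn.Sequential(        # new capacity introduced later
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):
        return self.extra(self.features(x))

# Usage: net = ExtendedNet(feature_net); out = net(torch.randn(4, 768))
```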