Can you share the hyperparameters used to make this figure?
Ah, never mind, I believe I found the relevant hyperparameters here: https://github.com/adamimos/epsilon-transformers/blob/main/examples/msp_analysis.ipynb
In particular, what I needed to know is that the model has only a single attention head per layer, and 4 layers.
Actually, I would still really appreciate the training hyperparameters, like batch size, learning rate schedule...
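In the meantime, here is a minimal sketch of how I'd reconstruct the model setup, assuming a TransformerLens-style config. To be clear about what's grounded: only `n_layers=4` and `n_heads=1` are confirmed by the notebook above; every other value, and in particular all of the training hyperparameters at the end, is a placeholder assumption on my part.

```python
# Minimal sketch of the model setup. Only n_layers=4 and n_heads=1 come
# from the linked notebook; all other values are placeholder assumptions.
import torch
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=4,     # confirmed: 4 layers
    n_heads=1,      # confirmed: a single attention head per layer
    d_model=64,     # assumption: small residual stream width
    d_head=64,      # assumption: d_model / n_heads
    d_mlp=256,      # assumption: 4 * d_model
    n_ctx=10,       # assumption: short context for a small HMM-generated process
    d_vocab=3,      # assumption: e.g. a 3-symbol emission alphabet
    act_fn="relu",  # assumption
)
model = HookedTransformer(cfg)

# The training hyperparameters below are pure placeholders -- these are
# exactly the values (batch size, learning rate schedule, ...) I'm asking about.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batch_size = 64  # placeholder
```

The optimizer choice, learning rate, and batch size at the end are exactly the unknowns I'm asking about, so treat them as stand-ins rather than the authors' settings.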
A simple suggestion on word usage: change “belief state” to “interpretive state.” This would align your comments better with disciplines more concerned with behavior than cognition. JL Tropea.
I think you may have meant this as a top-level comment rather than a reply to my comment?