Very strongly agree with the size considerations for future work, but I'd be most interested to see whether a notably larger model shows less “bag of heuristics” behavior and more holistic, integrated, interdependent heuristics. Even if the task/data at hand is simple and narrowly scoped, there may be fundamental size thresholds for network organization and complexity for any given task.
Also, I suspect that, parameter for parameter, the model would perform better if trained with ternary weights as in BitNet b1.58. The pretraining performance gains at similar parameter counts in that work make sense if the ternary constraint forces network reorganization rather than floating-point compromises in cases where nodes are multi-role. Board games, given the fairly unambiguous nature of the data, seem like a case where this constrained reorganization vs. node compromise would be an even more significant gain.
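For concreteness, the BitNet b1.58 paper quantizes each weight matrix to {-1, 0, +1} via absmean scaling. A minimal sketch of that quantization step (NumPy, values chosen for illustration):

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Absmean ternary quantization as described in BitNet b1.58:
    scale by the mean absolute weight, then round and clip to {-1, 0, +1}."""
    gamma = np.mean(np.abs(W)) + eps
    return np.clip(np.round(W / gamma), -1, 1)

W = np.array([[0.9, -0.05, -1.2],
              [0.3,  0.02, -0.4]])
Wq = ternarize(W)
# Wq == [[ 1, 0, -1],
#        [ 1, 0, -1]]
```

The point of the constraint is that a node can no longer split the difference with small fractional weights; it must commit to a role, which is the reorganization pressure suggested above.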
It might additionally be interesting to add synthetic data generated by a model trained to predict games backwards. A considerable fraction of the original Othello-GPT training data was synthetic, and patterns overrepresented in forward-generated games could be balanced out by backwards-generated gameplay. I’d mostly been thinking about this in terms of Chess-GPT and improving competency ratings, but expanding the training data with bidirectionally generated games, instead of only unidirectionally generated synthetic games, might further reduce the rate of predicted illegal moves with no changes to the network training itself.
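A rough sketch of the data-mixing idea, assuming a hypothetical `backward_samples` list of move sequences already generated in reverse order by a backwards-trained model (the function names and the 50/50 mixing ratio are illustrative, not from any of the papers):

```python
import random

def make_training_mix(forward_games, backward_samples, backward_frac=0.5, seed=0):
    """Blend forward-generated games with games sampled from a
    backwards-trained model. The backward samples are move sequences
    generated in reverse, so flip them back into forward order
    before adding them to the corpus."""
    rng = random.Random(seed)
    flipped = [list(reversed(g)) for g in backward_samples]
    n_back = int(len(forward_games) * backward_frac)
    mix = forward_games + rng.sample(flipped, min(n_back, len(flipped)))
    rng.shuffle(mix)
    return mix
```

The downstream training loop would be unchanged; only the corpus composition shifts, which is the appeal of the approach.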
Really glad this toy model is continuing to get such exciting and interesting deeper analyses.