Do you think a vision transformer trained on 2-dimensional images of the board state would also come up with a bag of heuristics, or would it naturally learn a translation-invariant algorithm, taking advantage of the uniform way the architecture could process the board? (Let’s say there are 64 patches of 1 pixel by 1 pixel, perfectly aligned with the 64 board locations of an 8x8-pixel image, to make it maximally “easy” both for the model and for interpretability work.)
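To make the setup concrete, here is a minimal sketch of the input format I have in mind (assuming NumPy and a hypothetical three-channel empty/black/white encoding; the specific encoding doesn’t matter for the question):

```python
import numpy as np

# Hypothetical encoding of the "maximally easy" input: an 8x8 board image with
# one channel per square content (channel 0 = empty, 1 = black, 2 = white).
board = np.zeros((8, 8, 3), dtype=np.float32)
board[:, :, 0] = 1.0                      # every square starts empty
for (r, c, ch) in [(3, 4, 1), (4, 3, 1), (3, 3, 2), (4, 4, 2)]:
    board[r, c, 0] = 0.0                  # square is no longer empty
    board[r, c, ch] = 1.0                 # the four initial discs

# With 1x1 patches aligned to the squares, "patchifying" is just a reshape:
# 64 patch tokens, each describing exactly one board location.
tokens = board.reshape(64, 3)
assert tokens.shape == (64, 3)
```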
And would the answer differ based on whether one used an explicit 2D positional embedding, a learned embedding, or a 1D positional embedding that ordered the patches from top to bottom, right to left?
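For reference, the variants I have in mind look roughly like this (a PyTorch sketch; `d_model` and the factored row/column encoding are my own assumptions, not anything from the OthelloGPT setup):

```python
import torch
import torch.nn as nn

d_model = 32  # assumed embedding width

# (a) Learned 1D positional embedding over a fixed raster ordering of the 64
#     patches (e.g. top to bottom, right to left).
learned_pos = nn.Embedding(64, d_model)
pe_1d = learned_pos(torch.arange(64))                 # (64, d_model)

# (b) Explicit 2D positional encoding: concatenate a row code and a column
#     code, so moving one square along an axis changes only half the vector.
row_emb = nn.Embedding(8, d_model // 2)
col_emb = nn.Embedding(8, d_model // 2)
rows, cols = torch.meshgrid(torch.arange(8), torch.arange(8), indexing="ij")
pe_2d = torch.cat(
    [row_emb(rows.reshape(-1)), col_emb(cols.reshape(-1))], dim=-1
)                                                     # (64, d_model)
```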
I know that, of course, giving a vision transformer the actual board state like this short-circuits the cool part where OthelloGPT tries to learn its own representation of the board. But I’m wondering whether, even in this supposedly easy setting, it would still end up imperfect, with a tiny error rate and a bag-of-heuristics-like way of computing legal moves.
And brainstorming a bit here: a slightly more interesting setting that might not short-circuit the cool part would be if the input to the vision transformer were a 3D “video” of the moves on the board, e.g. input[t][x][y] is 1 if a move was made at (x, y) on turn t, and 0 otherwise. Self-attention would presumably be causally masked on the t dimension but not on x and y. Would we get a bag of heuristics here in the computation of the board state and of the legal moves from that state?
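Concretely, the input tensor and the attention mask I’m imagining would be something like this (a sketch; T = 60 and the flattened token ordering are assumptions):

```python
import torch

T = 60  # maximum number of moves in an Othello game

def moves_to_video(moves):
    """moves: list of (x, y) board coordinates, one per turn."""
    video = torch.zeros(T, 8, 8)
    for t, (x, y) in enumerate(moves):
        video[t, x, y] = 1.0          # input[t][x][y] = 1 iff a move was made there on turn t
    return video

def spacetime_causal_mask(T: int, n_cells: int = 64) -> torch.Tensor:
    """Boolean mask over the T*64 tokens: token (t, x, y) may attend to
    (t', x', y') iff t' <= t, i.e. causal in time, unrestricted in space."""
    t_idx = torch.arange(T).repeat_interleave(n_cells)   # time index of each flattened token
    return t_idx.unsqueeze(0) <= t_idx.unsqueeze(1)      # (T*64, T*64), rows = queries
```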
I would guess that it would learn an exact algorithm rather than heuristics. The challenging part for OthelloGPT is that the naive algorithm for calculating the board state from the input tokens requires up to 60 sequential steps, while it has only 8 layers in which to compute the board state and convert it into a probability distribution over legal moves.
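For concreteness, that naive computation is just a sequential replay of the game, something like the plain-Python sketch below (using +1/-1 for the two players): the state after move t depends on the state after move t-1, which is what forces up to 60 sequential steps.

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def apply_move(board, x, y, player):
    """Play `player` (+1 or -1) at (x, y) and flip captured discs in place."""
    board[x][y] = player
    for dx, dy in DIRECTIONS:
        flips, nx, ny = [], x + dx, y + dy
        # Walk over a run of opponent discs in this direction.
        while 0 <= nx < 8 and 0 <= ny < 8 and board[nx][ny] == -player:
            flips.append((nx, ny))
            nx, ny = nx + dx, ny + dy
        # Flip the run only if it is bracketed by one of our own discs.
        if 0 <= nx < 8 and 0 <= ny < 8 and board[nx][ny] == player:
            for fx, fy in flips:
                board[fx][fy] = player

def replay(moves):
    """Recover the board state by applying the moves one after another."""
    board = [[0] * 8 for _ in range(8)]
    board[3][3] = board[4][4] = -1
    board[3][4] = board[4][3] = 1
    for t, (x, y) in enumerate(moves):
        apply_move(board, x, y, player=1 if t % 2 == 0 else -1)
    return board
```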