This thread analyzes what is going on under the hood with the chess transformer. It claims the model is a stronger player than the Stockfish version it was distilled from, at the cost of more compute, but only by a fixed multiplier, so it remains O(1).
I found this claim suspect, because this is basically not a thing that happens in board games. In complex strategy board games like chess, a practical amount of search on top of a good prior policy and/or eval function (which Stockfish has) almost always outperforms any pure forward-pass policy model that doesn't do explicit search, even when that pure policy model is quite large and extensively trained. With any reasonable settings, it's very unlikely that distilling Stockfish into a pure policy model produces a better player than Stockfish itself.
I skimmed the paper (https://arxiv.org/pdf/2402.04494) and had trouble finding any such claim, and indeed the original poster of that thread later retracted it, attributing it to their own mistake in interpreting the paper's data table. The post where they acknowledge the mistake is much less prominent than the original post: https://x.com/sytelus/status/1848239379753717874 . The chess transformer remains quite a bit weaker than the Stockfish it tries to predict/imitate.
I assume you’re familiar with the case of the parallel postulate in classical geometry being independent of the other axioms? That independence corresponds to the existence of spherical/hyperbolic geometries (i.e. actual models in which the axiom is false) alongside normal flat Euclidean geometry (i.e. an actual model in which it is true).
To me, this is a clear example of there being no such thing as an “objective” truth about the validity of the parallel postulate: you are entirely free to assume either it or incompatible alternatives. You end up with equally valid theories; it’s just that those theories are applicable to different models, and those models are each useful in different situations, so the only thing it comes down to is which models you happen to want to use or explore or prove things about on a given day.
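As a toy numeric illustration of a model where the postulate fails (my own sketch, not from the discussion above): on the unit sphere, the “octant” triangle with three mutually perpendicular vertices has an angle sum of 3π/2, whereas in the Euclidean plane the sum is always exactly π. The angle-sum property is equivalent to the parallel postulate, so computing it directly exhibits a concrete model in which the postulate is false.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    n = math.sqrt(dot(u, u))
    return tuple(x / n for x in u)

def spherical_angle(at, b, c):
    """Angle at vertex `at` of a spherical triangle (vertices as unit
    vectors): the angle between the tangents of the two great-circle
    arcs leaving `at` toward b and toward c."""
    def tangent(toward):
        # Project `toward` onto the tangent plane at `at`, then normalize.
        t = tuple(x - dot(toward, at) * a for x, a in zip(toward, at))
        return norm(t)
    return math.acos(dot(tangent(b), tangent(c)))

# Octant triangle: three mutually perpendicular points on the unit sphere.
A, B, C = (1, 0, 0), (0, 1, 0), (0, 0, 1)
total = (spherical_angle(A, B, C)
         + spherical_angle(B, A, C)
         + spherical_angle(C, A, B))
print(total / math.pi)  # approximately 1.5: angle sum 3*pi/2 > pi
```

An angle sum exceeding π is impossible in the Euclidean plane, so the sphere and the plane genuinely disagree about a statement equivalent to the parallel postulate, while each is a perfectly consistent geometry.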
Similarly for the huge variety of different algebraic or topological structures (groups, ordered fields, manifolds, etc.): it is extremely common to have statements that are independent of the axioms; e.g. in a ring, it is independent of the axioms whether multiplication is commutative. And both choices are valid: we have commutative rings and we have noncommutative rings, and both are self-consistent mathematical structures that one might wish to study.
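To make the ring example concrete (a minimal sketch of my own, using the two standard witnesses): the integers satisfy the ring axioms with commutative multiplication, while 2×2 integer matrices satisfy the same axioms but admit a counterexample to commutativity. Exhibiting one model of each kind is exactly what shows the commutativity statement is independent.

```python
# Two structures satisfying the ring axioms: one where multiplication
# commutes (integers), one where it does not (2x2 integer matrices).

def mat_mul(a, b):
    """Multiply two 2x2 matrices represented as ((p, q), (r, s))."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0],
         a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0],
         a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )

# Integers: multiplication commutes for every pair we check.
assert all(x * y == y * x for x in range(-5, 5) for y in range(-5, 5))

# Matrices: a single counterexample shows commutativity fails here.
A = ((0, 1), (0, 0))
B = ((0, 0), (1, 0))
print(mat_mul(A, B))  # ((1, 0), (0, 0))
print(mat_mul(B, A))  # ((0, 0), (0, 1))
assert mat_mul(A, B) != mat_mul(B, A)
```

Both structures obey every ring axiom; they simply disagree on the extra statement, which is precisely the sense in which that statement is independent.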
Loosely analogous to how one can write a compiler/interpreter for a programming language within another programming language, some theories can easily simulate other theories. Set theories are particularly good and convenient for simulating other theories, but one can also simulate set theories within seemingly more “primitive” theories (e.g. within theories of basic arithmetic, via Gödel numbering), which might be analogous to someone writing a C compiler in Brainfuck. Just as it’s meaningless to talk about whether a programming language, or a given sub-version or feature extension of one, is more “objectively true” than another, many take the position that the same holds for different set theories.
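The coding trick behind Gödel numbering can be shown in a few lines (a toy sketch of the classic prime-exponent encoding, nowhere near a full arithmetization of set theory): any finite sequence of naturals is packed into a single natural number, and recovered by reading off prime exponents, so statements about sequences become statements about one number.

```python
# Toy Gödel numbering: encode a finite sequence of naturals as a single
# natural number via prime exponents, then decode it back.

def primes():
    """Yield 2, 3, 5, ... by trial division (fine for tiny inputs)."""
    n, found = 2, []
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1

def encode(seq):
    """Map [a0, a1, ...] to 2^(a0+1) * 3^(a1+1) * ...
    (the +1 keeps trailing zeros distinguishable)."""
    n = 1
    for p, a in zip(primes(), seq):
        n *= p ** (a + 1)
    return n

def decode(n):
    """Recover the sequence by reading off the exponent of each prime."""
    seq = []
    for p in primes():
        if n == 1:
            return seq
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        seq.append(e - 1)

code = encode([3, 0, 2])
print(code)  # 2^4 * 3^1 * 5^3 = 6000
assert decode(code) == [3, 0, 2]
```

The real construction goes much further, encoding formulas and proofs as such numbers so that arithmetic can talk about them, but the round trip above is the basic mechanism.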
When you say you’re “leaning towards a view that maintains objective mathematical truth” with respect to certain axioms, is there some fundamental principle by which you’re distinguishing the axioms you want to assign objective truth to from axioms like the parallel postulate or the commutativity of ring multiplication, which obviously have no objective truth? Or do you think that even in these latter cases there is still an objective truth?