We see that the model trained to be good at Othello seems to have a much worse world model.
That seems at odds with what optimization theory would suggest: in the limit of compute (or even of data), the representations should converge to the optimal ones. Instrumental convergence points the same way. I don't get why a model trained on Othello-related tasks wouldn't converge to such a (useful) representation.
IMHO this point is a bit overlooked. Perhaps it's worth investigating why simply playing Othello isn't enough? Does it have to do with the randomly initialized priors? I feel this could be very important, especially from a mech interp viewpoint: you could have different (maybe incomplete) heuristics or representations yielding the same loss. Kinda reminds me of the EAI paper, which hinted that different learning rates (often) reach the same loss but converge on different attention patterns and representations.
Perhaps there’s some variable here that we’re not considering/evaluating closely enough...
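To make the "same loss, different representations" worry a bit more concrete, here's a minimal probing sketch of how I'd start checking it. This is just an assumed setup, not the paper's method: `get_hidden_states`, `model_a`, `model_b`, `games`, and `square_labels` are hypothetical placeholders for whatever activation extraction and board-state labels one actually has.

```python
# Sketch: compare how linearly decodable the board state is from two
# Othello models that reach roughly the same next-move loss.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_accuracy(hidden_states: np.ndarray, board_labels: np.ndarray) -> float:
    """Fit a linear probe from hidden states to one board-square label
    (e.g. empty / own / opponent) and return held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, board_labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)


# Hypothetical usage, assuming both models hit ~the same loss on the same games:
# acc_a = probe_accuracy(get_hidden_states(model_a, games), square_labels)
# acc_b = probe_accuracy(get_hidden_states(model_b, games), square_labels)
# A large gap between acc_a and acc_b would suggest they reached similar loss
# via different (more or less board-like) internal representations.
```

If the probe accuracies diverge a lot across runs or training setups, that would at least tell us the loss alone isn't pinning down the representation.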