Very informative toy examples. Regarding this point:
> Some kind of failure of spatial reasoning (wandering items, whatever was going on with some of the sliding square chain-of-thoughts where pieces vanished)
I would strongly agree with this. I actually think the sliding block puzzle is a task that might just be easy for humans on account of our strong spatial priors. In the physical world, things move with spatial locality, and two objects cannot occupy the same place. The LLM, by contrast, is trained on orders of magnitude less data from which to learn those regularities as they appear in text-based drawings of 2D scenes. In other words, basic physical reasoning priors that are second nature to us (thanks to many thousands of hours of "training data") may not be fully instilled in the model.
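To make those two constraints concrete, here's a toy sketch (my own code, not from the post) of a sliding-tile puzzle where both invariants are enforced by construction: a tile can only slide into an adjacent blank cell (spatial locality), and every move is a swap, so pieces can never overlap or vanish. The function names and the 3x3 example are just illustrative.

```python
from typing import List, Tuple

Board = List[List[int]]  # 0 denotes the blank cell


def find(board: Board, value: int) -> Tuple[int, int]:
    """Return the (row, col) of a given tile value."""
    for r, row in enumerate(board):
        for c, v in enumerate(row):
            if v == value:
                return r, c
    raise ValueError(f"{value} not on board")


def is_legal_move(board: Board, tile: int) -> bool:
    """Spatial locality: a tile may only move if it is orthogonally adjacent to the blank."""
    tr, tc = find(board, tile)
    br, bc = find(board, 0)
    return abs(tr - br) + abs(tc - bc) == 1  # Manhattan distance 1


def slide(board: Board, tile: int) -> Board:
    """Exclusivity: a move is a swap with the blank, so no two tiles ever share a cell
    and no tile disappears (the multiset of values is preserved)."""
    if not is_legal_move(board, tile):
        raise ValueError(f"tile {tile} is not adjacent to the blank")
    tr, tc = find(board, tile)
    br, bc = find(board, 0)
    new = [row[:] for row in board]
    new[br][bc], new[tr][tc] = new[tr][tc], new[br][bc]
    return new


if __name__ == "__main__":
    start = [[1, 2, 3],
             [4, 0, 5],
             [6, 7, 8]]
    print(slide(start, 2))   # legal: 2 sits directly above the blank
    # slide(start, 8) would raise, since 8 is not adjacent to the blank
```

A model reasoning in text has to maintain these invariants "by hand" across its chain of thought, which is exactly where the wandering and vanishing pieces show up.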
I emphasize this because I worry that any test of generality which invokes 2D space might actually be a test of the strength of spatial priors.
I would love to hear thoughts on (a) whether spatial reasoning is needed to automate coding / get self-improving AI, and (b) examples of clear LLM failures on math/logic reasoning which don't invoke a spatial reasoning prior.
(+ hope this is in line with community norms—haven’t really posted on LW much!)