Yep, I had considered doing that. Sadly, even if resample ablations on the filler tokens reduced performance, that wouldn’t necessarily imply that the filler tokens are being used for extra computation. For example, the model could just copy the relevant details from the problem into the filler token positions and solve it there.
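For concreteness, here is a rough sketch of what a resample ablation on the filler positions might look like. It assumes cached activations in a NumPy array of shape `(batch, seq, d_model)` and a boolean mask marking the filler positions; `resample_ablate`, `acts`, and `filler_mask` are hypothetical names for illustration, not from any particular library or from the original experiment.

```python
import numpy as np

def resample_ablate(acts, filler_mask, rng):
    """Swap each example's filler-position activations for another example's.

    acts:        (batch, seq, d_model) array of cached activations (assumed layout).
    filler_mask: (seq,) boolean array, True at filler token positions.
    rng:         a numpy.random.Generator.
    Requires batch >= 2 so every example has a different donor.
    """
    batch = acts.shape[0]
    # Shift each index by a random nonzero offset so no example donates to itself.
    offsets = rng.integers(1, batch, size=batch)
    donors = (np.arange(batch) + offsets) % batch

    patched = acts.copy()
    # Overwrite only the filler positions; all other positions stay untouched.
    patched[:, filler_mask, :] = acts[donors][:, filler_mask, :]
    return patched, donors
```

A performance drop after running the model forward from `patched` would only show that the filler positions carry input-dependent information. It would not distinguish "extra computation happens there" from "the problem details were copied there and solved in place".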
Oh hmm that’s very clever and I don’t know how I’d improve the method to avoid this.