We store activations in a buffer of ~500k tokens, which is refilled and shuffled whenever 50% of the tokens are used (i.e. Neel’s approach).
I am not sure I understand the reasoning around this approach. Why do you want to refill and shuffle tokens whenever 50% of the tokens are used? Is this just tokens in the training set or also the test set? In Neel’s code I didn’t see a train/test split, isn’t that important? Also, can you track the number of epochs of training when using this buffer method (it seems like that makes it more difficult)?
Why do you want to refill and shuffle tokens whenever 50% of the tokens are used?
Neel was advised by the authors that it was important to minimise the number of tokens from the same prompt within a batch. This approach leads to a buffer having activations from many different prompts fairly quickly.
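To make the refill-and-shuffle idea concrete, here is a minimal sketch. It is not Neel's actual code: `ActivationBuffer`, `prompt_stream`, and the `(prompt_id, position)` tuples standing in for activation vectors are all hypothetical names for illustration, and the sizes are tiny so it runs instantly.

```python
import random
from typing import Iterator, List, Tuple

# Stand-in for a real activation vector: (prompt_id, token_position).
Activation = Tuple[int, int]


def prompt_stream(tokens_per_prompt: int) -> Iterator[List[Activation]]:
    """Hypothetical data source: yields one new prompt's activations at a time."""
    prompt_id = 0
    while True:
        yield [(prompt_id, pos) for pos in range(tokens_per_prompt)]
        prompt_id += 1


class ActivationBuffer:
    """Shuffle buffer that is topped up once a fraction of it has been served."""

    def __init__(self, prompts: Iterator[List[Activation]], buffer_size: int,
                 refill_fraction: float = 0.5, seed: int = 0) -> None:
        self.prompts = prompts
        self.buffer_size = buffer_size
        self.refill_threshold = int(buffer_size * refill_fraction)
        self.rng = random.Random(seed)
        self.buffer: List[Activation] = []
        self.pointer = 0
        self._refill()

    def _refill(self) -> None:
        # Drop activations that were already served; keep the unused ones.
        self.buffer = self.buffer[self.pointer:]
        self.pointer = 0
        # Top up with activations from prompts that have not been seen before.
        while len(self.buffer) < self.buffer_size:
            self.buffer.extend(next(self.prompts))
        # Shuffle so old and new prompts are mixed within each batch.
        self.rng.shuffle(self.buffer)

    def next_batch(self, batch_size: int) -> List[Activation]:
        if batch_size > self.buffer_size - self.refill_threshold:
            raise ValueError("batch_size too large for this buffer")
        batch = self.buffer[self.pointer:self.pointer + batch_size]
        self.pointer += batch_size
        if self.pointer >= self.refill_threshold:
            self._refill()
        return batch


buf = ActivationBuffer(prompt_stream(tokens_per_prompt=128), buffer_size=1024)
batch = buf.next_batch(64)
print(len({prompt_id for prompt_id, _ in batch}), "distinct prompts in the batch")
```

Because the shuffle covers the whole buffer, each batch draws from several prompts rather than from one contiguous prompt, and no activation is served twice.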
Is this just tokens in the training set or also the test set? In Neel’s code I didn’t see a train/test split, isn’t that important?
I never evaluate on tokens from prompts used in training; instead, I just sample new prompts from the buffer. Some libraries set aside a fixed set of tokens that is reused for evaluations. I don’t currently do anything like this, but it might be reasonable. In general, I’m not worried about overfitting.
Also, can you track the number of epochs of training when using this buffer method (it seems like that makes it more difficult)?
Epochs make sense in a data-limited regime, which we aren’t in. OpenWebText has far more tokens than we ever train any sparse autoencoder on, so we’re always well under one epoch. We never reuse the same activations when training, but may use more than one activation from the same prompt.
Neel was advised by the authors that it was important to minimise the number of tokens from the same prompt within a batch. This approach leads to a buffer having activations from many different prompts fairly quickly.
Oh I see, it’s a constraint on the tokens from the vocabulary rather than the prompts. Does the buffer ever reuse prompts or does it always use new ones?