Super interesting, thanks! I hadn’t come across that work before, and that’s a cute and elegant definition.
To me, it’s natural to extend this to specific substrings within a document? I believe models are trained with documents chopped up and concatenated into segments that fully fill the context window, so it feels odd to treat the document as the unit of analysis. And in some sense a 1000-token document is actually 1000 sub-tasks of predicting token k given the prefix up to token k−1, each of which can be memorised.
Maybe we should just not apply a gradient update to the tokens in the repeated substring, but keep the document in and measure loss on the rest?
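Concretely, I imagine that as per-token loss masking, something like this rough PyTorch sketch (the `memorised_mask` name is just a placeholder for however you'd flag the repeated-substring positions):

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, targets, memorised_mask):
    """Next-token cross-entropy that skips tokens flagged as repeated.

    logits:         (batch, seq_len, vocab) model outputs
    targets:        (batch, seq_len) next-token ids
    memorised_mask: (batch, seq_len) bool, True where the target token lies
                    inside a repeated substring and should get no gradient
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)

    keep = ~memorised_mask  # only these positions contribute to the loss
    # The repeated-substring tokens still sit in the context (they condition
    # later predictions), they just don't add anything to the gradient.
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```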