This seems reasonable, though the efficacy of the learning method seems unclear to me.
But:
"with a heavily-reinforced constraint that the author vectors are identical for documents which have the same author"
This seems wrong. To pick on myself: my peer-reviewed papers, my Substack, my LessWrong posts, my 1990s blog posts, and my Twitter feed are all substantively different in ways that I think the author vector should capture.
My guess is that we want to capture those differences with the time & date metadata instead (and to some extent, location and other metadata). That way, we can easily query what you-in-particular would say at other periods in your life (such as the future). However, I agree that this is at least not obvious.
Maybe a better way to do it would be to explicitly take both approaches, so that there’s an abstract-you vector which then gets mapped into a particular-you author space via combination with your age (i.e., with date & time). This attempts to explicitly capture the way you change over time (we can watch your vector move through the particular-author space), while still allowing us to query what you would say at times for which we don’t have evidence in the form of writing from you.
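To make the two-level idea concrete, here is a minimal sketch, not a claim about how the actual setup would be built. Everything in it is an assumption for illustration: the names `time_features` and `particular_vector`, the choice of age features, and the purely linear "drift" map. The abstract vector is the part that stays identical across all of an author's documents; the particular vector is what moves with date & time.

```python
import math
from datetime import date


def time_features(when: date, born: date) -> list[float]:
    """Hypothetical age encoding: age in years plus a log-age term.

    Assumes `when` is not earlier than `born`.
    """
    age = (when - born).days / 365.25
    return [age, math.log1p(age)]


def particular_vector(
    abstract: list[float],
    drift: list[list[float]],
    when: date,
    born: date,
) -> list[float]:
    """Map the shared abstract-author vector into particular-author space.

    drift[i][j] says how strongly time feature j moves dimension i.
    A linear map is an illustrative assumption; a learned network
    could take its place.
    """
    feats = time_features(when, born)
    return [
        a + sum(w * f for w, f in zip(row, feats))
        for a, row in zip(abstract, drift)
    ]


# One abstract vector per author, shared by every document they wrote.
born = date(1980, 1, 1)
abstract_you = [0.5, -1.0]
drift_you = [[0.01, 0.0], [0.0, 0.2]]

# Query a period with writing samples and one without (the future).
you_in_2000 = particular_vector(abstract_you, drift_you, date(2000, 1, 1), born)
you_in_2030 = particular_vector(abstract_you, drift_you, date(2030, 1, 1), born)
```

The same-author constraint would then apply only to `abstract_you`, while differences between, say, 1990s blog posts and recent papers show up as different points along the trajectory.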
Ideally, imagining the most sophisticated version of the setup, the model would be able to make date&time attributions very fine-grained, guessing when specific words were written & constructing a guessed history of revisions for a document. This complicates things yet further.