Phenomenology entails “bracketing,” which suspends judgment; in Quinean terms, bracketing may be considered quotation marks used in an attempt to ascend to a level of discourse in which judgment is less contingent and hence more sound: source attribution. My original motivation for suggesting Wikipedia to Marcus Hutter as the corpus for his Hutter Prize for Lossless Compression of Human Knowledge was to create an objective criterion for selecting language models that best achieve source attribution on the road to text prediction. Indeed, I originally wanted the change log, rather than Wikipedia itself, to be the corpus, but it was too big. Nevertheless, while all text within Wikipedia may, to first order, be attributed to Wikipedia, optimally approaching the Algorithmic Information content of Wikipedia would require discovering latent identities so as to better predict the characteristics of a given article. Moreover, there would need to be a latent taxonomy of identities affecting the conditional algorithmic probabilities (conditioned on the topic as well as on the applicable latent attribution), so that one might better predict the bias or spin a given identity places on an article, including its idioms and other modes of expression.
Grounding this in Algorithmic Information/Probability permits a principled means of quantifying the structure of bias, which is a necessary precursor to discovering the underlying truths latent in the text.
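To make this concrete, here is a minimal sketch of the idea, with the caveat that true Algorithmic Information (Kolmogorov complexity) is uncomputable: it uses `zlib` as a crude stand-in compressor, and approximates the conditional complexity K(article | identity) by the standard concatenation heuristic. The identity corpora and article text are placeholders; the point is only that "how much an identity explains an article" can be measured in bytes saved by conditioning.

```python
import zlib

def compressed_size(text: str) -> int:
    """Bytes of zlib-compressed text: a computable (and crude) upper
    bound standing in for uncomputable Kolmogorov complexity."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def conditional_size(text: str, context: str) -> int:
    """Approximate K(text | context) via the concatenation heuristic:
    K(context + text) - K(context)."""
    return compressed_size(context + text) - compressed_size(context)

def attribution_gain(article: str, identity_corpus: str) -> int:
    """Bytes saved by conditioning the article on a putative source
    identity's prior writing -- a rough proxy for how much of the
    article's spin that identity accounts for."""
    return compressed_size(article) - conditional_size(article, identity_corpus)
```

Under this proxy, an article attributes best to the latent identity whose corpus yields the largest gain; a taxonomy of identities would then refine the conditioning context further.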
Starting with a large language model is more problematic, but here’s a suggestion:
Perform parameter distillation to reduce the parameter count, improving interpretability on the way toward feature discovery. From there it may be easier to construct the taxonomy of identities and thence the bias structure.
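One common reading of parameter distillation is knowledge distillation: train a smaller student to match a large teacher's softened output distribution. The sketch below shows only the distillation objective (the quantity the student would minimize), with hypothetical logit vectors standing in for real model outputs; it is not tied to any particular model or framework.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 as is conventional so gradients stay comparable
    across temperatures. Zero iff the student matches the teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * temperature ** 2
```

A smaller student trained this way may be easier to probe for the latent identity features of interest, which is the interpretability motivation here.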
While I still think the Hutter Prize is the most rigorous approach to this problem, there is apparently insufficient familiarity with how one can practically incentivize an information criterion for model selection so as to achieve the desired end: extracting a rigorous notion of unbiased truth.
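The information criterion in question is simple to state: the Hutter Prize scores entries by the total size of the self-extracting archive, so the model's own description length and its residual error are penalized in the same unit, bytes. A minimal sketch of selecting among candidate models by that two-part code length (the candidate sizes below are made up for illustration):

```python
def mdl_score(decompressor_bytes: int, archive_bytes: int) -> int:
    """Two-part code length: size of the decompressor (the model)
    plus size of the compressed data given the model. Smaller is
    better; model complexity and fit are traded off in bytes."""
    return decompressor_bytes + archive_bytes

def select_model(candidates: dict) -> str:
    """Pick the candidate minimizing total description length.
    `candidates` maps a model name to (decompressor_bytes, archive_bytes)."""
    return min(candidates, key=lambda name: mdl_score(*candidates[name]))
```

A huge model that shaves a few bytes off the archive loses to a compact model with slightly worse compression, which is exactly the incentive structure the prize builds in.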