Subscripts for Probabilities
cross-posted from niplav.github.io
Gwern has wondered about a use-case for subscripts in hypertext. While they have settled on a specific use-case, namely years for citations, I propose a different one: reporting explicit probabilities.
Explicitly giving probabilities in day-to-day English text is usually quite clunky: “I assign 35% to North Korea testing an intercontinental ballistic missile before the end of this year” reads far less smoothly than “I don’t think North Korea will test an intercontinental ballistic missile this year”.
And since subscripts are a solution in need of a problem, one can wonder how well those two fit together: Quite well, I claim.
In short, I propose to append probabilities in subscript after a statement using standard HTML subscript notation (or LaTeX as a fallback if it’s available), with the probability possibly also being a link to a relevant forecasting platform with the same question:
I think Donald Trump is going to be incarcerated before 2030.
This is almost as readable as the sentence without the probability.
There are some complications with negations in sentences or multiple statements. For the most part, I’ll simply avoid such cases (“Doctor, it hurts when I do this!” “Don’t do that, then.”), but if I had to, I’d solve the first problem by declaring that the probability applies to the literal meaning of the previous sentence, including all negations; the problem with multiple statements is solved by delimiters.
As an example for the different kinds of negation: “The train won’t come more than 5 minutes late” would (arguendo) mean the same thing as “I don’t think the train will come more than 5 minutes late” means the same as “The train will take more than 5 minutes to arrive” equivalent to “I assign 90% probability to the train arriving within the next 5 minutes”.
With multiple statements, my favorite way of delimiting is currently half brackets: “I think ⸤it’ll rain tomorrow⸥, but ⸤Tuesday is going to be sunny⸥, but I don’t think ⸤your uncle is going to be happy about that⸥.”
The probabilities in this context aren’t quite evidentials, but neither are they veridicals nor miratives; I propose the word “credal” for this category.
Enumerating Possible Notations
The exact place of insertion is subtle: In sentences with a single central statement, there are multiple locations where one could place the probability.
After the verb related to belief: “I think<sub>55%</sub> it’ll rain tomorrow.”
Advantage: Close to the word relating to the belief (which could reflect the strength of belief in itself, using “guess”/”wager”/”think”/”believe”).
Disadvantages:
Conflicts with assigning probabilities to multiple statements.
Puts visual clutter before the statement in question.
At the end of the statement: “I think it’ll rain tomorrow<sub>55%</sub>.”
Advantage: Allows assigning probabilities to simple statements (“It’ll rain tomorrow”) and to multiple statements (see below).
Disadvantage: If the probability is intended to contextualise the statement, this context is weaker if it is introduced after the statement in question.
At the subject of the sentence: “I<sub>55%</sub> think it’ll rain tomorrow.”
Advantage: This can be used to distinguish the beliefs of different people. “I think it’ll rain tomorrow, but Cú Chulainn is skeptical about it.”
Disadvantage: Putting the probability before the statement the probability is about feels quite unnatural.
This becomes trickier in sentences with multiple statements.
Probabilities after each subclaim: “I think it’ll rain tomorrow, but Tuesday is going to be sunny, but I don’t think your uncle is going to be happy about that.”
Adding in delimiters to denote a specific subclaim the probability is about. I wonder whether there are better Unicode characters for this; corner brackets might be a good candidate.
Lower half brackets (or Quine corners which look almost the same): “I think ⸤it’ll rain tomorrow⸥, but ⸤Tuesday is going to be sunny⸥, but I don’t think ⸤your uncle is going to be happy about that⸥.”
Upper half brackets to the left, lower half brackets to the right: “I think ⸢it’ll rain tomorrow⸥, but ⸢Tuesday is going to be sunny⸥, but I don’t think ⸢your uncle is going to be happy about that⸥.”
Subscripted parentheses: “I think it’ll rain tomorrow, but Tuesday is going to be sunny, but I don’t think your uncle is going to be happy about that.”
Subscripted half guillemets: “I think it’ll rain tomorrow, but Tuesday is going to be sunny, but I don’t think your uncle is going to be happy about that.”
And subscripted full guillemets: “I think it’ll rain tomorrow, but Tuesday is going to be sunny, but I don’t think your uncle is going to be happy about that.”
I basically rule out lists of probabilities after the verb relating to each subclaim, as it’s very mentally taxing to relate each probability to each claim:
“I think ⸤it’ll rain tomorrow⸥, but ⸤Tuesday is going to be sunny⸥, but I don’t think ⸤your uncle is going to be happy about that⸥.”
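One practical upside of explicit delimiters is that the claim each probability belongs to can be recovered mechanically. The following is a minimal Python sketch, assuming every half-bracketed claim is immediately followed by an HTML `<sub>` percentage. The function name `extract_claims` and the percentages in the example are hypothetical and not taken from the post.

```python
import re
from typing import List, Tuple

# Assumed notation: ⸤claim⸥<sub>NN%</sub>, i.e. lower half brackets around the
# claim, directly followed by an HTML subscript holding a percentage.
CLAIM = re.compile(r"⸤([^⸤⸥]+)⸥<sub>(\d+(?:\.\d+)?)%</sub>")


def extract_claims(text: str) -> List[Tuple[str, float]]:
    """Return (claim, probability) pairs; probabilities are scaled to [0, 1]."""
    return [(m.group(1), float(m.group(2)) / 100) for m in CLAIM.finditer(text)]


# Hypothetical probabilities, for illustration only.
example = (
    "I think ⸤it’ll rain tomorrow⸥<sub>60%</sub>, "
    "but ⸤Tuesday is going to be sunny⸥<sub>80%</sub>."
)
claims = extract_claims(example)
```

Bracketed claims without a trailing subscript are simply skipped by this sketch, which may or may not be the behaviour one wants.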
Since the people writing the text reporting probabilities are probably logically non-omniscient bounded agents, it might also be useful to report the time or effort one has spent on refining the reported probability: “I reckon humanity will survive the 21st century”, indicating that the speaker has reflected on this question for 20 hours to arrive at their current probability (something akin to reporting an “epistemic effort” for a piece of information). I fear that this notation is getting into cumbersome territory and won’t be using it.
Notation Options and Difficulties
There are three available options: Either one’s writing platform supports HTML, in which case one can use the <sub>18</sub> tags; or it supports LaTeX, which creates a slightly fancier-looking but also more fragile notation using _{18\%}; or one’s platform directly supports subscripting, such as pandoc with ~18%~, but not Reddit Markdown (which does support superscript). More info about other platforms here.
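The three notations above can be produced from a single number. Here is a minimal Python sketch. The helper names are hypothetical, the `$...$` math delimiters around the LaTeX form are an assumption about the target platform, and the optional link mirrors the idea of pointing the probability at a forecasting question.

```python
from html import escape
from typing import Optional


def format_percent(p: float) -> str:
    """Turn a probability in [0, 1] into a whole-number percentage string."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return f"{round(p * 100)}%"


def html_subscript(p: float, url: Optional[str] = None) -> str:
    """HTML form, optionally wrapped in a link (e.g. to a forecasting question)."""
    label = format_percent(p)
    if url is not None:
        label = f'<a href="{escape(url, quote=True)}">{label}</a>'
    return f"<sub>{label}</sub>"


def latex_subscript(p: float) -> str:
    """LaTeX form; the percent sign must be escaped inside math mode."""
    return "$_{" + format_percent(p).replace("%", r"\%") + "}$"


def pandoc_subscript(p: float) -> str:
    """Pandoc Markdown form, using tildes for subscript."""
    return f"~{format_percent(p)}~"
```

For example, `"It’ll rain tomorrow" + html_subscript(0.18)` yields `It’ll rain tomorrow<sub>18%</sub>`.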
Ideally one would simply use Unicode subscripts, which are available for all digits, but tragically not for the percentage sign ‘%’ or a simple dot ‘.’. Perhaps a project for the future: After all, they did include a subscript ‘+’₊, a subscript ‘-’₋, equality sign ‘=’₌ and parentheses ‘()’₍₎, but many subscript letters (b, c, d, f, g, q, w, y and z) are still missing…
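To make the limitation concrete, here is a small Python sketch that maps characters to their Unicode subscript forms and refuses anything without one. The function name `to_unicode_subscript` is hypothetical, and the table deliberately covers only the digits and the five symbols mentioned above.

```python
# Characters with a dedicated Unicode subscript form (digits and + - = ( ) only).
SUBSCRIPTS = str.maketrans("0123456789+-=()", "₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎")


def to_unicode_subscript(text: str) -> str:
    """Convert text to Unicode subscripts, failing loudly on unsupported characters."""
    unsupported = sorted({ch for ch in text if ord(ch) not in SUBSCRIPTS})
    if unsupported:
        raise ValueError(f"no Unicode subscript for: {''.join(unsupported)}")
    return text.translate(SUBSCRIPTS)
```

With this sketch, `to_unicode_subscript("18")` gives `₁₈`, while `"18%"` and `"0.18"` both raise a `ValueError`, which is exactly the gap described above.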
Applications
I’ve used this notation sparingly but increasingly; a good example of a first exploration is here.
Fischer 2023 uses a different notation:
Given hedonism and conditional on sentience, we think (credence: 0.7) that none of the vertebrate nonhuman animals of interest have a welfare range that’s more than double the size of any of the others. While carp and salmon have lower scores than pigs and chickens, we suspect that’s largely due to a lack of research.
Given hedonism and conditional on sentience, we think (credence: 0.65) that the welfare ranges of humans and the vertebrate animals of interest are within an order of magnitude of one another.
Given hedonism and conditional on sentience, we think (credence 0.6) that all the invertebrates of interest have welfare ranges within two orders of magnitude of the vertebrate nonhuman animals of interest. Invertebrates are so diverse and we know so little about them; hence, our caution.
The notation proposed here would change the text:
Given hedonism and conditional on sentience, we think that none of the vertebrate nonhuman animals of interest have a welfare range that’s more than double the size of any of the others. While carp and salmon have lower scores than pigs and chickens, we suspect that’s largely due to a lack of research.
Given hedonism and conditional on sentience, we think that the welfare ranges of humans and the vertebrate animals of interest are within an order of magnitude of one another.
Given hedonism and conditional on sentience, we think that all the invertebrates of interest have welfare ranges within two orders of magnitude of the vertebrate nonhuman animals of interest. Invertebrates are so diverse and we know so little about them; hence, our caution.
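A conversion like this can be automated. The following is a minimal Python sketch, assuming the credence is written as a parenthetical such as “(credence: 0.7)” or “(credence 0.6)” and that the subscript should take its place directly after the preceding word. The function name `credence_to_subscript` and the choice of HTML output are assumptions for illustration.

```python
import re

# Matches " (credence: 0.7)" and " (credence 0.6)", including the leading space.
CREDENCE = re.compile(r"\s*\(credence:?\s*([01](?:\.\d+)?)\)")


def credence_to_subscript(text: str) -> str:
    """Replace parenthetical credences with HTML subscript percentages."""

    def repl(match: re.Match) -> str:
        percent = round(float(match.group(1)) * 100)
        return f"<sub>{percent}%</sub>"

    return CREDENCE.sub(repl, text)
```

For instance, `credence_to_subscript("we think (credence: 0.7) that none")` returns `we think<sub>70%</sub> that none`. The sketch does not check that the number is a valid probability.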
It isn’t obvious to me why this is better than e.g. what at the end you quote Fischer 2023 as doing, which (1) feels to me less like a special convention that might need explaining and (2) works fine without needing to be able to write subscripts and without running into gotchas related to how subscripts are implemented (e.g., if you do them with Unicode subscripts then I think searching for “80%” will not find an “80%” subscript, because those are different characters).
What advantage do you see to using subscripts that outweighs those factors?
Browsers unify many characters for search purposes (or strip them out), but it looks like Unicode sub/superscripts are sometimes but not always considered equivalent. You can test this out in your own browser by going to https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block and doing C-f ‘2’ or ‘3’ or something. I get no hits in Firefox, but I do in Chromium. (And even if your browser does, what about all your other tools? Stuff like grep sure won’t treat them as equivalent without a lot of work. Or they will get stripped out, or turn into mojibake, or...) So, not something you can count on.
Aside from weirdness like that, I also think that the Unicode sub/superscript characters tend to look jarring and out-of-place. I don’t know if the fonts are bad, or they omit it & the fallback is bad, or if they are ‘typographically correct’ but we are so unfamiliar with ‘proper’ sub/superscript compared to HTML ones that they look wrong to us, or what. There are many places where Unicode works well for fancier typography, but between the omission of many letters*, breaking tools, and bad appearance, the Unicode sub/superscripts are a bad solution if you have anything better available.
I’d consider using them only if I was restricted to pure UTF-8 text, with nothing else. (For example, a link tooltip. Or maybe a machine-learning context where the model can’t handle HTML formatting.)
* Which you’d want for… a lot of things. For example, you could write with Unicode subscripts ‘Foo 2023a’, but not ‘Foo 2023b’. Because there’s a subscript ‘a’ but not a subscript ‘b’ (or ‘c’, or ‘d’). Yeah, I know. So if you absolutely insist on Unicode subscripts, now you need a new way to disambiguate, like ‘Foo 2023-1’ vs ‘Foo 2023-2’ or something.
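The search mismatch discussed in this thread is easy to reproduce outside a browser. Below is a minimal Python sketch: a plain substring search misses a Unicode-subscripted “80”, while searching after NFKC normalization (which folds subscript digits into ordinary digits) finds it. The function names and the sample sentence are hypothetical. This only illustrates the “lot of work” a tool would have to do, and is not a claim about how any particular browser or grep behaves.

```python
import unicodedata


def naive_find(needle: str, haystack: str) -> bool:
    """Plain substring search, comparable to an unmodified text search."""
    return needle in haystack


def normalized_find(needle: str, haystack: str) -> bool:
    """Substring search after NFKC-normalizing both strings."""

    def fold(s: str) -> str:
        return unicodedata.normalize("NFKC", s)

    return fold(needle) in fold(haystack)


# Hypothetical sentence using Unicode subscript digits followed by a normal '%'.
text = "It’ll rain tomorrow₈₀%."
found_naive = naive_find("80%", text)            # False: '₈₀' is not '80'
found_normalized = normalized_find("80%", text)  # True: NFKC maps '₈₀' to '80'
```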
I hadn’t thought about the issue with searching, that’s a pretty good counterargument. (I am not able to search for the probabilities in this document either, because the LaTeX isn’t searchable :-/)
Ultimately it comes down to an aesthetic preference for me: I will use these because they look kind of neat. But perhaps applying the reversal test to something like footnotes is interesting here: Imagine one was always writing “more specialized predators have bigger prey (see footnote 3)” instead of “more specialized predators have bigger prey³”. The latter is more compact, but not searchable.
Obviously there are switching costs associated with this. But perhaps the compactness that’s an advantage for footnotes is a similar advantage here, that’s why I’m trying it out.
That still has search problems! Consider: “see footnotes 3, 9, and 11–13”. How do you search for any of those 4 footnotes? The natural language approach is inherently ambiguous for such a hypertext problem which requires some formal support.
(The real solution there is footnote backlinks, like we have on Gwern.net: you can search for all references—site-wide, too—to a footnote by simply going to the footnote in question. If you’re not up to that, then a lightweight HTML approach would be to simply wrap each footnote number in a span and hide the text from display, but not search, so C-f ‘footnote 3’ would always hit the “prey³” construct.)
here’s the non-quantified meaning in terms of wh-movement from right to left:
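$$\text{S}:\ \underbrace{\underbrace{\text{I think}\ \xleftarrow{\text{what}}}_{\text{image: you thinking}}\ \underbrace{\text{it'll}\ \xleftarrow{\text{what}}\ \text{tomorrow}}_{\text{(circumfix time operator)}}\ \underbrace{\text{rain}}_{\text{image: rain}}}_{\text{image: rain inside your prospective thoughts}}$$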
for conlanging, i like<sub>95%</sub> this set of principles:
minimise total visual distance between operators and their arguments
minimise total novelty/complexity/size of all items the reader is forced to store in memory while parsing
every argument in memory shud find its operator asap, and vice versa
some items are fairly easy to store in memory (aka active context)
like the identity of the person writing this comment (me)
or the topic of the post i’m currently commenting on (clever ways to weave credences into language)
other items are fairly hard
often the case in mathy language, bc several complex-and-specific-and-novel items are defined at the outset, and those items are not given intuitive anaphora.
another way to use sentence-structure to offload memory-work is by writing hierarchical lists like this, so you can quickly switch gaze btn ii., c, and 2—allowing me to leverage the hierarchy anaphorically.
so to quantify sentence S, i prefer ur suggestion “I think<sub>55%</sub> it’ll rain tomorrow”. the percentage is supposed to modify “I think” anyway, so it makes more sense to make them adjacent. it’s just more work bc it’s novel syntax, but that’s temporary.
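$$\text{S}':\ \underbrace{\underbrace{\xleftarrow{\text{what}}\ 55\%}_{\text{(quantifier)}}\ \underbrace{\text{I think}\ \xleftarrow{\text{what}}}_{\text{image: you thinking}}\ \underbrace{\text{it'll}\ \xleftarrow{\text{what}}\ \text{tomorrow}}_{\text{(circumfix time operator)}}\ \underbrace{\text{rain}}_{\text{image: rain}}}_{\text{image: rain inside your tentative prospective thoughts}}$$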
otoh, if we’re specifying that subscripts are only used for credences anyway, there’s no reason for us to invoke the redundant “I think” image. instead, write
in fact, the whole circumfix operator is gratuitously verbose![1] just write:
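$$\text{S}'':\ \underbrace{\underbrace{\xleftarrow{\text{what}}\ 55\%}_{\text{image: you tentatively thinking}}\ \underbrace{\text{tomorrow}\ \xleftarrow{\text{what}}}_{\text{(time operator)}}\ \underbrace{\text{rain}}_{\text{image: rain}}}_{\text{image: rain inside your tentative prospective thoughts}}$$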
natlangs smh…[2]
tbh i wish we had an editor w optimised LaTeX keyboard shortcuts so we cud effortlessly use underbraces or undersets! wherever we fancy.[3]
additionally, we should just make much more use of the dimensionality afforded to us by the editors we have.
it’s free/cheap semiotic/semantic/syntactic real-estate.
The principles you propose make a lot of sense! Dropping “I think” or “My best guess” is then for the best<sub>80%</sub>.
Also, the underset/underbraces stuff is promising<sub>55%</sub> but too much to spend weirdness points on<sub>70%</sub>.
FWIW, I too worried that it would be just too weird, but I’ve been surprised how little pushback, criticism, or even notice has been taken of using subscripts for years on Gwern.net. I see more comments about my using fully-justified text!
did you know that, if you’re a hermit, you get infinite weirdness points?
✧*。ヾ( >﹏< )ノ゙✧*。
I think the goal of making communicating (un-)certainties cost less bandwidth is a worthy one, and I quite like this proposal. I think I would have understood exactly what it meant without an explanation, but by explaining first this post never gave me a chance to find out, which is a mild shame. Purely aesthetically, I would put a space between the last word of the sentence and the credence.
Hm, my aesthetics object to the space between the last word of the sentence and the credence as plenken—but then again, I’m also a fan of inordinately compact programming languages.
This is wonderful; feels much more friendly, practical, and conducive to ideal speech situations. If someone tries to attack me for a wrong probability, I can respond “I’m just talking but with additional clarity; no one is perfect.”