gwern
I think this framing conflates the question of input with that of presentation. The ‘e’ notation seems easiest to input—simple, unambiguous and reliable to parse, enterable everywhere—but it’s not a good one to read, because, if nothing else, it now looks like it’s multiplying variables & numbers, etc.
They don’t have to be the same. If numbers are written uniformly, they can be parsed & rendered differently.
For example, I think that one of the things that makes calculations or arguments hard to follow is that they shamelessly break human subitizing and intuitive numeracy by promiscuously mixing units, which makes it hard to do one of the most common things we do with numbers—compare them—while not really making anything easier.
In much the same way that people will sloppily quote dollar amounts from decades apart as if they were the same thing (which is why I automatically inflation-adjust them into current dollars), they will casually talk about “10 million” vs “20 billion”, imposing a burden of constant mental arithmetic as one tries to juggle back and forth between all of these different base units. Sure, decimal numbers or metric units may not be as bad as trying to convert hogsheads to long fathoms or swapping between binary and decimal, but it’s still not ideal.
It is no wonder that people are constantly off by orders of magnitude and embarrass themselves on social media when they turn out to be a factor of 10 off because they accidentally converted by 100 instead of 1,000, or they convert milligrams and grams wrong and poison themselves on film. If someone is complaining about the US federal government, which is immediately more understandable: “of $20 billion, $10 million was spent on engineering a space pen” or “of $20,000 million, $10 million was spent on a space pen”? (And this is an easy case, with about the most familiar possible units. As soon as it becomes something like milligrams and grams...)
I mean, imagine if this was normal practice with statistical graphs: “oh, the blue and red bar columns, even though they are the same size in the image and describe the same thing, dollars, are actually 10x different. Didn’t you see in the legend where it clearly says that ‘blue = 1; red = 10’?” “Er, OK, but if they’re the same sort of thing, then why are some blue and some the larger red?” “No reason. I just didn’t feel like multiplying the blue datapoints by 10 before graphing.” “...I see.”
So while it might look a little odd, I try to write with a single base-unit throughout a passage of writing, to enable immediate comparison. (I think this helps a lot with DL scaling too, because somehow when you talk about a model having ’50 million parameters’ and are comparing it to multi-billion parameter models like a “GPT-3-175b”, that seems a lot bigger than if you had written ‘0.05b parameters’. Or if you compare, say, a Gato with 1b parameters to a GPT-4 with 1,400b parameters, the comparison feels a lot more intuitive than if I had written ‘a GPT-4 with 1.4 trillion parameters’.)
This might seem too annoying for the author (although if it is, that should be a warning sign: if it’s hard for you, the author, to corral these units while writing them, how do you expect the reader to handle them?), but it could just be automated. Write all numbers in a standard format, whether it’s ‘10e2’ or ‘1,000’, and then a program can simply parse the text for numbers, take the first number, extract the largest base that makes it a single-digit number (“thousand”), and then rewrite all following numbers with that as the unit, formatted in your preferred format as ‘1 × 10³’ or whatever. (And you can, for HTML, make them copy-paste as regular full-length numbers through a similar trick as we do to provide the original LaTeX source for math formulas converted to HTML, so it can be fully compatible with copy-pasting into a REPL or other application.)
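As a rough illustration of what such a rewriting pass might look like (the regex, the unit table, and the output format here are all illustrative assumptions, not how any existing site actually does it):

```python
import re

# Sketch of the proposed pass: parse every number in a passage, pick one base unit
# from the first number, and re-render all the rest in that same unit.
UNITS = [(1e12, "trillion"), (1e9, "billion"), (1e6, "million"), (1e3, "thousand"), (1, "")]
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?(?:e\d+)?")

def parse(s: str) -> float:
    return float(s.replace(",", ""))

def pick_base(x: float) -> float:
    # Largest base that still leaves the first number with a small integer part.
    for base, _ in UNITS:
        if x >= base:
            return base
    return 1

def normalize(text: str) -> str:
    numbers = NUMBER.findall(text)
    if not numbers:
        return text
    base = pick_base(parse(numbers[0]))
    name = {b: n for b, n in UNITS}[base]
    def render(m: re.Match) -> str:
        return f"{parse(m.group()) / base:g} {name}".strip()
    return NUMBER.sub(render, text)

print(normalize("of $20,000,000,000, $10,000,000 was spent on a space pen"))
# -> 'of $20 billion, $0.01 billion was spent on a space pen'
```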
I can be deanonymized in other ways more easily.
I write these as warnings to other people who might think that it is still adequate to simply use a pseudonym, write exclusively in text, and avoid the obvious OPSEC mistakes, and that they can therefore safely write under multiple names. It is not adequate: within a few years, you will already have lost.
Regrettable as it is, if you wish to write anything online which might invite persecution over the next few years or lead activists/newspapers-of-record to try to dox you—if you are, say, blowing the whistle on a sophisticated megacorp with the most punitive NDAs & equity policies in the industry—you would be well-advised to start laundering your writings through an LLM yesterday, despite the deplorable effects on style. Truesight will only get keener and flense away more of the security-by-obscurity we so take for granted, because “attacks only get better”.
Yeah, that’s part of why I’m suspicious. I remember the original OA finetuning as being quite expensive, but the current one is not that expensive. If a GPT-3 is like 100GB of weights, say, after optimization, and it’s doing true finetuning, how is OA making it so cheap and so low-latency?
It is certainly big news if OA fine-tuning doesn’t work as it’s supposed to
The docs are pretty vague, but I notice that most of them are framed as being around declarative sorts of knowledge. It’s positioned as being a way to reduce the number of examples in the prompt (to save tokens & reduce latency), or include additional factual knowledge, like defining edge cases. There is one brief mention that you may be able to use it for “Performing a new skill or task that’s hard to articulate in a prompt”, but that’s about it.
And when it comes to lightweight finetuning such as LoRA, people tend to notice that they are good for adding new factual knowledge or increasing the prior of specific pre-existing knowledge, but don’t really add qualitatively new things—like you cannot simply LoRA your way to better hands in an image generator or teach it 3D generation if it didn’t already know that. So I’ve long been suspicious that OA isn’t doing real finetuning, of the entire model, but much cheaper underperforming LoRA-like lightweight finetuning (of the sort which can be easily stored on-GPU rather than loading an entire finetuned model or its delta from cloud storage, or tying up entire sets of GPUs to keep a full finetuned model hot).
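For concreteness, the kind of lightweight adapter in question looks something like this—the generic LoRA-style low-rank update on a frozen linear layer, not anything known about OA’s actual service:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A.

    Only A and B are trained, which is why adapters are cheap to store and swap
    on-GPU, but also why they mostly re-weight existing knowledge rather than
    teach qualitatively new skills.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # the full model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # ~65k trainable vs ~16.8M frozen
```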
One sanity check here would be to just make some 128k ctx window calls full of examples; if you cannot k-shot this capability even with that, then you shouldn’t expect “finetuning” to either; while if it works, that implies the “finetuning” is much worse than it ought to be and so the original results are uninformative.
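A sketch of that sanity check, using the OpenAI Python client as an example (the model name, prompt format, and `load_examples` helper are placeholders, not a claim about the original experiment):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def build_prompt(examples, query, max_chars=400_000):   # very roughly ~128k tokens
    shots, used = [], 0
    for inp, out in examples:
        shot = f"Input: {inp}\nOutput: {out}\n\n"
        if used + len(shot) > max_chars:
            break
        shots.append(shot)
        used += len(shot)
    return "".join(shots) + f"Input: {query}\nOutput:"

examples = load_examples()                  # placeholder: (input, target) pairs for the task
held_in, held_out = examples[:-1], examples[-1]
resp = client.chat.completions.create(
    model="gpt-4o",                         # any long-context model
    messages=[{"role": "user", "content": build_prompt(held_in, held_out[0])}],
)
print(resp.choices[0].message.content, "vs expected:", held_out[1])
```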
Oh, that seems easy enough. People might think that they are safe as long as they don’t write as much as I or Scott do under a few names, but that’s not true. If you have any writing samples at all, you just stick the list of them into a prompt and ask about similarity. Even if you have a lot of writing, context windows are now millions of tokens long, so you can stick an entire book (or three) of writing into a context window.
And remember, the longer the context window, the more that the ‘prompt’ is simply an inefficient form of pretraining, where you create the hidden state of an RNN for millions of timesteps, meta-learning the new task, and then throw it away. (Although note even there that Google has a new ‘caching’ feature which lets you run the same prompt multiple times, essentially reinventing caching RNN hidden states.) So when you stick corpuses into a long prompt, you are essentially pretraining the LLM some more, and making it as capable of identifying a new author as it is capable of already identifying ‘gwern’ or ‘Scott Alexander’.
So, you would simply do something like put in a list of (author, sample) as well as any additional metadata convenient like biographies, then ‘unknown sample’, and ask, ‘rank the authors by how likely they are to have written that final sample by an unknown author’.
This depends on having a short list of authors which can fit in the prompt (the shorter the samples, the more you can fit, but the worse the prediction), but it’s not hard to imagine how to generalize this to an entire list. You can think of it as a noisy sorting problem or a best-arm finding problem. Just break up your entire list of n authors into groups of m, and start running the identification prompt, which will not cost n log n prompts because you’re not sorting the entire list, you are only finding the min/max (which is roughly linear). For many purposes, it would be acceptable to pay a few dozen dollars to dox an author out of a list of a few thousand candidates.
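A sketch of that linear-time search; `most_likely_author` stands in for a single identification prompt of the kind described above:

```python
def most_likely_author(candidates, unknown_sample):
    """One prompt: 'which of these (author, sample) pairs most likely wrote the unknown sample?'"""
    raise NotImplementedError  # LLM call goes here

def tournament(candidates, unknown_sample, group_size=10):
    # Compare the unknown sample against small groups, keep each group's winner,
    # and recurse on the winners: best-arm finding, not a full sort.
    while len(candidates) > 1:
        winners = []
        for i in range(0, len(candidates), group_size):
            group = candidates[i:i + group_size]
            winners.append(most_likely_author(group, unknown_sample))
        candidates = winners
    return candidates[0]

# For n = 5,000 candidates and groups of 10: ~500 + 50 + 5 + 1 ≈ 556 prompts, linear in n.
```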
djb admonishes us to always ask about amortization and economies of scale in attacks, and that of course holds here for stylometric attacks too. If we simply do the obvious lazy sort, we are throwing away all of the useful similarity information that the LLM could be giving us. We could instead work on embedding authors by similarity using comparisons. We could, say, input 3 authors at a time, and ask “is author #1 more similar to #2, or #3?” Handwaving the details, you can then take a large set of similarity rankings, and infer an embedding which maximizes the distance between each author while still obeying the constraints. (Using expectation maximization or maybe an integer solver, idk.) Now you can efficiently look up any new author as a sort of nearest-neighbors lookup problem by running relatively few comparison prompts and homing in on the set of author-points the new author is nearest, and use that small set for a final direct question.
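One concrete way to turn those triplet judgments into an embedding—a standard ordinal-embedding fit by gradient descent, rather than the EM/integer-solver approaches handwaved above—might look like:

```python
import torch

# triplets[i] = (a, b, c): the LLM judged author a more similar to author b than to author c.
def fit_embedding(triplets, n_authors, dim=16, steps=2000, margin=1.0):
    X = torch.randn(n_authors, dim, requires_grad=True)
    opt = torch.optim.Adam([X], lr=0.05)
    t = torch.tensor(triplets)
    a, b, c = t[:, 0], t[:, 1], t[:, 2]
    for _ in range(steps):
        d_ab = (X[a] - X[b]).pow(2).sum(-1)
        d_ac = (X[a] - X[c]).pow(2).sum(-1)
        loss = torch.relu(margin + d_ab - d_ac).mean()   # want d(a,b) < d(a,c) by a margin
        opt.zero_grad(); loss.backward(); opt.step()
    return X.detach()   # author-points; new authors get located by a handful of fresh comparisons
```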
(All this assumes you are trying to leverage a SOTA LLM which isn’t directly accessible. If you use an off-the-shelf LLM like a LLaMA-3, you would probably do something more direct like train a triplet loss on the frozen LLM using large text corpuses and get embeddings directly, making k-NN lookups effectively free & instantaneous. In conclusion, text anonymity will soon be as dead as face anonymity.)
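(For the open-weights route, a minimal sketch: a small trainable projection head over frozen LLM embeddings, trained with a triplet-margin loss, so that author lookup reduces to nearest neighbors. `frozen_embed` is a placeholder for however you pool the hidden states.)

```python
import torch
import torch.nn as nn

def frozen_embed(texts):                 # placeholder: mean-pooled hidden states of a frozen LLM
    raise NotImplementedError

head = nn.Linear(4096, 256)              # small trainable projection over frozen features
loss_fn = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

def train_step(anchor_texts, positive_texts, negative_texts):
    # anchor & positive by the same author, negative by a different one
    a, p, n = (head(frozen_embed(t)) for t in (anchor_texts, positive_texts, negative_texts))
    loss = loss_fn(a, p, n)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def nearest_author(unknown_text, author_vectors, author_names):
    q = head(frozen_embed([unknown_text]))
    dists = torch.cdist(q, author_vectors)        # k-NN lookup is just a distance matrix
    return author_names[dists.argmin().item()]
```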
This seems a bit odd given past literature on LLMs. As I’ve noted before, you can do inner-monologue problems specifically via knowledge-distillation somewhat analogous to your finetuning, and it’s also possible to ask models to solve multiple problems simultaneously analogous to your base task (or do various kinds of speculative or parallelized decoding at a lower level). There is enormous computational waste and slack, and capacity to spare for multiple problems. So it not working for the OA “finetuning” of GPT-3.5 is unexpected: I can’t think of any previous results aimed at making forward passes do more which failed completely (although ofc maybe they just don’t get reported or I didn’t happen to read them etc).
I notice this is not the first time I’ve left a puzzled comment on a post where the authors failed to make GPT-3.5 do something via OA “finetuning” that it seemed like it definitely should have been capable of after finetuning or which non-OA models did do… And the common ingredient seems like the OA “finetuning”.
I’m not aware of any experiments by third parties demonstrating that OA “finetuning” works like it’s supposed to or investigating what it seems to do, and AFAIK OA still declines to explain what their “finetuning” services & models do or are. Maybe someone should do that before more people try to do AI safety research predicated on the assumption that using OA’s “finetuning” is telling you anything meaningful about LLMs in general, rather than being like, say, trying to understand LLM poetry by looking at ChatGPT’s rhymes or LLM linguistic knowledge by asking one to spell words.
Yes, I’ve never had any difficulty replicating the gwern identification: https://chatgpt.com/share/0638f916-2f75-4d15-8f85-7439b373c23c It also does Scott Alexander: https://chatgpt.com/share/298685e4-d680-43f9-81cb-b67de5305d53 https://chatgpt.com/share/91f6c5b8-a0a4-498c-a57b-8b2780bc1340 (Examples from sinity just today, but they parallel all of the past ones I’ve done: sometimes it’ll balk a little at making a guess or identifying someone, but that’s usually not hard to overcome.)
One interesting thing is that the extensive reasoning it gives may not be faithful. Notice that in identifying Scott Alexander’s recent Reddit comment, it gets his username wrong—that username does not exist at all. (I initially speculated that it was using retrieval since OA & Reddit have struck a deal; but obviously, if it had, or had been trained on the actual comment, it would at least get the username right.) And in my popups comment, I see no mention that points to LessWrong, but since I was lazy and didn’t copyedit that comment, it is much more idiosyncratic than usual; so what I think ChatGPT-4o does there is immediately deduce that it’s me from the writing style & content, infer that it could not be a tweet due to length or a Gwern.net quote because it is clearly a comment on social media responding to someone, and then guesses it’s LW rather than HN, and presto.
-
Age is extremely compressed/skewed because it’s OKCupid. So I can think of a couple issues there: there might be a problem of distribution mismatch, where a GPT is trained on a much more even distribution of text (I would assume tons of text is written by people aged 50–100 IRL, rather than by young techies on a dating website) and so is simply taking into account a very different base rate; another issue is that maybe the GPT is accurate but restriction of range creates misleading statistical artifacts. Binarization wouldn’t help, and might worsen matters given the actual binarization here at age 30—how many people tweak their age on a dating site to avoid the dreaded leading ‘3’ and turning into Christmas cake? You’ll remember OKCupid’s posts about people shading the truth a little about things like height… (A more continuous loss like median absolute error might be a better metric than Brier on a binary or categorical.)
As far as sexuality goes, this is something the LLMs may be trained very heavily on, with unpredictable effects. But it’s also a much weirder category here too:
Dating sites in general have more males than females, reflecting the mating behavior seen offline (more males being on the lookout). OKCupid features a very broad selection of possible genders. One must choose at least one category and up to 5 categories of which the possible options are: Man, Woman, Agender, Androgynous, Bigender, Cis Man, Cis Woman, Genderfluid, Genderqueer, Gender Nonconforming, Hijra, Intersex, Non-binary, Other, Pangender, Transfeminine, Transgender, Transmasculine, Transsexual, Trans Man, Trans Women and Two Spirit. Nevertheless, almost everybody chooses one of the first two (39.1 % Women, 60.6 % Men, binary total = 99.7 %). The full count by type can be found in the supplementary materials sheet “Genders”.
I’m not sure how OP handled that. So the predictive power here should be considered as a loose lower bound, given all the potential sources of measurement error/noise.
-
I didn’t know it meant either.
It seems like it was a big commitment, because there were several hints during the reporting on the OpenAI coup that Superalignment was not getting its promised compute quota as OA ran very short on compute in 2023, creating major internal stress (particularly from Sam Altman telling different people different things or assigning the same job to several of them), and that was one of the reasons for Altman sidelining Ilya Sutskever in favor of Jakub Pachocki. What sounded good initially & everyone loved turned out to be a bit painful to actually deliver on. (Sort of like designing the OA LLC so the OA nonprofit board could fire the CEO.)
EDIT: speak of the devil: https://x.com/janleike/status/1791498178346549382 Note Leike has to be very cautious in his wording.
I’m not sure how to square those results with the Chinchilla paper though
Apples and oranges. The Chinchilla paper simply optimizes the final trained model’s loss given a fixed compute budget. It doesn’t say anything about any downstream uses—similar to how it doesn’t tell you (directly) how you should allocate your compute if you have X GPUs and want to serve a model to your users for Y requests, where you face a tradeoff between spending more GPUs at training time to create a smaller (overtrained) model which then needs fewer GPUs to serve those Y requests. Likewise, you’ve probably seen some “overtraining” analyses which argue that you should overtrain a Chinchilla by some large amount Z to get the model which best balances training vs running—but those also answer a different question, because they assume that you will deploy that Chinchilla model without any sparsification or lower precision, even though that’s hardly what anyone actually does.
(While no one has done a Li et al-style analysis for MoEs that I know of, I would expect the results to be fairly similar, just shifted up/down, because you can often think of a MoE as a bunch of smaller dense models.)
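For concreteness, the question Chinchilla answers is just: minimize the parametric loss below subject to a fixed training-compute budget, with inference cost nowhere in the objective (constants are roughly the paper’s reported fit):

```latex
% Chinchilla's parametric loss (Hoffmann et al. 2022), constants approximately the paper's fit:
\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406,\; B \approx 411,\; \alpha \approx 0.34,\; \beta \approx 0.28
% Minimizing subject to a fixed training-compute budget C \approx 6ND gives
N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b}, \qquad a \approx b \approx 0.5
% i.e. scale parameters & tokens together (roughly ~20 tokens per parameter). Nothing in this
% objective prices in serving cost, sparsification, or quantization after training.
```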
Yup. Who knows but we are all part of a giant leave-one-out cross-validation computing counterfactual credit assignment on human history? Schmidhuber-em will be crushed by the results.
But this is also right after GPT-4o, which, like Sora not that long ago, is a major triumph of the Sutskeverian vision of just scaling up sequence prediction for everything, and which OA has been researching for years (at least since CLIP/DALL-E 1, and possibly this effort for 2–3 years as ‘Gobi’). I don’t find it so hard to believe that he held off until Sora & GPT-4o were out. These are the achievements of not just his lifetime, but hundreds of other people’s lives (look at the contributor list). He’s not going to quit before that. (Especially since by all accounts he’s been gone the entire time, so what’s a few more days or weeks of silence?)
Is there a particular reason to think that he would have had an exactly 6-month notice from the vote to remove Altman? And why would he have submitted notice then, exactly? The logical day to submit your quitting notice would be when the investigation report was submitted and was a complete Altman victory, which was not 6 months ago.
One of the interesting thing about AI minds (such as LLMs) is that in theory, you can turn many topics into testable science while avoiding the ‘problem of old evidence’, because you can now construct artificial minds and mold them like putty. They know what you want them to know, and so you can see what they would predict in the absence of knowledge, or you can install in them false beliefs to test out counterfactual intellectual histories, or you can expose them to real evidence in different orders to measure biases or path dependency in reasoning.
With humans, you can’t do that because they are so uncontrolled: even if someone says they didn’t know about a crucial piece of evidence X, there is no way for them to prove that, and they may be honestly mistaken and have already read about X and forgotten it (but humans never really forget, so X has already changed their “priors”, leading to double-counting), or there is leakage. And you can’t get people to really believe things at the drop of a hat, so you can’t make people imagine, “suppose Napoleon had won Waterloo, how do you predict history would have changed?”—because no matter how you try to participate in the spirit of the exercise, you always know that Napoleon lost and you have various opinions on that contaminating your retrodictions, and even if you have never read a single book or paper on Napoleon, you are still contaminated by expressions like “his Waterloo” (‘Hm, the general in this imaginary story is going to fight at someplace called Waterloo? Bad vibes. I think he’s gonna lose.’)
But with a LLM, say, you could simply train it with all timestamped texts up to Waterloo, like all surviving newspapers, and then simply have one version generate a bunch of texts about how ‘Napoleon won Waterloo’, train the other version on these definitely-totally-real French newspaper reports about his stunning victory over the monarchist invaders, and then ask it to make forecasts about Europe.
Similarly, you can do ‘deep exploration’ of claims that human researchers struggle to take seriously. It is a common trope in stories of breakthroughs, particularly in math, that someone got stuck for a long time proving X is true and one day decides on a whim to try to instead prove X is false and does so in hours; this would never happen with LLMs, because you would simply have a search process which tries both equally. This can take an extreme form for really difficult outstanding problems: if a problem like the continuum hypothesis defies all efforts, you could spin up 1000 von Neumann AGIs which have been brainwashed into believing it is false, and then a parallel effort by 1000 brainwashed to believing it is as true as 2+2=4, and let them pursue their research agenda for subjective centuries, and then bring them together to see what important new results they find and how they tear apart the hated enemies’ work, for seeding the next iteration.
(These are the sorts of experiments which are why one might wind up running tons of ‘ancestor simulations’… There’s many more reasons to be simulating past minds than simply very fancy versions of playing The Sims. Perhaps we are now just distant LLM personae being tested about reasoning about the Singularity in one particular scenario involving deep learning counterfactuals, where DL worked, although in the real reality it was Bayesian program synthesis & search.)
Seems like this might work better as a control variate sort of idea, with the noise added to the environment rather than the model itself, to measure whether the model can steer its way back to its intended measured-capability level. If you add a lot of noise to an environment, then an agent which is trying to pessimize and achieve a certain lower performance will have to overperform in order to ‘merely’ be bad; then you can subtract the noise which you added, and estimate its true latent performance in the original non-noisy environment. If it’s been deliberately taking a dive, then it’ll do better than non-sandbagging agents do when evaluated in the new noisy environment.
(Imagine, say, playing chess, where a chess agent is deliberately aiming to have a rating of only, say, 1500 Elo where its true Elo is 3000. You evaluate it normally, and it scores ~1500, carefully playing in a deniable way so that it imitates the distribution of moves an Elo 1500 player has, eg. like a Decision Transformer. But you then add in occasional stochastic errors to its chosen move, as if its hand “trembled”; it observes that it’s behind and keeps ramping up its move quality in order to hit its 1500 target. Eventually, you add so much noise that even playing in its final form, it’s too far behind and it starts getting Elos like 1400. You do this same evaluation with a non-deceptive version of that same model, which is honestly only 1500 Elo, and its performance degrades smoothly the instant you add in any noise at all, because it’s not trained for that sort of noise and has no performance reserves to draw on—just like a human chess player would get worse if you periodically reached in & screwed with them by unexpectedly randomizing one of their moves.)
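A toy model of the idea, in Elo-like units; the single ‘noise penalty’ number abstracts away the actual tremble probability, and all the numbers are purely illustrative:

```python
# `noise_penalty` is how much the injected environment noise would degrade an agent
# that does not compensate for it at all.
def displayed_skill(true_skill, target_skill, noise_penalty):
    # A sandbagger plays just well enough to still *look* like `target_skill`,
    # but it cannot play above its true ceiling.
    needed = target_skill + noise_penalty
    return max(min(needed, true_skill) - noise_penalty, 0)

for noise in [0, 200, 400, 800, 1200, 1600]:
    honest  = displayed_skill(true_skill=1500, target_skill=1500, noise_penalty=noise)
    sandbag = displayed_skill(true_skill=3000, target_skill=1500, noise_penalty=noise)
    print(f"noise={noise:4d}  honest={honest:5.0f}  sandbagger={sandbag:5.0f}")

# The honest 1500-Elo agent degrades as soon as any noise is added; the 3000-Elo
# sandbagger holds a flat 1500 until the noise exhausts its ~1500-point reserve.
```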
To avoid this, it would need to be so situated and so knowledgeable about other agents that it knows it’s being tested in this anti-sandbagging way and also which agents it’s being compared against and how badly they do in order to know how much of an additional dive to take. “With this level of noise, an honest 1500 Elo version of me would actually be 1300 Elo, and I need to let myself lose exactly that much more, but neither more nor less.” And that may be quite difficult: an agent may be able to figure out easily it’s being tested on a weirdo environment (the way Claude-3 increasingly seems situated when you give it weird prompts or tasks), but how does it know what other agents it’s being compared to this time and what their performance in the new environment is?
I have to say, I still don’t understand the cult of Roam or why people were so impressed by, eg. the [[link]] syntax borrowed from English Wikipedia (which introduced it something like 18 years before, on what is still the most widely-read & edited wiki software in history), which you remark on repeatedly. Even in 2019 in beta it just seemed like a personal wiki, not much different from, say, PmWiki (2002), with some more emphasis than usual on the common backlink or ‘reverse citation’ functionality (that so many hypertext systems had supported going back decades in parallel with Xanadu ideas). It may be nicer than, say, English Wikipedia’s “WhatLinksHere” (which has been there since before I began using it early in the 2000s), but nothing to create a social-media cult over or sell “courses” about (!).
But if the bubble has burst, it’s not hard to see why: any note-taking, personal knowledge management, or personal wiki system is inherently limited by the fact that they require a lot of work for what is, for most people, little gain. For most people, trying to track all of this stuff is as useful as exact itemized grocery store receipts from 5 years ago.
Most people simply have no need for lots of half-formed ideas, random lists of research papers, and so on. This is what people always miss about Zettelkasten: are you writing a book? Are you a historian or German scholar? Do you publish a dozen papers a year? No? Then why do you think you need a Zettelkasten? If you are going to be pulling out a decent chunk of those references for an essay or something, possibly decades from now, then it can be worth the upfront cost of entering references into your system, knowing that you’ll never use most of them and the benefit is mostly from the long tail, and you will, in the natural course of usage, periodically look over them to foster serendipity & creativity; if you aren’t writing all that, then there’s no long tail, no real benefit, no intrinsic review & serendipity, and it’s just a massive time & energy sink. Eventually, the user abandons it… and their life gets better.
Further, these systems are inherently passive, and force people to become secretaries, typists, reference librarians, archivists, & writers simply to keep it from rotting (quite aside from any mere software issue), to keep it up to date, revise tenses or references, fix spelling errors, deal with link rot, and so on. (Surprisingly, most people do not find that enjoyable.) There is no intelligence in such systems, and they don’t do anything. The user still has to do all the thinking, and it adds on a lot of thinking overhead.
So what comes after Roam and other personal systems which force the user to do all the thinking? I should think that would be obvious: systems which can think for the user instead. LLMs and other contemporary AI are wildly underused in the personal system space right now, and can potentially fix a lot of these issues, through approaches like actively surfacing connections instead of passively waiting for the user to make them on their own and manually record them, and can proactively suggest edits & updates & fixes that the user simply approves in batches. (Think of how much easier it is to copyedit a document using a spellcheck as a series of Y/N semi-automatic edits, than to go through it by eye, fixing typos.)
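(A minimal sketch of the ‘actively surfacing connections’ pattern: embed every note, and for each note propose its nearest unlinked neighbors for batch approval. `embed` is a placeholder for any text-embedding model.)

```python
import numpy as np

def embed(texts):                        # placeholder for any text-embedding model
    raise NotImplementedError

def suggest_links(notes, existing_links, k=3):
    """Propose the k nearest unlinked notes for each note, for batch Y/N approval."""
    titles = list(notes)
    vecs = embed([notes[t] for t in titles])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T                 # cosine similarity between all notes
    suggestions = {}
    for i, t in enumerate(titles):
        order = np.argsort(-sims[i])
        picks = [titles[j] for j in order
                 if titles[j] != t and titles[j] not in existing_links.get(t, ())][:k]
        suggestions[t] = picks
    return suggestions
```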
However, like most such paradigm shifts, it will be hard to tack it onto existing systems. You can’t reap the full benefits of LLMs with some tweaks like ‘let’s embed documents and add a little retrieval pane!’. You need to rethink the entire system and rewrite it from the ground up on the basis of making neural nets do as much as possible, to figure out the new capabilities and design patterns, and what to drop from the old obsolete personal wikis like Roam.
From what it sounds like, the Roam community would never stand for that, and I have a lot of doubts about whether it makes sense economically to try. It seems like if one wanted to do that, it would be better to start with a clean sheet (and an empty cap table).
For extremely opinionated ‘tags’ like that, where their very existence is probably going too far, maybe users should be encouraged to simply use a comment in their Short Form to list URLs? Since looking up one’s Short Form comments to edit in a URL is annoying, some UX could be slapped on top for convenience: a widget on every page for “add to Personal List [A / B / C / D]”, where a ‘Personal List’ is just a Short Form comment starting with a phrase “A” followed by a list of links which gets auto-edited to append the next one.
(For less inflammatory ones, I think my personalized-wiki hybrid proposal works fine by clearly subordinating the user comments ‘replying’ to the tag and indicating responsibility & non-community-endorsement.)
Those are not randomly selected pairs, however. There are 3 major causal patterns: A->B, A<-B, and A<-C->B. Daecaneus is pointing out that for a random pair of correlations of some variables, we do not assign a uniform prior of 33% to each of these. While it may sound crazy to try to argue for some specific prior like ‘we should assign 1% to the direct causal patterns of A->B and A<-B, and 99% to the confounding pattern of A<-C->B’, this is a lot closer to the truth than thinking that ‘a third of the time, A causes B; a third of the time, B causes A; and the other third of the time, it’s just some confounder’.
What would be relevant there is “Everything is Correlated”. If you look at, say, Meehl’s examples of correlations from very large datasets, and ask about causality, I think it becomes clearer. Let’s take one of his first examples:
For example, only children are nearly twice as likely to be Presbyterian than Baptist in Minnesota, more than half of the Episcopalians “usually like school” but only 45% of Lutherans do, 55% of Presbyterians feel that their grades reflect their abilities as compared to only 47% of Episcopalians, and Episcopalians are more likely to be male whereas Baptists are more likely to be female.
Like, if you randomly assigned Baptist children to be converted to Presbyterianism, it seems unlikely that their school-liking will suddenly jump because they go somewhere else on Sunday, or that siblings will appear & vanish; it also seems unlikely that if they start liking school (maybe because of a nicer principal), that many of those children would spontaneously convert to Presbyterianism. Similarly, it seems rather unlikely that undergoing sexual-reassignment surgery will make Episcopalian men and Baptist women swap places, and it seems even more unlikely that their religious status caused their gender at conception. In all of these 5 cases, we are pretty sure that we can rule out one of the direct patterns, and that it was probably the third, and we could go through the rest of Meehl’s examples. (Indeed, this turns out to be a bad example because we can apply our knowledge that sex must have come many years before any other variable like “has cold hands” or “likes poetry” to rule out one pattern, but even so, we still don’t find any 50%s: it’s usually pretty obviously direct causation from the temporally earlier variable, or confounding, or both.)
So what I am doing in ‘How Often Does Correlation=Causality?’ is testing the claim that “yes, of course it would be absurd to take pairs of arbitrary variables and calculate their causal patterns for prior probabilities, because yeah, it would be low, maybe approaching 0 - but that’s irrelevant because that’s not what you or I are discussing when we discuss things like medicine. We’re discussing the good correlations, for interventions which have been filtered through the scientific process. All of the interventions we are discussing are clearly plausible and do not require time travel machines, usually have mechanisms proposed, have survived sophisticated statistical analysis which often controls for covariates or confounders, are regarded as credible by highly sophisticated credentialed experts like doctors or researchers with centuries of experience, and may even have had quasi-randomized or other kinds of experimental evidence; surely we can repose at least, say, 90% credibility, by the time that some drug or surgery or educational program has gotten that far and we’re reading about it in our favorite newspaper or blog? Being wrong 1 in 10 times would be painful, but it certainly doesn’t justify the sort of corrosive epistemological nihilism you seem to be espousing.”
But unfortunately, it seems that the error rate, after everything we humans can collectively do, is still a lot higher than 1 in 10 before the randomized version gets run. (Which implies that the scientific evidence is not very good in terms of providing enough Bayesian evidence to promote the hypothesis from <1% to >90%, or that it’s <<1% because causality is that rare.)
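To put a number on that parenthetical: promoting a hypothesis from a 1% prior to a 90% posterior requires a Bayes factor of nearly 900:

```latex
% Posterior odds = Bayes factor × prior odds, so to go from P(H) = 0.01 to P(H \mid E) = 0.9:
\frac{P(H \mid E)}{P(\neg H \mid E)} = \mathrm{BF} \cdot \frac{P(H)}{P(\neg H)}
\quad\Longrightarrow\quad
\mathrm{BF}_{\text{needed}} = \frac{0.9 / 0.1}{0.01 / 0.99} = 9 \times 99 = 891
```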
Classic ‘typical mind’ like experience: https://www.lesswrong.com/posts/baTWMegR42PAsH9qJ/generalizing-from-one-example
The embedding approach would let you pick particular authors to measure distance to and normalize, and I suppose that’s something like a “X% Hemingway, Y% Steinbeck”...
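Concretely, that might just be cosine similarity to each reference author’s embedding, normalized into percentages; the temperature and the `embed` placeholder below are arbitrary choices:

```python
import numpy as np

def embed(texts):                        # placeholder for any text-embedding model
    raise NotImplementedError

def style_mixture(unknown_text, reference_texts):   # reference_texts: {"Hemingway": "...", ...}
    names = list(reference_texts)
    refs = embed([reference_texts[n] for n in names])
    q = embed([unknown_text])[0]
    sims = refs @ q / (np.linalg.norm(refs, axis=1) * np.linalg.norm(q))
    weights = np.exp(sims / 0.1)                     # softmax with an arbitrary temperature
    weights /= weights.sum()
    return dict(zip(names, (weights * 100).round(1)))   # eg. {'Hemingway': 20.0, 'Steinbeck': 40.0, ...}
```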
Although I think the bigger problem is, what does that even mean and why do you care? Why would you care if it was 20% Hemingway / 40% Steinbeck, rather than vice-versa, or equal, if you do not care about whether it is actually by Hemingway?
I don’t think that’s true, particularly in a politics/law enforcement context. Many people now have writings on social media. The ones who do not can just be subpoenaed for their text or email histories; in the US, for example, you have basically zero privacy rights in those and no warrant is necessary to order Google to turn over all your emails. There is hardly anyone who matters who doesn’t have at least thousands of words accessible somewhere.