gwern
But the caveat there is that this is inherently a backwards-looking result:
We consider GPT-4o (OpenAI, 2024), Claude-3.5-Sonnet (Anthropic, 2024), Grok-2 (xAI, 2024), Gemini-1.5-Pro (Google, 2024), and DeepSeek-V3 (DeepSeek-AI, 2024).
So one way to put it would be that people & classifiers are good at detecting mid-2024-era chatbot prose. Unfortunately, at some point after that, at least OpenAI and Google apparently began to target the problem of ChatGPTese (possibly for different reasons: Altman’s push into consumer companion-bots/personalization/social-networking, and Google just mostly ignoring RLHF in favor of capabilities), and the chatbot style seems to have improved substantially. Even the current GPT-4o doesn’t sound nearly as 4o-like as it did just back in November 2024. Since mode-collapse/ChatGPTese was never a capabilities problem per se (just look at GPT-3!), but mostly just neglect/apathy on the part of the foundation labs (as I’ve been pointing out since the beginning), it’s not a surprise that it could improve rapidly once they put any effort into fixing it.
Between the continued rapid increase in capabilities, some attention finally being paid to esthetics & prose style, and attackers slowly improving their infrastructure in the obvious ways, I expect that over the course of 2025, detecting prose from a SOTA model is going to get much more difficult. (And this excludes the cumulative effect of humans increasingly writing like ChatGPT.)
I’m not sure this is a big problem. How much net attrition do you really expect over a decade, say? By which point who really cares? You will have so much more AI progress, and accumulated data (particularly if you’ve been gradually replacing the lower-level employees and you have an ‘automation wave’ moving through the organization where employees increasingly train their automated replacements or their job is simply reorganizing the jobs to enable automation).
It seems like to the extent there’s much attrition at high levels, it is reduced in considerable part by these very dynamics: as returns to high-level human labor go up, presumably, there is less attrition from voluntary retirement or leisure consumption (and if the returns go down, then that implies that there is no ‘shortage’ of people for such high-level positions and so no problem); and also as the remaining human work becomes more ‘white-collar’ and based on difficult-for-AI things like reputation or experience or ownership or creativity, aging or opportunity costs begin to matter less, reducing another source of attrition.
(Even if AI or robotics is unable to do the ‘core’ of a job, they can help deal with various obstacles which might prevent a human from doing the job. An elderly manager who might decide to retire in part because they are low-key becoming worried about safely driving to/from the office will no longer think about that when they have a self-driving car or remote working becomes ever more feasible; older managers who might be slipping in their grasp of details or who have ‘senior moments’ will be able to rely on AI secretaries to catch those or just pause stuff for a while until they’re back to normal; elite women might invest more in careers if they have Claude-bot as a trustworthy nanny and chauffeur, etc. One is reminded of President Biden: his staffers were able to work around his issues by doing things like rescheduling or canceling events to avoid exposing him publicly when he was bad; it was only an event that even the POTUS can’t arbitrarily schedule, a presidential debate, that punctured the carefully-constructed illusion. Few of those staffers were qualified to be President of the United States, and yet, you don’t have to be a good president to observe “sounds like Joe’s having a bad day today” and quietly cancel his evening appointments for him so he can get to bed early.)
Also notable: the big OpenAI reveal today was some sort of better personalization. Instead of the crude ‘saved facts’ personalization ChatGPT has had for a long time and which has never made much of a difference, they’re doing… something. Unclear if it’s merely RAG or if they are also doing something interesting like lightweight finetuning. But the GPTs definitely seem to have much better access to your other sessions in the web interface, and as far as I know, few other interfaces with frontier models have tried to do much personalization, so this will be an interesting real-world test at scale about how much simple personalization can help with LLMs (similar to Midjourney’s relatively new personalization feature, which I get a lot out of).
I don’t think this is true at all. How do you translate, say, rotating multiple shapes in parallel into text?
At least for multimodal LLMs in the pure-token approach like Gato or DALL-E 1 (and probably GPT-4o and Gemini, although few details have been published), you would be able to do that by generating the tokens which embody an encoded image (or video!) of several shapes, well, rotating in parallel. Then you just look at them.
Pursuit of novelty is not VNM-incoherent. Furthermore, it is an instrumentally convergent drive; power-seeking agents will seek novelty as well, because learning increases power in expectation (see: value of information).
Or to put it another way: any argument which convincingly proves that ‘incoherent search processes ultimately outcompete coherent search processes’ is also an argument which convinces a VNM agent to harness the superior incoherent search processes instead of the inferior coherent ones.
It’s good someone else did it, but it has the same problems as the paper: not updated since May 2024, and limited to open-source base models. So it needs to be started back up, with approximate estimators added for the API/chatbot models too, before it can start providing a good universal capability benchmark in near-realtime.
One of the most robust benchmarks of generalized ability, which is extremely easy to update (unlike benchmarks like Humanity’s Last Exam), would just be to estimate the pretraining loss (ie. the compression ratio). It’s a very consistent finding for many years now that pretraining loss is just about the best estimate there is of capability. It is also intrinsically robust to overfitting or benchmark cheating, you can automatically collect huge amounts of new data, and it’s hard to sandbag (indeed, how could a model know if it’s in pretraining or testing if you are simply handing it a new datapoint scraped from the Internet using the same pipeline and measuring a log-prob?).
I’m a little surprised that we don’t already have a continual compression benchmark anywhere. It seems a lot easier to do than most of the benchmarks out there. And while you may object that the most important LLMs to benchmark either don’t provide log-probs or the log-probs are meaningless, there are multiple ways to elicit token predictions which would let you estimate log-probs, and then you just need to sample enough to estimate the net BPC to satisfactory precision: https://www.reddit.com/r/mlscaling/comments/1ju1q2e/compression_represents_intelligence_linearly/mlyypmx/
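To make the mechanics concrete, here is a minimal sketch of one iteration of such a continual compression benchmark, assuming a local open-weights model that exposes log-probs (the model name and the ‘freshly scraped’ document below are placeholders; for API models without log-probs, you would substitute the sampling-based estimators discussed in the linked thread):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in: swap in whatever open-weights model you want to benchmark
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def bits_per_character(text: str) -> float:
    """Estimate BPC: total negative log2-likelihood of the text divided by its character count."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, HF returns the mean cross-entropy (nats per predicted token).
        nats_per_token = model(ids, labels=ids).loss.item()
    total_bits = nats_per_token * (ids.shape[1] - 1) / math.log(2)
    return total_bits / len(text)

# Feed it documents scraped today (ideally published after the model's training cutoff)
# and track the running average: lower BPC = better compression = more capable model.
fresh_document = "Text scraped from the Internet this week would go here..."
print(f"{bits_per_character(fresh_document):.3f} bits/char")
```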
No, it would probably be a mix of “all of the above”. FB is buying data from the same places everyone else does, like Scale (which we know from anecdotes like when Scale delivered FB a bunch of blatantly-ChatGPT-written ‘human rating data’ and FB was displeased), and was using datasets like books3 that are reasonable quality. The reported hardware efficiency numbers have never been impressive, they haven’t really innovated in architecture or training method (even the co-distillation for Llama-4 is not new, eg. ERNIE was doing that like 3 years ago), and insider rumors/gossip don’t indicate good things about the quality of the research culture. (It’s a stark contrast to things like Jeff Dean overseeing a big overhaul to ensure bit-identical reproducibility of runs and Google apparently getting multi-datacenter training working by emphasizing TPU interconnect.) So my guess is that if it’s bad, it’s not any one single thing like ‘we trained for too few tokens’ or ‘some of our purchased data was shite’: it’s just everything in the pipeline being a bit mediocre and it multiplying out to a bad end-product which is less than the sum of its parts.

Remember Karpathy’s warning: “neural nets want to work”. You can screw things up and the neural nets will still work, they will just be 1% worse than they should be. If you don’t have a research culture which is rigorous about methodology or where people just have good enough taste/intuition to always do the right thing, you’ll settle for whatever seems to work… (Especially if you are not going above and beyond to ensure your metrics aren’t fooling yourself.) Now have a 1% penalty on everything, from architecture to compute throughput to data quality to hyperparameters to debugging implementation issues, and you wind up with a model which is already obsolete on release with no place on the Pareto frontier and so gets 0% use.
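To put toy numbers on the ‘1% penalty on everything’ point (the stage list and the per-stage penalty below are made-up illustrative values, not estimates of FB’s actual pipeline):

```python
# Toy illustration of the compounding: many independent "1% worse than it should be" stages.
stages = ["data quality", "architecture", "hyperparameters", "compute throughput",
          "implementation bugs", "evaluation", "tokenization", "RLHF", "infra", "taste"]
penalty_per_stage = 0.01  # each stage is 1% worse than a rigorous lab would manage

effective_quality = (1 - penalty_per_stage) ** len(stages)
print(f"{effective_quality:.3f}")  # ~0.904 -- roughly 10% off, i.e. no place on the Pareto frontier
```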
That would be tricky because you are comparing apples and oranges. Consider that for the USA, there are only 11 cardinals (of 252 worldwide), while there are 10x more federal senators at any moment (I don’t know if there would be more or fewer in total: senators tend to be much younger, but cardinals also tend to be long-lived), and I can’t even guess how many ‘Fortune 500 C-level employees’ there might be given corporate turnover and the size of many ‘C-suites’ - tens of thousands, maybe? So your suggestions span ~1-3 orders of magnitude less selectivity than cardinals do.
whomever makes it into the college of cardinals.
I would be surprised if that was the primary homosexuality-enriching step, given that reporting has always been that quite a lot of low-level parish priests are also gay. (Note, for example, how many of the sexual abuse scandal victims were boys/men.) I would guess that it operates fairly steadily at all levels, starting from simply which young boys opt for the priesthood (known to be a demanding and difficult occupation even if the celibacy requirement is, for you, not so onerous) and operating from there; if I had to guess where the biggest enrichment is, it’d be at the ‘leaving your country for the Vatican’ step, given how notoriously gay the Vatican is. So going there suggests either that you are gay (and so the buggery isn’t a bug, it’s a feature) or you are highly ambitious and don’t mind it (or are willing to exploit it and, again, not a bug but a feature).
We have not yet tried 4.5 as it’s so expensive that we would not be able to deploy it, even for limited sections.
Still seems like potentially valuable information to know: how much does small-model smell cost you? What happens if you ablate reasoning? If it is factual knowledge and GPT-4.5 performs much better, then that tells you things like ‘maybe finetuning is more useful than we think’, etc. If you are already set up to benchmark all these OA models, then a datapoint from GPT-4.5 should be quite easy and just a matter of a small amount of chump change in comparison to the insight, like a few hundred bucks.
22.3 percent of cardinals are reported as eldest children. That compares to 21.6 percent which are youngest children. Eldest children are still favored.
I kept waiting for you to discuss this point: in considering analysis of cardinals (as opposed to ordinary random people), what about the other relevant birth-order effects? Like the… first-born eldest birth order effect, where first-borns are smarter, more extraverted, stabler, higher-SES etc. All of which sounds exactly like the sort of thing you need to rise through an extreme hierarchy to the top.
After all, surely homosexuality is not the only (or even primary) trait the Catholic Church hierarchy is trying to select for?
The failure of the compute-rich Llama models to compete with the compute poorer but talent and drive rich Alibaba and DeepSeek
This seems like it’s exaggerating the Llama failure. Maybe the small Llama-4s just released yesterday are a bit of a disappointment because they don’t convincingly beat all the rivals; but how big a gap is that absolutely? When it comes to DL models, there’s generally little reason to use #2; but that doesn’t mean #2 was all that much worse and ‘a failure’ - it might only have been weeks behind #1. (Indeed, a model might’ve been the best when it was trained, and release just took a while. Would it be reasonable to call such a model a ‘failure’? I wouldn’t. It might be a failure of a business model or a corporate strategy, but that model qua model is a good model, Bront.) #2 just means it’s #2, lesser by any amount. How far back would we have to go for the small Llama-4s to have been on the Pareto frontier? It’s still early, but I’m getting the impression so far that you wouldn’t have to go that far back. Certainly not ‘years’ (it couldn’t perform that well on LMArena in its ‘special chatbot configuration’ even sloptimized if it was years behind), unless the wilder rumors turn out to be true (like deliberately training on the test sets—in which case, Zuckerberg may have to burn FB AI with fire and reboot the entire AI org because the culture is irretrievably rotten—but of course such rumors usually do not, so I mention this mostly to indicate that right now Llama Internet commentary is high on heat and low on light).
The failure of the compute-rich Llama models to compete with the compute poorer but talent and drive rich Alibaba and DeepSeek shows that even a substantial compute lead can be squandered. Given that there is a lot of room for algorithmic improvements (as proven by the efficiency of the human brain), this means that determined engineering plus willingness to experiment rather than doubling-down on currently working tech.
I’m not really following your argument here. Even if LLaMA-4 is disappointing compared to what DeepSeek could’ve done with the same compute because they’d get 40% MFU instead of FB’s 20% or whatever, and are 2x as good in effective-compute, that doesn’t close the lead when FB finishes its new Manhattan-sized datacenter, say, and has 100x DS’s compute. Or are you arguing for the possibility of someone making an asymptotic scaling law breakthrough with a better exponent, so that even with 1/100th the compute, they can beat one of the giants?
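To spell out the arithmetic under a toy power-law scaling assumption (the exponent, MFU figures, and 100x compute ratio below are all illustrative, not measured values):

```python
import math

# Stylized scaling-law arithmetic (all numbers illustrative, not measured).
# Treat "capability" as effective_compute ** alpha for a toy power-law exponent alpha.
alpha = 0.05
fb_compute = 100.0     # FB's raw compute once the giant datacenter is finished (arbitrary units)
ds_compute = 1.0       # the compute-poor lab's raw compute
ds_efficiency = 2.0    # e.g. 40% vs 20% MFU -> 2x effective compute

print(fb_compute ** alpha)                    # ~1.26
print((ds_compute * ds_efficiency) ** alpha)  # ~1.04: a constant-factor edge doesn't close a 100x gap

# To win on 1/100th the compute you need a better *exponent*, not a better constant:
needed_alpha = alpha * math.log(fb_compute) / math.log(ds_compute * ds_efficiency)
print(needed_alpha)  # ~0.33, i.e. a ~7x better exponent -- an asymptotic breakthrough, not mere engineering
```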
The human microbiome is irrelevant to this topic. The microbiome is highly heritable (usual twin studies & SNP heritabilities), caused by genes and the environment, as well as being unstable; its direct causal effects in normal humans are minimal. We know that it is supremely irrelevant because environmental changes like antibiotics or new food or global travel which produce large changes in personal (and offspring) microbiomes do not produce large changes in intelligence (of oneself or offspring); and most dramatically, germ-free humans exist and are of normal or even above-average intelligence, eg the fascinating mistakes and delusions of David despite his high intelligence. (Amusingly, germ-free mice apparently even live longer.) Microbiome research is, in general, very low quality and can’t be taken seriously—look at your link:
Examples of how important the gut microbiome, and the parents’ health, are for human development: https://humanmicrobiome.info/maternity/
Most of this page is meaningless mouse studies (infamous for not replicating and getting whatever result the experimenter wants and the animal model literature having huge systemic biases), and the handful of actual human studies I see here are all garbage—things like cross-sectional studies with large known familial confounding, or heavy reliance on things like breastfeeding where the beneficial effects disappear when controlling for just some confounds. This also goes for much-touted correlations like autism. There’s not a single result on this page that provides a shred of evidence for your implied thesis that microbiome interventions could, even in theory, possibly matter to ‘how to make superbabies’. It doesn’t.
In 2021 a geneticist insisted to me that the microbiome was just a fad.
He was right. BTW, you remember what happened in 2021, right?
EDIT: If anyone cares, I’m not bothering to respond to Harrop’s comment in depth because I think his Gish-gallopy response manages to exemplify many of the criticisms I already made, where they are not outright non-responsive (eg. his ‘disagreement’ about the reasons why germ-free mice live longer is obviously irrelevant to my point that they do), and I’m not worried anyone is going to waste any time on ‘the microbiome’ for these purposes when this is the best case he can make. You can see he has no case for ‘superbabies’ having anything to do with known microbiome stuff, and I do not care enough about the microbiome per se to prosecute it further.
I don’t really understand how a local copy of the weights gives the terrorists more practical control over the software’s alignment. I don’t think it’s easy to manually tweak weights for so specific a purpose. Maybe they just mean the API is doing a good job of blocking sketchy requests?
You can finetune models for any specific purpose: just provide a few datapoints and train. The more specific the purpose, the easier tweaking the weights is, not harder. (Surely, if nothing else, you’ve seen all of the LoRAs and other things for finetuning image generation models to generate a specific character?) There is an extensive literature at this point on how it is trivial to strip away all of the friendly chatbot persona from released checkpoints, such as LLaMA, if you are able to access and modify the model’s slow weights and fast weights directly.
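For concreteness, a minimal sketch of the kind of lightweight finetuning meant here, assuming the standard HuggingFace transformers + peft stack (the checkpoint name is a placeholder for any released open-weights model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder: any released open-weights checkpoint
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Attach small low-rank adapters to the attention projections; only these few
# million parameters get trained, while the base weights stay frozen.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# From here, a standard causal-LM training loop over a few dozen (prompt, completion)
# pairs in the behavior you want is enough to shift the model's persona --
# which is exactly why local weights mean local control.
```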
Yes, in dire straits. But it’s usually called ‘hyperinflation’ when you try to make seignorage equivalent to >10% of GDP and fund the government through deliberately creating high inflation (which is on top of any regular inflation, of course). And because inflation is about expectations in considerable part, you can’t stop it either. Not to mention what happens when you start hyperinflation.
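A stylized quantity-theory illustration of why the arithmetic blows up (the base-to-GDP ratio and the flight-from-cash rate below are assumed round numbers, not estimates of any real economy):

```python
# Seigniorage revenue as a share of GDP ~= money growth rate * (monetary base / GDP).
base_to_gdp = 0.10      # monetary base worth ~10% of GDP (assumed)
revenue_target = 0.10   # want seigniorage worth >10% of GDP

print(revenue_target / base_to_gdp)  # 1.0 -> roughly doubling the money supply every year

# But high expected inflation makes people dump cash, shrinking the real base,
# so the required growth rate ratchets up each round -- the hyperinflation spiral:
for year in range(1, 6):
    base_to_gdp *= 0.7  # assumed flight from cash as expectations adjust
    print(year, round(revenue_target / base_to_gdp, 2))
```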
(FWIW, this is a perfectly reasonable question to ask a LLM first. eg Gemini-2.5-pro will give you a thorough and sensible answer as to why this would be extraordinarily destructive and distortionary, and far worse than the estimated burden of tax return filing, and it would likely satisfy your curiosity on this thought-experiment with a much higher quality answer than anyone on LW2, including me, is ever likely to provide.)
This model seems to contradict https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target because it has, in fact, developed reward as the optimization target without ever being instructed to maximize reward.
It doesn’t contradict Turntrout’s post, because his claims are about an irrelevant class of RL algorithms (model-free policy gradients). A model-based RL setting (like a human, or a LLM like Claude pretrained to imitate model-based RL agents in a huge number of settings, ie. human text data) optimizes the reward, if it’s smart and knowledgeable enough to do so.
(This comment is another example of how Turntrout’s post was a misfire because everyone takes away the opposite of what they should have.)
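To make the distinction concrete, here is a toy sketch contrasting the two kinds of agent on a 2-armed bandit (everything below is a caricature with made-up numbers): the model-free policy-gradient learner is merely shaped by reward, while the model-based agent explicitly queries its learned reward model, and so does treat reward as the optimization target.

```python
import math, random

# Toy 2-armed bandit (made-up payouts); arm 1 pays more on average.
def pull(arm: int) -> float:
    return random.gauss(1.0 if arm == 1 else 0.2, 0.1)

# 1. Model-free policy gradient (REINFORCE-style): reward appears only as a scalar
#    multiplier on the parameter update; the learned policy never represents
#    "reward" as a quantity to be maximized -- it is merely shaped by it.
prefs = [0.0, 0.0]
for _ in range(2000):
    p1 = 1.0 / (1.0 + math.exp(prefs[0] - prefs[1]))   # P(choose arm 1)
    arm = 1 if random.random() < p1 else 0
    r = pull(arm)
    grad = (1 - p1) if arm == 1 else -p1               # d log pi(arm) / d prefs[1]
    prefs[1] += 0.1 * r * grad
    prefs[0] -= 0.1 * r * grad

# 2. Model-based agent: fit an explicit reward model, then plan against it.
#    Here predicted reward *is* the optimization target at decision time.
reward_model, counts = [0.0, 0.0], [0, 0]
for _ in range(200):
    arm = random.randrange(2)                          # explore to fit the reward model
    counts[arm] += 1
    reward_model[arm] += (pull(arm) - reward_model[arm]) / counts[arm]
best_arm = max(range(2), key=lambda a: reward_model[a])

print(prefs, reward_model, best_arm)
```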
The intuition behind this approach draws from our understanding of selection in biological systems. Consider how medieval Europe dealt with violence:
This is a bad example because first, your description is incorrect (Clark nowhere suggests this in Farewell to Alms, as I just double-checked, because his thesis is about selecting for high-SES traits, not selecting against violence, and in England, not Europe—so I infer you are actually thinking of the Frost & Harpending thesis, which is about Western Europe, and primarily post-medieval England at that); second, the Frost & Harpending truncation selection hypothesis has little evidence for it and can hardly be blandly referred to, as if butter wouldn’t melt in your mouth, as obviously ‘how medieval Europe dealt with violence’ (I don’t particularly think it’s true myself, just a cute idea about truncation selection, nor is it obvious whether it can account for a majority, much less all, of the secular decline in violence); and third, it is both a weird opaque obscure example that doesn’t illustrate the principle very well and is maximally inflammatory.
Experimentation is valuable for the high VoI, but it seems hard to encourage ‘in general’, because experimenting on anything is painful and difficult, and the more so the more important and valuable it is. So just ‘subsidizing experiments’ would be like ‘subsidizing fixing bugs in source code’.
What would you do if you were a funder who wanted to avoid this? Well, you’d… fund specific experiments you knew were important and of high-value. Which is what the federal government and many other NGOs or philanthropists do.
OP’s example is correct and you are wrong. ‘Pigheaded’ is neither a proposed root cause analysis nor does it mean ‘are dumb’; perhaps you should check a dictionary before correcting others’ usage. It means stubborn, strong-willed, obstinate, often to the point of foolishness or taking very harmful actions, or to quote the OED: “Having a head like that of a pig. Chiefly figurative: stupidly obstinate, perverse, or set in one’s ways.” Note: it is “stupidly obstinate”, and not “stupid”. This is because pigs are notoriously smart but stubborn: very strong, heavy, often hungry, whose mind can’t easily be changed by an unfortunate swineherd or passerby in their way. (And this usage has been consistent since the start: the OED will give you the first attestation of it to Ben Jonson, where it describes a small-minded printer who thinks that high-quality news has to be paid for, because that’s how he operates; Jonson then mocks some other tradesmen for their own kinds of narrowmindedness, but not for any of them being low-IQ.) Hence, the Russell conjugation is correct: “pigheaded” is the highly insulting figurative term which intensifies the negative “obstinate” which is the bad version of the positive “firm”. Just as ‘firm’ does not principally mean ‘dumb’, ‘pigheaded’ doesn’t principally mean it either.