Interested in math puzzles, fermi estimation, strange facts about the world, toy models of weird scenarios, unusual social technologies, and deep dives into the details of random phenomena.
Working on the pretraining team at Anthropic as of October 2024; before that I did independent alignment research of various flavors and worked in quantitative finance.
Drake Thomas
The theoretical maximum FLOPS of an Earth-bound classical computer is something like .
Is this supposed to have a different base or exponent? A single H100 already gets like 10^15 FLOP/s.
So I would guess it should be possible to post-train an LLM to give answers like “................… Yes” instead of “Because 7! contains both 3 and 5 as factors, which multiply to 15. Yes”, and the LLM would still be able to take advantage of CoT
This doesn’t necessarily follow—on a standard transformer architecture, this will give you more parallel computation but no more serial computation than you had before. The bit where the LLM does N layers’ worth of serial thinking to say “3” and then that “3” token can be fed back into the start of N more layers’ worth of serial computation is not something that this strategy can replicate!
Empirically, if you look at figure 5 in Measuring Faithfulness in Chain-of-Thought Reasoning, adding filler tokens doesn’t really seem to help models get these questions right.
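To make the serial-vs-parallel point concrete, here's a toy calculation (a minimal sketch with made-up layer and token counts, not a claim about any particular model):

```python
# Toy illustration: an L-layer transformer performs at most ~L serial steps of
# computation within a single forward pass, no matter how many filler tokens
# sit in the context (those only add parallel width). Each *generated* token
# that gets fed back in buys another full pass, so k CoT tokens give roughly
# k * L serial steps.

def serial_depth(num_layers: int, generated_tokens: int) -> int:
    # Filler tokens already in the prompt don't appear in this count at all.
    return num_layers * max(generated_tokens, 1)

L = 96  # a GPT-3-scale layer count, chosen just for illustration
print(serial_depth(L, generated_tokens=1))    # filler-only answer: ~96 serial steps
print(serial_depth(L, generated_tokens=100))  # 100 CoT tokens fed back: ~9600
```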
I don’t think that’s true—in eg the GPT-3 architecture, and in all major open-weights transformer architectures afaik, the attention mechanism is able to feed lots of information from earlier tokens and “thoughts” of the model into later tokens’ residual streams in a non-token-based way. It’s totally possible for the models to do real introspection on their thoughts (with some caveats about eg computation that occurs in the last few layers), it’s just unclear to me whether in practice they perform a lot of it in a way that gets faithfully communicated to the user.
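For concreteness, here's a minimal single-head causal attention sketch in numpy (purely illustrative, not any particular production architecture): a later position's output is a mixture of earlier positions' residual-stream content, not just the tokens those positions emitted.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d_model) residual streams; returns one attention head's output."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: position i may only attend to positions j <= i.
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each row is a mixture of earlier positions' value vectors

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))  # five positions' residual streams
out = causal_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (5, 8): position 4's output depends on streams 0 through 4
```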
Yeah, I’m thinking about this in terms of introspection on non-token-based “neuralese” thinking behind the outputs; I agree that if you conceptualize the LLM as being the entire process that outputs each user-visible token including potentially a lot of CoT-style reasoning that the model can see but the user can’t, and think of “introspection” as “ability to reflect on the non-user-visible process generating user-visible tokens” then models can definitely attain that, but I didn’t read the original post as referring to that sort of behavior.
In other words, they can think about the thoughts “behind” the previous words they wrote. If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its “attention” modules are actually intentionally designed to know this sort of thing, using key/query/value lookups that occur “behind the scenes” of the text you actually see on screen.
I don’t think that asking an LLM what its words are referring to is a convincing demonstration that there’s real introspection going on in there, as opposed to “plausible confabulation from the tokens written so far”. I think it is plausible there’s some real introspection going on, but I don’t think this is a good test of it—the sort of thing I would find much more compelling is if the LLMs could reliably succeed at tasks like
Human: Please think of a secret word, and don’t tell me what it is yet.
LLM: OK!
Human: What’s the parity of the alphanumeric index of the penultimate letter in the word, where A=1, B=2, etc?
LLM: Odd.
Human: How many of the 26 letters in the alphabet occur multiple times in the word?
LLM: None of them.
Human: Does the word appear commonly in two-word phrases, and if so on which side?
LLM: It appears as the second word of a common two-word phrase, and as the first word of a different common two-word phrase.
Human: Does the word contain any other common words as substrings?
LLM: Yes; it contains two common words as substrings, and in fact is a concatenation of those two words.
Human: What sort of role in speech does the word occupy?
LLM: It’s a noun.
Human: Does the word have any common anagrams?
LLM: Nope.
Human: How many letters long is the closest synonym to this word?
LLM: Three.
Human: OK, tell me the word.
LLM: It was CARPET.

but couldn’t (even with some substantial effort at elicitation) infer hidden words from such clues without chain-of-thought when they weren’t the ones to think of them. That would suggest to me that there’s some pretty real reporting on a piece of hidden state not easily confabulated about after the fact.
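(As an aside, the mechanically checkable clues above really are consistent with CARPET; here's an illustrative Python sanity check, with "rug" assumed as the intended three-letter synonym:)

```python
word = "CARPET"

# Penultimate letter is E, alphanumeric index 5, which is odd.
assert (ord(word[-2]) - ord("A") + 1) % 2 == 1

# No letter of the alphabet occurs multiple times in the word.
assert all(word.count(c) == 1 for c in word)

# The word is a concatenation of two common words.
assert word == "CAR" + "PET"

# The closest synonym (assumed here to be "rug") is three letters long.
assert len("rug") == 3

print("All checkable clues are consistent with CARPET.")
```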
I think my original comment was ambiguous—I also consider myself to have mostly figured it out, in that I thought through these considerations pretty extensively before joining and am in a “monitoring for new considerations or evidence or events that might affect my assessment” state rather than a “just now orienting to the question” state. I’d expect to be most useful to people in shoes similar to my past self (deciding whether to apply or accept an offer) but am pretty happy to talk to anyone, including eg people who are confident I’m wrong and want to convince me otherwise.
See my reply to Ryan—I’m primarily interested in offering advice on something like that question since I think it’s where I have unusually helpful thoughts, I don’t mean to imply that this is the only question that matters in making these sorts of decisions! Feel free to message me if you have pitches for other projects you think would be better for the world.
Yeah, I agree that you should care about more than just the sign bit. I tend to think the magnitude of effects of such work is large enough that “positive sign” often is enough information to decide that it dominates many alternatives, though certainly not all of them. (I also have some kind of virtue-ethical sensitivity to the zero point of the impacts of my direct work, even if second-order effects like skill building or intra-lab influence might make things look robustly good from a consequentialist POV.)
The offer of the parent comment is more narrowly scoped, because I don’t think I’m especially well suited to evaluate someone else’s comparative advantages but do have helpful things to say on the tradeoffs of that particular career choice. Definitely don’t mean to suggest that people (including myself) should take on capability-focused roles iff they’re net good!
I did think a fair bit about comparative advantage and the space of alternatives when deciding to accept my offer; I’ve put much less work into exploration since then, arguably too much less (eg I suspect I don’t quite meet Raemon’s bar). Generally happy to get randomly pitched on things, I suppose!
I work on a capabilities team at Anthropic, and in the course of deciding to take this job I’ve spent[1] a while thinking about whether that’s good for the world and which kinds of observations could update me up or down about it. This is an open offer to chat with anyone else trying to figure out questions of working on capability-advancing work at a frontier lab! I can be reached at “graham’s number is big” sans spaces at gmail.
[1] and still spend—I’d like to have Joseph Rotblat’s virtue of noticing when one’s former reasoning for working on a project changes.
Drake Thomas’s Shortform
I agree it seems unlikely that we’ll see coordination on slowing down before one actor or coalition has a substantial enough lead over other actors that it can enforce such a slowdown unilaterally, but I think it’s reasonably likely that such a lead will arise before things get really insane.
A few different stories under which one might go from aligned “genius in a datacenter” level AI at time t to outcomes merely at the level of weirdness in this essay at t + 5-10y:
- The techniques that work to align “genius in a datacenter” level AI don’t scale to wildly superhuman intelligence (eg because they lose some value fidelity from human-generated oversight signals that’s tolerable at one remove but very risky at ten). The alignment problem for serious ASI is quite hard to solve at the mildly superintelligent level, and it genuinely takes a while to work out enough that we can scale up (since the existing AIs, being aligned, won’t design unaligned successors).
- If people ask their only-somewhat-superhuman AI what to do next, the AIs say “A bunch of the decisions from this point on hinge on pretty subtle philosophical questions, and frankly it doesn’t seem like you guys have figured all this out super well, have you heard of this thing called a long reflection?” That’s what I’d say if I were a million copies of me in a datacenter advising a 2024-era US government on what to do about Dyson swarms!
- A leading actor uses their AI to ensure continued strategic dominance and prevent competing AI projects from posing a meaningful threat. Having done so, they just… don’t really want crazy things to happen really fast, because the actor in question is mostly composed of random politicians or whatever. (I’m personally sympathetic to astronomical waste arguments, but it’s not clear to me that people likely to end up with the levers of power here are.)
- The serial iteration times and experimentation loops are just kinda slow and annoying, and mildly-superhuman AI isn’t enough to circumvent experimentation time bottlenecks (some of which end up being relatively slow), and there are stupid zoning restrictions on the land you want to use for datacenters, and some regulation adds lots of mandatory human overhead to some critical iteration loop, etc.
  - This isn’t a claim that maximal-intelligence-per-cubic-meter ASI initialized in one datacenter would face long delays in making efficient use of its lightcone, just that it might be tough for a not-that-much-better-than-human AGI that’s aligned and trying to respect existing regulations and so on to scale itself all that rapidly.
- Among the tech unlocked in relatively early-stage AGI is better coordination, and that helps Earth get out of unsavory race dynamics and decide to slow down.
- The alignment tax at the superhuman level is pretty steep, and doing self-improvement while preserving alignment goes much slower than unrestricted self-improvement would; since at this point we have many fewer ongoing moral catastrophes (eg everyone who wants to be cryopreserved is, we’ve transitioned to excellent cheap lab-grown meat), there’s little cost to proceeding very cautiously.
  - This is sort of a continuous version of the first bullet point with a finite rather than infinite alignment tax.
All that said, upon reflection I think I was probably lowballing the odds of crazy stuff on the 10y timescale, and I’d go to more like 50-60% that we’re seeing mind uploads and Kardashev level 1.5-2 civilizations etc. a decade out from the first powerful AIs.
I do think it’s fair to call out the essay for not highlighting the ways in which it might be lowballing things or rolling in an assumption of deliberate slowdown; I’d rather it have given more of a nod to these considerations and made the conditions of its prediction clearer.
(I work at Anthropic.) My read of the “touch grass” comment is informed a lot by the very next sentences in the essay:
But more importantly, tame is good from a societal perspective. I think there’s only so much change people can handle at once, and the pace I’m describing is probably close to the limits of what society can absorb without extreme turbulence.
which I read as saying something like “It’s plausible that things could go much faster than this, but as a prediction about what will actually happen, humanity as a whole probably doesn’t want things to get incredibly crazy so fast, and so we’re likely to see something tamer.” I basically agree with that.
Do Anthropic employees who think less tame outcomes are plausible believe Dario when he says they should “touch grass”?
FWIW, I don’t read the footnote as saying “if you think crazier stuff is possible, touch grass”—I read it as saying “if you think the stuff in this essay is ‘tame’, touch grass”. The stuff in this essay is in fact pretty wild!
That said, I think I have historically underrated questions of how fast things will go given realistic human preferences about the pace of change, and that I might well have updated more in the above direction if I’d chatted with ordinary people about what they want out of the future, so “I needed to touch grass” isn’t a terrible summary. But IMO believing “really crazy scenarios are plausible on short timescales and likely on long timescales” is basically the correct opinion, and to the extent the essay can be read as casting shade on such views it’s wrong to do so. I would have worded this bit of the essay differently.
Re: honesty and signaling, I think it’s true that this essay’s intended audience is not really the crowd that’s already gamed out Mercury disassembly timelines, and its focus is on getting people up to shock level 2 or so rather than SL4, but as far as I know everything in it is an honest reflection of what Dario believes. (I don’t claim any special insight into Dario’s opinions here, just asserting that nothing I’ve seen internally feels in tension with this essay.) Like, it isn’t going out of its way to talk about the crazy stuff, but I don’t read that omission as dishonest.
For my own part:
- I think it’s likely that we’ll get nanotech, von Neumann probes, Dyson spheres, computronium planets, acausal trade, etc in the event of aligned AGI.
- Whether that stuff happens within the 5-10y timeframe of the essay is much less obvious to me—I’d put it around 30-40% odds conditional on powerful AI from roughly the current paradigm, maybe?
- In the other 60-70% of worlds, I think this essay does a fairly good job of describing my 80th percentile expectations (by quality-of-outcome rather than by amount-of-progress).
- I would guess that I’m somewhat more Dyson-sphere-pilled than Dario.
I’d be pretty excited to see competing forecasts for what good futures might look like! I found this essay helpful for getting more concrete about my own expectations, and many of my beliefs about good futures look like “X is probably physically possible; X is probably usable-for-good by a powerful civilization; therefore probably we’ll see some X” rather than having any kind of clear narrative about how the path to that point looks.
I’ve fantasized about a good version of this feature for math textbooks since college—would be excited to beta test or provide feedback about any such things that get explored! (I have a couple math-heavy posts I’d be down to try annotating in this way.)
(I work on capabilities at Anthropic.) Speaking for myself, I think of international race dynamics as a substantial reason that trying for global pause advocacy in 2024 isn’t likely to be very useful (and this article updates me a bit towards hope on that front), but I think US/China considerations get less than 10% of the Shapley value in me deciding that working at Anthropic would probably decrease existential risk on net (at least, at the scale of “China totally disregards AI risk” vs “China is kinda moderately into AI risk but somewhat less than the US”—if the world looked like China taking it really really seriously, eg independently advocating for global pause treaties with teeth on the basis of x-risk in 2024, then I’d have to reassess a bunch of things about my model of the world and I don’t know where I’d end up).
My explanation of why I think it can be good for the world to work on improving model capabilities at Anthropic looks like an assessment of a long list of pros and cons and murky things of nonobvious sign (eg safety research on more powerful models, risk of leaks to other labs, race/competition dynamics among US labs) without a single crisp narrative, but “have the US win the AI race” doesn’t show up prominently in that list for me.
A proper Bayesian currently at less than 0.5% credence for a proposition P should assign a less than 1 in 100 chance that their credence in P rises above 50% at any point in the future. This isn’t a catch for someone who’s well-calibrated.
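A sketch of why (treating a well-calibrated agent's credence trajectory as a martingale, stopped the first time it crosses 1/2; the notation here is my own):

```latex
% p_0: current credence in P (here p_0 < 0.005).
% q: probability that the credence ever rises above 1/2.
% Conservation of expected evidence / optional stopping gives
p_0 \;=\; \mathbb{E}[p_\tau] \;\ge\; q \cdot \tfrac{1}{2} + (1 - q) \cdot 0
\quad\Longrightarrow\quad
q \;\le\; 2 p_0 \;<\; 2 \cdot 0.005 \;=\; 0.01 .
```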
In the example you give, the extent to which it seems likely that critical typos would happen and trigger this mechanism by accident is exactly the extent to which an observer of a strange headline should discount their trust in it! Evidence for unlikely events cannot be both strong and probable-to-appear, or the events would not be unlikely.
Catastrophic Regressional Goodhart: Appendix
An example of the sort of strengthening I wouldn’t be surprised to see is something like “If is not too badly behaved in the following ways, and for all we have [some light-tailedness condition] on the conditional distribution , then catastrophic Goodhart doesn’t happen.” This seems relaxed enough that you could actually encounter it in practice.
I’m not sure what you mean formally by these assumptions, but I don’t think we’re making all of them. Certainly we aren’t assuming things are normally distributed—the post is in large part about how things change when we stop assuming normality! I also don’t think we’re making any assumptions with respect to additivity; is more of a notational or definitional choice, though as we’ve noted in the post it’s a framing that one could think doesn’t carve reality at the joints. (Perhaps you meant something different by additivity, though—feel free to clarify if I’ve misunderstood.)
Independence is absolutely a strong assumption here, and I’m interested in further explorations of how things play out in different non-independent regimes—in particular we’d be excited about theorems that could classify these dynamics under a moderately large space of non-independent distributions. But I do expect that there are pretty similar-looking results where the independence assumption is substantially relaxed. If that’s false, that would be interesting!
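For intuition about the kind of effect at stake, here's a small simulation (my own notation and distribution choices, not the post's): optimize a proxy equal to true utility plus an independent error, and compare a light-tailed error against a heavy-tailed one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
utility = rng.normal(size=n)             # light-tailed "true utility"
errors = {
    "light-tailed error (normal)": rng.normal(size=n),
    "heavy-tailed error (Cauchy)": rng.standard_cauchy(size=n),
}

for name, err in errors.items():
    proxy = utility + err                        # the thing being optimized
    cutoff = np.quantile(proxy, 0.9999)          # condition on very high proxy
    mean_utility = utility[proxy >= cutoff].mean()
    # With normal error, extreme proxy values come partly from high utility;
    # with Cauchy error, they're almost entirely error, and mean utility stays ~0.
    print(f"{name}: mean true utility among top-proxy samples ≈ {mean_utility:.2f}")
```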
Late edit: Just a note that Thomas has now published a new post in the sequence addressing things from a non-independence POV.
.000002% — that is, one in five hundred thousand

0.000002 would be one in five hundred thousand, but with the percent sign it’s one in fifty million.
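(A quick illustrative check of the factor-of-100 point:)

```python
x = 0.000002          # the number read without a percent sign
print(1 / x)          # ~500,000: one in five hundred thousand
print(1 / (x / 100))  # ~50,000,000: one in fifty million once the % is applied
```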
Indeed, even on basic Bayesianism, volatility is fine as long as the averages work out
I agree with this as far as the example given, but I want to push back on oscillation (in the sense of regularly going from one estimate to another) being Bayesian. In particular, the odds you should put on assigning 20% in the future, then 30% after that, then 20% again, then 30% again, and so on for ten up-down oscillations, shouldn’t be more than half a percent, because each 20 → 30 jump can be at most 2⁄3 probable and each 30 → 20 jump at most 7⁄8 (and (2⁄3)^10 × (7⁄8)^10 ≈ 0.0045).
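(The arithmetic, as a quick check; the bounds come from requiring that expected future credence not exceed current credence:)

```python
# A 20% -> 30% jump can have probability at most 0.2/0.3 (otherwise expected
# future credence would exceed the current 20%); a 30% -> 20% jump at most
# 0.7/0.8, by the same argument applied to not-P.
p_up, p_down = 0.2 / 0.3, 0.7 / 0.8
print((p_up * p_down) ** 10)  # ≈ 0.0045: under half a percent for ten oscillations
```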
So it’s fine to think that you’ve got a decent chance of having all kinds of credences in the future, but thinking “I’ll probably feel one of two ways a few times a week for the next year” is not the kind of belief a proper Bayesian would have. (Not that I think there’s an obvious change to one’s beliefs you should try to hammer in by force, if that’s your current state of affairs, but I think it’s worth flagging that something suboptimal is going on when this happens.)
I’ve gotten enormous value out of LW and its derived communities during my life, at least some of which is attributable to the LW2.0 revival and its effects on those communities. More recently, since moving to the Bay, I’ve been very excited by a lot of the in-person events that Lighthaven has helped facilitate. Also, LessWrong is doing so many things right as a website and source-of-content that no one else does (karma-gated RSS feeds! separate upvote and agree-vote! built-in LaTeX support!) and even if I had no connection to the other parts of its mission I’d want to support the existence of excellently-done products. (Of course there’s also the altruistic case for impact on how-well-the-future-goes, which I find compelling on its own merits.) Have donated $5k for now, but I might increase that when thinking more seriously about end-of-year donations.
(Conflict of interest notice: two of my housemates work at Lightcone Infrastructure and I would be personally sad and slightly logistically inconvenienced if they lost their jobs. I don’t think this is a big contributor to my donation.)