N = #params, D = #data
Training compute = const. * N * D
Forward pass cost = c * N, and each forward pass reveals R bits about the weights; assume R = Ω(1) on average
Now, thinking purely information-theoretically:
Model stealing compute = (c * N per forward pass) * (16 * N / R forward passes to recover the fp16 weights) ~ const. * c * N^2
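As a sanity check on the arithmetic, here is a minimal sketch of this estimate in Python; the 16 bits per weight comes from fp16 above, while c ≈ 2 FLOPs per parameter per token and R = 1 bit per token are illustrative assumptions, not numbers from the argument itself.

```python
def stealing_flops(n_params, bits_per_weight=16, r_bits_per_token=1.0, c=2.0):
    """Rough FLOPs to extract the fp16 weights, purely information-theoretically.

    Each generated token costs ~c * N FLOPs and leaks ~R bits about the weights,
    so recovering bits_per_weight * N bits takes bits_per_weight * N / R forward passes.
    """
    queries = bits_per_weight * n_params / r_bits_per_token
    return queries * c * n_params  # ~ const. * c * N^2 / R

# e.g. a 70B-parameter model at R = 1 bit/token: ~1.6e23 FLOPs
print(f"{stealing_flops(70e9):.1e}")
```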
If training is compute-optimal and α = β in the Chinchilla scaling law (so D ~ N and training compute ~ const. * N^2):
Model stealing compute ~ Training compute
For significantly overtrained models:
Model stealing << Training compute
Typically:
Total inference compute ~ Training compute
=> Model stealing << Total inference compute
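To make the "~" and "<<" above concrete, here is a hedged numerical comparison, using the standard 6ND approximation for training FLOPs and D ≈ 20N for the Chinchilla-optimal case; the 200N overtraining ratio and R = 1 bit/token are illustrative assumptions, not figures from the argument.

```python
# Illustrative comparison; all constants are assumptions except fp16 = 16 bits.
N = 70e9  # parameters


def train_flops(d_tokens):
    return 6 * N * d_tokens  # standard 6*N*D training-FLOPs approximation


steal = (16 * N / 1.0) * (2 * N)  # (16N bits / R = 1 bit per token) * ~2N FLOPs per token

print(f"stealing:           {steal:.1e} FLOPs")
print(f"Chinchilla-optimal: {train_flops(20 * N):.1e} FLOPs")   # ~4x stealing: same order
print(f"overtrained:        {train_flops(200 * N):.1e} FLOPs")  # ~40x stealing
```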
Caveats:
- A prior on the weights reduces the stealing compute; the same holds if you only want to recover partial information about the model (e.g. enough to create an equally capable one)
- Of course, if the model reveals much less than 1 bit per forward pass (R << 1), then model stealing compute becomes very large
I think the labs might well be rational in focusing on this sort of “handheld automation”, just to enable their researchers to code experiments faster and in smaller teams.
My mental model of AI R&D is that it is bottlenecked by roughly three things: compute, engineering time, and the “dark matter” of research taste and feedback loops on messy research results. I can certainly imagine a model of lab productivity in which the best way to accelerate, for the entirety of 2025, is improving handheld automation. Say the core paradigm is fixed; within that paradigm, the research team has more promising ideas than time to implement and try out in smaller-scale experiments, and they really do not want to hire more people.
If you consider the AI lab as the fundamental unit that wants to increase its velocity, and that works on things which make its models faster, it’s plausible that the labs are aware of how bad model performance is on research taste and are still not making a mistake by ignoring your “dark matter” right now. They will work on it once they are faster.