Adam Karvonen

Karma: 542

Adam Karvonen Apr 17, 2025, 8:45 PM
3 points
0
in reply to: ashesfall’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
I don’t think image understanding is the bottleneck. O3 and O4-mini-high seem like a meaningful improvement in vision, to the point where they are almost good enough for this part, but they still fail miserably at the physical reasoning aspects.

This person got O4-mini-high to generate a reasonably close image depiction of the part.

https://x.com/tombielecki/status/1912913806541693253

Adam Karvonen Apr 16, 2025, 7:08 PM
7 points
0
in reply to: Jonathan Claybrough’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
I also tested O3, and it looks better than Gemini 2.5 on vision. Although it missed the second flat, it correctly identified that the ends had different diameters and picked up on some genuinely impressive details, like the grooved thread relief behind the larger thread.
However, it’s still terrible at spatial reasoning, and I now feel more confident in the argument in my post. It proposes many egregious, physically impossible operations. For example, it recommends enclosing 2.2 inches of the part in the collet and then facing the part down to the finished length of 2.000 inches. This is obviously impossible, as the finished face would be buried 0.2 inches within the collet. It also makes bizarre decisions, like clamping on the threads for the second lathe op, when the main diameter is obviously a much better location for rigidity / simplicity. It does correctly identify the chatter issue, FWIW.
It feels a bit worse than Gemini’s plan overall, but this is hard to evaluate. It’s basically “here are two plans with multiple egregious errors, which one is worse?” I’ve also noticed that basically any time I ask an LLM for more specific details on a high-level part of the plan that looks reasonable, it begins to make many egregious errors. So how bad a plan looks depends largely on how much detail the LLM goes into.
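
The collet example above reduces to a one-line feasibility check. A minimal sketch, assuming lengths are measured from the end of the part seated in the collet and that a face can only be cut if it sits outside the collet. The function names are hypothetical and do not come from any CAM tool:

```python
def buried_depth_in(gripped_length_in: float, finished_length_in: float) -> float:
    """Depth (inches) at which the finished face would sit inside the collet.

    Returns 0.0 when the finished face is outside the collet.
    """
    return max(0.0, gripped_length_in - finished_length_in)


def can_face_to_length(gripped_length_in: float, finished_length_in: float) -> bool:
    """Assumption: the tool can only reach a face that lies beyond the gripped length."""
    return finished_length_in > gripped_length_in


# The plan described above: grip 2.2 in, then face to a 2.000 in finished length.
print(can_face_to_length(2.2, 2.000))          # False
print(round(buried_depth_in(2.2, 2.000), 3))   # 0.2
```

A real setup check would also need stickout, tool clearance, and rigidity, which this sketch ignores.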

Adam Karvonen Apr 16, 2025, 4:05 PM
9 points
0
in reply to: Tao Lin’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
I do agree that it looks like there has been a lack of data to address this ability. That being said, I’m pretty surprised at how terrible models are, and there’s a hierarchy of problems to be addressed here before models are actually useful in the physical world. Each step feels much more difficult than the step before, and all models are completely terrible at steps 2-4.
1. First, simply look at a part and identify features / if a part is symmetric / etc. This requires basically no spatial reasoning ability, yet almost all models are completely terrible. Even Gemini is very bad. I’m pretty surprised that this ability didn’t just fall out of scaling on data, but it does seem like this could be easily addressed with synthetic data.
2. Have some basic spatial reasoning ability where you can propose operations that are practical and aren’t physically impossible. This is much more challenging. First, it could be difficult to automatically generate practical solutions. Secondly, it may require moving beyond text chain of thought—when I walk through a setup, I don’t use language at all and just visualize everything.
3. Have an understanding of much of the tacit knowledge in machining, or rederive everything from first principles. Getting data could be especially challenging here.
4. Once you can create a single part correctly, now propose multiple different ways to manufacture the part. Evaluate all of the different plans and choose the best combination of cost, simplicity, and speed. This is the part of the job that’s actually challenging.

Adam Karvonen Apr 15, 2025, 3:42 PM
6 points
0
in reply to: Raphael Roche’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Hmm, I don’t know. With the caveat that I’m not a legal expert, I do think there’s a big difference between basically any job that can be done remotely most of the time and skilled physical labor jobs. I use LLMs for coding every day, and they still have tons of problems, but I do see significant progress happening. There is legitimate uncertainty over how long it will take for AIs to become reliable at tasks like coding.

Coding and ML research also require a lot of subjective taste, like writing easily understandable code with good abstractions or selecting approaches to a research problem. We also see companies like Harvey (legal AI) making over $50M in ARR, while I’m aware of basically no useful manufacturing AI tools.

Adam Karvonen Apr 15, 2025, 3:15 PM
4 points
1
in reply to: SorenJ’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Yeah, I agree. I feel like our current ML approach is going to make very little real-world manufacturing progress, and that any progress will have to come from the automated AI researcher either brute forcing tons of synthetic data or coming up with new architectures and training procedures.

But this is a low-confidence take, and I wouldn’t be shocked if a couple of dumb tricks made a lot of progress.

Adam Karvonen Apr 15, 2025, 2:42 PM
7 points
0
in reply to: SorenJ’s comment on: Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
This is an obvious step, but I’m a bit skeptical for a few reasons.
- Current models are just so bad at vision tasks. Even Gemini 2.5 is pretty bad and falls apart if pushed to harder images. It really seems like identifying a feature on a part, or whether a part is symmetric, is something that could be addressed by just scaling data, and these vision tasks are much easier than manufacturing details.
- A lot of the work in manufacturing / construction would be in tactile details, which could be hard to capture with sensors. For example, a human finger can easily feel a step of 0.001 inches, which would be invisible on video, and I would often use this fine grained tactile detail when diagnosing problems.
- The current reasoning paradigm requires scaling up RL. Where is the reward signal here? The most obvious thing I can think of is creating a bunch of simulated environments. But, almost all machinists I’ve talked to have (completely valid) complaints about engineers that understand textbook formulas and CAD but don’t understand real world manufacturing constraints. Simulation environments seem likely to create AIs with the same shortcomings.

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam Karvonen Apr 14, 2025, 5:38 PM

146 points

41 comments 7 min read LW link

(adamkarvonen.github.io)

Adam Karvonen Jan 24, 2025, 4:17 PM
7 points
0
in reply to: Bogdan Ionut Cirstea’s comment on: SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
A $1 training run would be training 6 SAEs across 6 sparsities at 16K width on Gemma-2-2B for 200M tokens. This includes generating the activations, and it would be cheaper if the activations were precomputed. In practice this seems like a large enough scale to validate ideas such as the Matryoshka SAE or the BatchTopK SAE.

Adam Karvonen Jan 23, 2025, 12:56 AM
3 points
2
in reply to: Bogdan Ionut Cirstea’s comment on: SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
SAEs are early enough that there’s tons of low hanging fruit and ideas to try. They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas. I wouldn’t be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just “Come up with idea, modify existing loss function, train, evaluate, get a quantitative result”.

Adam Karvonen Jan 18, 2025, 5:11 PM
4 points
7
on: Adam Karvonen’s Shortform
If you’re looking for a hackable SAE training repo for experiments, I’d recommend our dictionary_learning repo. It’s been around for a few months, but we’ve recently spent some time cleaning it up and adding additional trainer types.

It’s designed to be simple and hackable: you can add a new SAE type in a single file (~350 lines). We have 8 tested implementations, including JumpReLU, TopK, BatchTopK, Matryoshka, Gated, and others, with BatchTopK recommended as a good default. Training is quick and cheap: training six 16K-width SAEs on Gemma-2-2B for 200M tokens takes ~6 3090 hours, or ~$1.20.
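
The cost figure above implies a rental price of roughly $0.20 per 3090-hour. A quick sketch of that arithmetic, assuming cost scales linearly with GPU-hours. The per-SAE number is only an even amortization, since the comment gives totals rather than a per-SAE breakdown:

```python
def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Implied rental price per GPU-hour, assuming cost is linear in GPU-hours."""
    if gpu_hours <= 0:
        raise ValueError("gpu_hours must be positive")
    return total_cost_usd / gpu_hours


def amortized_cost_per_sae(total_cost_usd: float, num_saes: int) -> float:
    """Even split of the total cost across SAEs (an assumption, not a measured figure)."""
    if num_saes <= 0:
        raise ValueError("num_saes must be positive")
    return total_cost_usd / num_saes


# Figures quoted above: ~6 3090 hours and ~$1.20 for six SAEs.
rate = implied_hourly_rate(1.20, 6.0)
per_sae = amortized_cost_per_sae(1.20, 6)
print(f"${rate:.2f} per GPU-hour, ${per_sae:.2f} per SAE")
# prints: $0.20 per GPU-hour, $0.20 per SAE
```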

The repo integrates with SAE Bench and includes reproducible baselines trained on Pythia-160M and Gemma-2-2B. While it’s not optimized for large models the way Eleuther’s repo is (no CUDA kernels/multi-GPU support) and has fewer features than SAE Lens, it’s great for experiments and trying new architectures.

Here is a link to the repo: https://github.com/saprmarks/dictionary_learning

Adam Karvonen’s Shortform

Adam Karvonen Jan 18, 2025, 5:11 PM

4 points

1 comment LW link

Adam Karvonen Dec 17, 2024, 6:41 PM
3 points
0
in reply to: Sam Marks’s comment on: Sam Marks’s Shortform
The forward hook for our best performing approach is here. As Sam mentioned, this hasn’t been deployed to production. We left it as a case study because Benchify is currently prioritizing other parts of their stack unrelated to ML.
For this demonstration, we added a forward hook to a HuggingFace Transformers model for simplicity, rather than incorporating it into a production inference stack.

Adam Karvonen Dec 17, 2024, 6:31 PM
9 points
0
in reply to: Sam Marks’s comment on: Sam Marks’s Shortform
Rejection sampling is a strong baseline that we hadn’t considered, and it’s definitely worth trying out. I suspect it will perform well here. Currently, our focus is on identifying additional in-the-wild tasks, particularly from other companies, as many of Benchify’s challenges involve sensitive details about their internal tooling that they prefer to keep private. We’re especially interested in tasks where it’s not possible to automatically measure success or failure via string matching, as this is where techniques like model steering are likely to be most practical.

I also agree with Sam that rejection sampling would likely need to operate on entire blocks rather than individual lines. By the time an LLM generates a line containing a regular expression, it’s often already committed to that path—for example, it might have skipped importing required modules or creating the necessary variables to pursue an alternative solution.
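
To make the block-level idea concrete, here is a minimal sketch of rejection sampling over whole code blocks rather than single lines. Everything in it is an assumption for illustration: `generate_block` stands in for a call to an LLM, `uses_regex` is a crude string-matching check for Python `re` usage, and neither reflects Benchify’s actual tooling.

```python
import itertools
import re
from typing import Callable, Optional

# Hypothetical detector: flags blocks that import or call Python's `re` module.
_REGEX_USAGE = re.compile(
    r"^\s*(import re\b|from re import\b)"
    r"|\bre\.(compile|match|search|sub|findall|fullmatch)\(",
    re.MULTILINE,
)


def uses_regex(block: str) -> bool:
    """True if the block appears to use regular expressions."""
    return _REGEX_USAGE.search(block) is not None


def rejection_sample_block(
    generate_block: Callable[[], str],
    is_rejected: Callable[[str], bool],
    max_attempts: int = 8,
) -> Optional[str]:
    """Resample entire blocks until one passes the check, or give up and return None."""
    for _ in range(max_attempts):
        candidate = generate_block()
        if not is_rejected(candidate):
            return candidate
    return None


def make_stub_generator(candidates: list) -> Callable[[], str]:
    """Stand-in for an LLM: cycles through fixed candidate blocks."""
    cycle = itertools.cycle(candidates)
    return lambda: next(cycle)


stub = make_stub_generator([
    "import re\nparts = re.findall(r'\\d+', text)",              # rejected
    "parts = [p for p in text.split(',') if p.isdigit()]",       # accepted
])
print(rejection_sample_block(stub, uses_regex))
```

Sampling the whole block lets the accepted candidate choose different imports and variables from the start, which a line-level resample cannot do once the earlier lines are fixed.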

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

Dec 11, 2024, 6:30 AM

82 points

6 comments 2 min read LW link

(www.neuronpedia.org)

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell

Aug 2, 2024, 7:50 PM

38 points

1 comment 9 min read LW link

Adam Karvonen Jul 22, 2024, 11:54 PM
1 point
0
in reply to: RogerDearnaley’s comment on: Using an LLM perplexity filter to detect weight exfiltration
I agree. In particular, there’s a huge advantage to the defender due to the scale of the model weights. All the defense has to do is raise the bar high enough that an adversary can’t exfiltrate model weights during the lifetime of the model.
If the adversary gains access to the weak model, it still reduces the possible encoded information density, as I discuss here. I haven’t done the experiments, but I would guess this reduction improves significantly if the adversary doesn’t have access to the weak model.
Various ways of changing the weak model weights or retraining the weak model with new data could also increase the size of the defender’s advantage.

Adam Karvonen Jul 22, 2024, 11:18 PM
1 point
−4
in reply to: ryan_greenblatt’s comment on: Preventing model exfiltration with upload limits
I think it does stack if the perplexity filter is applied before the compression step, which relies on an underlying assumption that it’s easier to secure the internet connection choke point rather than the entire data center.

I will try to continue the discussion here for simplicity.

Adam Karvonen Jul 22, 2024, 11:16 PM
2 points
−2
in reply to: ryan_greenblatt’s comment on: Using an LLM perplexity filter to detect weight exfiltration
The perplexity filter works best if the attacker doesn’t have complete control of the entire data center, but instead limited access to some machines. An underlying assumption is that it’s easier to secure the internet connection choke point, rather than everything going on inside the data center.

This can be stacked with compression by applying the perplexity filter before the compression step.

Assuming that the attacker has access to the perplexity filter model, it would still reduce the possible information density if they have to perform some sort of arithmetic encoding scheme that has low perplexity. I had coworkers working on exactly this problem of encoding information into natural language using arithmetic encoding and GPT-2 small, and there’s a major trade-off between the perplexity of the language generated and the density of the encoded information, although I don’t have numbers available to quantify this trade-off.

Adam Karvonen Jul 22, 2024, 8:44 PM
1 point
0
on: Preventing model exfiltration with upload limits
Another potential option (that could stack with text compression) is to filter outgoing data. In particular, if we only allow outgoing data to be language tokens, and we filter the outgoing data to have low perplexity as measured by a small, secret language model, then it could become very difficult to encode terabytes of model weights in a way that has low perplexity.
I discuss this idea more here.
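
A toy sketch of the filtering step, to show what "low perplexity as measured by a small language model" means mechanically. This is an illustration under loud assumptions: a smoothed unigram model stands in for the small, secret language model, tokens are whitespace-split words, and the threshold is arbitrary. A real filter would use a neural LM and a tuned threshold.

```python
import math
from collections import Counter
from typing import Sequence


class UnigramModel:
    """Add-one smoothed unigram model; a stand-in for the small filter LM."""

    def __init__(self, corpus_tokens: Sequence[str]) -> None:
        self.counts = Counter(corpus_tokens)
        self.total = len(corpus_tokens)
        # One extra vocabulary slot acts as a shared bucket for unseen tokens.
        self.vocab_size = len(self.counts) + 1

    def log_prob(self, token: str) -> float:
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))


def perplexity(model: UnigramModel, tokens: Sequence[str]) -> float:
    """exp of the average negative log-likelihood per token."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    avg_nll = -sum(model.log_prob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)


def allow_outgoing(model: UnigramModel, tokens: Sequence[str], max_perplexity: float) -> bool:
    """Permit the message only if it looks like ordinary language to the model."""
    return perplexity(model, tokens) <= max_perplexity


corpus = "the model answers the question and the user reads the answer".split()
filter_model = UnigramModel(corpus)
print(allow_outgoing(filter_model, "the model answers".split(), max_perplexity=10.0))
print(allow_outgoing(filter_model, "x9f qq7 zz1".split(), max_perplexity=10.0))
```

Ordinary text scores low perplexity and passes, while token sequences that look like encoded bytes score high and are blocked. An attacker then has to hide data in text the model finds plausible, which limits how many bits each token can carry.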

Adam Karvonen Jul 22, 2024, 2:52 AM
2 points
0
in reply to: Adam Karvonen’s comment on: Using an LLM perplexity filter to detect weight exfiltration
Thanks for this comment, by the way! I added a paragraph to the beginning to make this post clearer.
