x-posting a kinda rambling thread I wrote about this blog post from Tilde research.
---
If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don’t use regexes. A big milestone for the field of interpretability!
I’ll discuss some things that surprised me about this case study in this thread.
---
The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt engineering and asking an auxiliary model to rewrite answers. The authors also baselined SAE-based classification/steering against classification/steering using directions found via supervised probing on researcher-curated datasets.
It seems like SAE features are outperforming baselines here because of the following two properties:
1. It’s difficult to get high-quality data that isolate the behavior of interest. (I.e. it’s difficult to make a good dataset for training a supervised probe for regex detection.)
2. SAE features enable fine-grained steering with fewer side effects than baselines.
Property (1) is not surprising in the abstract, and I’ve often argued that if interpretability is going to be useful, then it will be for tasks where there are structural obstacles to collecting high-quality supervised data (see e.g. the opening paragraph to section 4 of Sparse Feature Circuits https://arxiv.org/abs/2403.19647).
However, I think property (1) is a bit surprising in this particular instance—it seems like getting good data for the regex task is more “tricky and annoying” than “structurally difficult.” I’d weakly guess that if you are a whiz at synthetic data generation then you’d be able to get good enough data here to train probes that outperform the SAEs. But that’s not too much of a knock against SAEs—it’s still cool if they enable an application that would otherwise require synthetic datagen expertise. And overall, it’s a cool showcase of the fact that SAEs find meaningful units in an unsupervised way.
Property (2) is pretty surprising to me! Specifically, I’m surprised that SAE feature steering enables finer-grained control than prompt engineering. As others have noted, steering with SAE features often results in unintended side effects; in contrast, since prompts are built out of natural language, I would guess that in most cases we’d be able to construct instructions specific enough to nail down our behavior of interest pretty precisely. But in this case, it seems like the task instructions are so long and complicated that the models have trouble following them all. (And if you try to improve your prompt to fix the regex behavior, the model starts misbehaving in other ways, leading to a “whack-a-mole” problem.) And also in this case, SAE feature steering had fewer side-effects than I expected!
I’m having a hard time drawing a generalizable lesson from property (2) here. My guess is that this particular problem will go away with scale, as larger models are able to more capably follow fine-grained instructions without needing model-internals-based interventions. But maybe there are analogous problems that I shouldn’t expect to be solved with scale? E.g. maybe interpretability-assisted control will be useful across scales for resisting jailbreaks (which are, in some sense, an issue with fine-grained instruction-following).
Overall, something surprised me here and I’m excited to figure out what my takeaways should be.
---
Some things that I’d love to see independent validation of:
1. It’s not trivial to solve this problem with simple changes to the system prompt. (But I’d be surprised if it were: I’ve run into similar problems trying to engineer system prompts with many instructions.)
2. It’s not trivial to construct a dataset for training probes that outcompete SAE features. (I’m at ~30% that the authors just got unlucky here.)
---
Huge kudos to everyone involved, especially the eagle-eyed @Adam Karvonen for spotting this problem in the wild and correctly anticipating that interpretability could solve it!
---
I’d also be interested in tracking whether Benchify (the company that had the fuzz-tests-without-regexes problem) ends up deploying this system to production (vs. later finding out that the SAE steering is unsuitable for a reason that they haven’t yet noticed).
Note that this is conditional SAE steering: if the latent doesn’t fire, it’s a no-op. So it’s not that surprising that it’s less damaging; a prompt is there on every input! It depends a lot on the performance of the encoder as a classifier, though.
That’s technically even more conditional, as the intervention (subtracting the parallel component) also depends on the residual stream. But yes, I think it’s reasonable to lump these together: orthogonalisation should also be fairly non-destructive unless the direction was present, while steering likely always has side effects.
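To make the distinction in this exchange concrete, here is a minimal sketch of the three interventions on toy vectors. Everything in it is an assumption for illustration (the single-latent ReLU encoder, the function names, the numbers); it is not the authors’ code.

```python
# Toy comparison of three residual-stream interventions (illustrative only).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def latent_activation(x, w_enc, b_enc):
    """Assumed single-latent SAE encoder: ReLU(w_enc . x + b_enc)."""
    return max(0.0, dot(w_enc, x) + b_enc)

def clamp_latent(x, w_enc, b_enc, d_dec, target=0.0):
    """Conditional clamping: shift along the decoder direction, but only if the latent fired."""
    a = latent_activation(x, w_enc, b_enc)
    if a == 0.0:
        return list(x)  # latent did not fire, so this is a no-op
    return [xi + (target - a) * di for xi, di in zip(x, d_dec)]

def orthogonalise(x, d):
    """Directional ablation: subtract the component of x parallel to d."""
    coeff = dot(x, d) / dot(d, d)
    return [xi - coeff * di for xi, di in zip(x, d)]

def constant_steer(x, d, alpha):
    """Unconditional steering: always add alpha * d, whatever x is."""
    return [xi + alpha * di for xi, di in zip(x, d)]

w_enc, b_enc, d = [1.0, 0.0], -0.5, [1.0, 0.0]
x_off = [0.0, 3.0]  # latent inactive, no component along d
x_on = [2.0, 3.0]   # latent active

print(clamp_latent(x_off, w_enc, b_enc, d))  # unchanged
print(orthogonalise(x_off, d))               # unchanged
print(constant_steer(x_off, d, -1.0))        # changed even though the feature is absent
print(clamp_latent(x_on, w_enc, b_enc, d))   # only the latent's contribution is removed
```

On the input where the feature is absent, the first two interventions leave the activation alone while constant steering still moves it, which is the sense in which clamping and orthogonalisation can be lumped together.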
Isn’t it easy to detect regexes in model outputs and rejection sample lines that contain regexes? This requires some custom sampling code if you want optimal latency/throughput, but the SAEs also require that.
If you have a bunch of things like this, rather than just one or two, I bet rejection sampling gets expensive pretty fast—if you have one constraint which the model fails 10% of the time, dropping that failure rate to 1% brings you from 1.11 attempts per success to 1.01 attempts per success, but if you have 20 such constraints that brings you from 8.2 attempts per success to 1.2 attempts per success.
Early detection of constraint violation plus substantial infrastructure around supporting backtracking might be an even cheaper and more effective solution, though at the cost of much higher complexity.
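The attempts-per-success figures above can be reproduced with a few lines, assuming the constraints fail independently and each has the same failure rate (that independence is an assumption of this sketch, not something the blog post states):

```python
def expected_attempts(failure_rate: float, n_constraints: int = 1) -> float:
    """Expected samples per accepted output when every constraint must pass.

    Assumes independent constraints with equal failure rates, so the number of
    attempts is geometric with success probability (1 - failure_rate) ** n.
    """
    p_success = (1.0 - failure_rate) ** n_constraints
    return 1.0 / p_success

for n in (1, 20):
    for f in (0.10, 0.01):
        attempts = expected_attempts(f, n)
        print(f"{n:>2} constraint(s), {f:.0%} failure each: {attempts:.2f} attempts per success")
```

With 20 constraints the cost is dominated by the compounding term 0.9^20 ≈ 0.12, which is why pushing each failure rate down matters much more there than in the single-constraint case.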
Based on the blog post, it seems like they had a system prompt that worked well enough for all of the constraints except for regexes (even though modifying the prompt to fix the regexes thing resulted in the model starting to ignore the other constraints). So it seems like the goal here was to do some custom thing to fix just the regexes (without otherwise impeding the model’s performance, including performance at following the other constraints).
(Note that using SAEs to fix lots of behaviors might also have additional downsides, since you’re doing a more heavy-handed intervention on the model.)
I’m guessing you’d need to rejection sample entire blocks, not just lines. But yeah, good point, I’m also curious about this. Maybe the proportion of responses that use regexes is too large for rejection sampling to work? @Adam Karvonen
@Adam Karvonen I feel like you guys should test this unless there’s a practical reason that it wouldn’t work for Benchify (aside from “they don’t feel like trying any more stuff because the SAE stuff is already working fine for them”).
Rejection sampling is a strong baseline that we hadn’t considered, and it’s definitely worth trying out; I suspect it will perform well here. Currently, our focus is on identifying additional in-the-wild tasks, particularly from other companies, as many of Benchify’s challenges involve sensitive details about their internal tooling that they prefer to keep private. We’re especially interested in tasks where it’s not possible to automatically measure success or failure via string matching, as this is where techniques like model steering are likely to be most practical.
I also agree with Sam that rejection sampling would likely need to operate on entire blocks rather than individual lines. By the time an LLM generates a line containing a regular expression, it’s often already committed to that path—for example, it might have skipped importing required modules or creating the necessary variables to pursue an alternative solution.
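A rough sketch of what block-level rejection sampling could look like. The `generate` callable stands in for whatever produces a complete candidate test, and `uses_regex` is a deliberately crude, assumed heuristic for Python code; neither comes from the blog post or from Benchify’s setup.

```python
import re
from typing import Callable, Optional

# Crude assumed heuristic: an import of the `re` module, or a call through it.
# A real detector would more likely parse the code than pattern-match it.
_REGEX_USE = re.compile(
    r"^\s*(import\s+re\b|from\s+re\s+import\b)|\bre\.\w+\(",
    re.MULTILINE,
)

def uses_regex(code: str) -> bool:
    return _REGEX_USE.search(code) is not None

def rejection_sample(generate: Callable[[], str], max_attempts: int = 8) -> Optional[str]:
    """Resample whole blocks until one avoids regexes; None if every attempt fails."""
    for _ in range(max_attempts):
        candidate = generate()
        if not uses_regex(candidate):
            return candidate
    return None

# Stand-in generator that yields a regex-based block first, then a string-method one.
candidates = iter([
    "import re\n\ndef is_hex(s):\n    return re.fullmatch('[0-9a-f]+', s) is not None\n",
    "def is_hex(s):\n    return s != '' and all(c in '0123456789abcdef' for c in s)\n",
])
result = rejection_sample(lambda: next(candidates))
print(result)
```

Checking the whole block rather than individual lines sidesteps the commitment problem described above, at the cost of throwing away more tokens per rejection.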
I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.
The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they’re working with small models and adding a handful of SAE calls to a forward pass shouldn’t be too big a hit.)
This uses transformers, which is IIUC way less efficient for inference than e.g. vllm, to an extent that is probably unacceptable for production use cases.
I wonder if it would be possible to do SAE feature amplification / ablation, at least for residual stream features, by inserting a “mostly empty” layer. E.g., for feature ablation, setting the W_O and b_O params of the attention heads of your inserted layer to 0 to make sure that the attention heads don’t change anything, and then approximating the constant / clamping intervention from the blog post via the MLP weights (if the activation function used for the transformer is the same one as is used for the SAE, it should be possible to do a perfect approximation using only one of the MLP neurons, but even if not it should be possible to very closely approximate any commonly-used activation function using any other commonly-used activation function with some clever stacking).
This would of course be horribly inefficient from a compute perspective (each forward pass would take (n+k)/n times as long, where n is the original number of layers the model had and k is the number of distinct layers in which you’re trying to do SAE operations on the residual stream), but I think vllm would handle “llama but with one extra layer” without requiring any tricky inference code changes and plausibly this would still be more efficient than resampling.
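A toy check of the “mostly empty layer” idea for the ablation case. It assumes a plain ReLU MLP with no normalisation in front of it and a single-latent ReLU SAE encoder; real Llama-style layers use RMSNorm and a gated MLP, so this only sketches the algebra and is not a drop-in recipe. All names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ablate_directly(x, w_enc, b_enc, d_dec):
    """Reference intervention: subtract the latent's activation times its decoder direction."""
    a = max(0.0, dot(w_enc, x) + b_enc)
    return [xi - a * di for xi, di in zip(x, d_dec)]

def ablation_layer_params(w_enc, b_enc, d_dec):
    """Weights for one MLP neuron that reproduces the ablation.

    Input weights and bias copy the SAE encoder; output weights are the
    negated decoder direction.
    """
    return list(w_enc), b_enc, [-di for di in d_dec]

def one_neuron_layer(x, w_in, b_in, w_out):
    """Inserted layer: attention output zeroed, so only the residual connection
    plus a single ReLU MLP neuron act on x."""
    hidden = max(0.0, dot(w_in, x) + b_in)
    return [xi + hidden * wi for xi, wi in zip(x, w_out)]

w_enc, b_enc, d_dec = [1.0, 0.5], -1.0, [0.5, -0.25]
params = ablation_layer_params(w_enc, b_enc, d_dec)

for x in ([2.0, 3.0], [0.0, 0.0]):  # latent active, then inactive
    print(ablate_directly(x, w_enc, b_enc, d_dec), one_neuron_layer(x, *params))
```

If the SAE subtracts a decoder bias before encoding, that offset can be folded into the neuron’s bias; clamping to a nonzero constant is harder, since it needs a step-like response rather than a single ReLU.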
The forward hook for our best performing approach is here. As Sam mentioned, this hasn’t been deployed to production. We left it as a case study because Benchify is currently prioritizing other parts of their stack unrelated to ML.
For this demonstration, we added a forward hook to a HuggingFace Transformers model for simplicity, rather than incorporating it into a production inference stack.
I suggested something similar, and this was the discussion (bolding is the important author pushback):
Arthur Conmy
11:33 1 Dec
Why can’t the YC company not use system prompts and instead:
1) Detect whether regex has been used in the last ~100 tokens (and run this check every ~100 tokens of model output)
2) If yes, rewind back ~100 tokens, insert a comment like # Don’t use regex here (in a valid way given what code has been written so far), and continue the generation
Dhruv Pai
10:50 2 Dec
This seems like a reasonable baseline with the caveat that it requires expensive resampling and inserting such a comment in a useful way is difficult.
When we ran baselines simply repeating the number of times we told the model not to use regex right before generation in the system prompt, we didn’t see the instruction following improve (very circumstantial evidence). I don’t see a principled reason why this would be much worse than the above, however, since we do one-shot generation with such a comment right before the actual generation.
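For concreteness, the detect-rewind-hint loop proposed above might look roughly like this. `next_chunk` and `violates` are hypothetical stand-ins for a chunked sampler and a regex detector, and the sketch dodges the hard part flagged in the reply (placing the comment validly) by assuming chunks end on line boundaries.

```python
from typing import Callable

HINT = "# Don't use regex here\n"

def generate_with_rewind(
    next_chunk: Callable[[str], str],
    violates: Callable[[str], bool],
    max_chunks: int = 20,
    max_rewinds: int = 3,
) -> str:
    """Generate chunk by chunk. When a new chunk violates the constraint, drop it,
    append a hint comment to the prefix, and continue from there.

    Once max_rewinds is used up, violating chunks are accepted as-is.
    """
    text = ""
    rewinds = 0
    for _ in range(max_chunks):
        chunk = next_chunk(text)
        if chunk == "":
            break  # the sampler signalled end of generation
        if violates(chunk) and rewinds < max_rewinds:
            rewinds += 1
            if not text.endswith(HINT):
                text += HINT
            continue
        text += chunk
    return text

def toy_model(prefix: str) -> str:
    """Stand-in sampler: reaches for a regex unless the hint is in its prefix."""
    if prefix.count("\n") >= 2:
        return ""
    if HINT in prefix:
        return "ok = s.isdigit()\n"
    return "ok = re.match('[0-9]+', s)\n"

out = generate_with_rewind(toy_model, violates=lambda chunk: "re." in chunk)
print(out)
```

Each rewind only discards one chunk rather than the whole response, which is where the savings over plain rejection sampling would come from if the prefix can be reused cheaply.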
Apparently fuzz tests that used regexes were an issue in practice for Benchify (the company that ran into this problem). From the blog post:
Benchify observed that the model was much more likely to generate a test with no false positives when using string methods instead of regexes, even if the test coverage wasn’t as extensive.
Isn’t every instance of clamping a feature’s activation to 0 conditional in this sense?
That’s technically even more conditional, as the intervention (subtracting the parallel component) also depends on the residual stream. But yes, I think it’s reasonable to lump these together: orthogonalisation should also be fairly non-destructive unless the direction was present, while steering likely always has side effects.
If you have kv caching between inference calls, this shouldn’t be a big cost.
why wouldn’t you want regexes?
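Apparently fuzz tests that used regexes were an issue in practice for Benchify (the company that ran into this problem): per the blog post, the model was much more likely to generate a test with no false positives when using string methods instead of regexes.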