Nicky Pochinkov
NickyP
Confusing the metric for the meaning: Perhaps correlated attributes are “natural”
I wonder how many of these orthogonal vectors are “actually orthogonal” in effect, once we consider that we are adding two vectors together (the steering vector and the existing activation), and that the model has things like LayerNorm.
If one conditions on downstream mid-layer activations being “sufficiently different”, it seems possible one would find something like 10x degeneracy in the actual effects these vectors have on the model. (A possibly relevant factor: how large is the original activation vector compared to the steering vector?)
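A rough sketch (not from the original discussion) of how one might test this, using GPT-2 via HuggingFace transformers as a stand-in; the layer indices and steering scale are arbitrary placeholders, and each block’s output is assumed to be a tuple whose first element is the hidden states:

```python
# Add two (near-)orthogonal steering vectors at one layer and compare how
# different the downstream mid-layer activations actually end up.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")

STEER_LAYER, READ_LAYER, SCALE = 4, 8, 8.0   # placeholder choices
d_model = model.config.n_embd

v1 = torch.randn(d_model)
v2 = torch.randn(d_model)
v2 -= (v2 @ v1) / (v1 @ v1) * v1             # make v2 orthogonal to v1
captured = {}

def make_steer_hook(v):
    def hook(module, inp, out):
        # record the size of the residual stream where the vector is added
        captured["steer_site_norm"] = out[0].norm(dim=-1).mean().item()
        return (out[0] + v.to(out[0].dtype),) + out[1:]
    return hook

def read_hook(module, inp, out):
    captured["act"] = out[0].detach().clone()

def downstream_act(v):
    h1 = model.transformer.h[STEER_LAYER].register_forward_hook(make_steer_hook(v))
    h2 = model.transformer.h[READ_LAYER].register_forward_hook(read_hook)
    with torch.no_grad():
        model(**tok("The quick brown fox jumps over the lazy dog", return_tensors="pt"))
    h1.remove(); h2.remove()
    return captured["act"]

base = downstream_act(torch.zeros(d_model))
d1 = downstream_act(SCALE * v1 / v1.norm()) - base
d2 = downstream_act(SCALE * v2 / v2.norm()) - base

cos = torch.nn.functional.cosine_similarity(d1.flatten(), d2.flatten(), dim=0)
print(f"cosine similarity of downstream effects: {float(cos):.3f}")
print(f"activation norm at steering site vs steering norm: "
      f"{captured['steer_site_norm']:.1f} vs {SCALE}")
```

If the downstream deltas stay highly similar across many nominally orthogonal choices of the second vector, that would be evidence for the kind of degeneracy described above.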
Comparing Quantized Performance in Llama Models
I think there are already some papers doing similar work, though usually sold as reducing inference costs. For example, the MoEfication paper and Contextual Sparsity paper could probably be modified for this purpose.
Sorry! I have fixed this now
In case anyone finds it difficult to go through all the projects, I have made a longer post where each project title is followed by a brief description, and a list of the main skills/roles they are looking for.
See here: https://www.lesswrong.com/posts/npkvZG67hRvBneoQ9
AISC 2024 - Project Summaries
AISC Project: Modelling Trajectories of Language Models
Cadenza Labs has some video explainers on interpretability-related concepts: https://www.youtube.com/@CadenzaLabs
For example, an intro to Causal Scrubbing:
Machine Unlearning Evaluations as Interpretability Benchmarks
Ideation and Trajectory Modelling in Language Models
Seems to work fine for me, but here are the links to Market One, Market Two and Market Three from the post. (They show the % of customer funds expected to be returned: 46%, 43% and 42% at the time of this comment.)
Maybe I’m not fully understanding, but one issue I see is that without requiring “perfect prediction”, one could potentially Goodhart on the proposal. I could imagine something like:
In training GPT-5, add a term that upweights very basic bigram statistics. In “evaluation”, use your bigram statistics table to “predict” most top-k outputs just well enough to pass.
This would probably have a negative impact on performance, but it could possibly be tuned to be just sufficient to pass. Alternatively, one could use a toy model trained on the side that is easy to understand, and regularise the predictions towards that toy model instead of exact bigram statistics: just enough to pass the test, while still only understanding the toy model.
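To make the worry concrete, here is a minimal, hypothetical sketch of such a term: a KL penalty pulling the model’s next-token distribution toward a precomputed bigram table, with `lam` playing the role of “tuned just enough to pass”. The vocabulary size, table, and names are placeholders, not from any actual training setup.

```python
import torch
import torch.nn.functional as F

vocab, lam = 1_000, 0.1                       # small placeholder vocab; GPT-scale is ~50k
bigram_table = torch.rand(vocab, vocab)       # stand-in for P(next | current) counts
bigram_table /= bigram_table.sum(-1, keepdim=True)

def loss_with_bigram_regularizer(logits, input_ids, labels):
    # logits: (batch, seq, vocab); input_ids, labels: (batch, seq)
    ce = F.cross_entropy(logits.reshape(-1, vocab), labels.reshape(-1))
    bigram_target = bigram_table[input_ids]   # (batch, seq, vocab)
    kl = F.kl_div(F.log_softmax(logits, dim=-1), bigram_target, reduction="batchmean")
    return ce + lam * kl                      # lam tuned "just enough to pass"

# toy usage with random tensors
logits = torch.randn(2, 16, vocab)
input_ids = torch.randint(vocab, (2, 16))
labels = torch.randint(vocab, (2, 16))
print(loss_with_bigram_regularizer(logits, input_ids, labels))
```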
LLM Modularity: The Separability of Capabilities in Large Language Models
While I think this is important, and will probably edit the post, I think even in the unembedding, when getting the logits, the behaviour cares more about direction than distance.
When I think of distance, I implicitly think Euclidean distance:
$$d(x, w) = \sqrt{\textstyle\sum_i (x_i - w_i)^2}$$
But the actual “distance” used for calculating logits looks like this:
$$\text{logit}(x, w) = x \cdot w = \textstyle\sum_i x_i w_i$$
Which is a lot more similar to cosine similarity:
$$\cos(x, w) = \frac{x \cdot w}{\|x\|\,\|w\|}$$
I think that because the metric is so similar to the cosine similarity, it makes more sense to think in terms of sizes and directions instead of distances and points.
This is true. I think that visualising points on a (hyper-)sphere is fine, but it is difficult in practice to parametrise the points that way.
It is more that the vectors on the GPU look like lists of coordinates, $(x_1, x_2, \ldots, x_n)$, but the vectors in the model are treated more like a magnitude and a direction, $\|x\| \cdot \hat{x}$.
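A small numeric illustration of this (random placeholder weights, standard transformer shapes): the logit for token $i$ is the dot product with the unembedding row, i.e. the product of the two norms and the cosine between them, and the Euclidean distance between the two vectors plays no role.

```python
import torch

d_model, vocab = 768, 50_257
W_U = torch.randn(vocab, d_model)      # unembedding matrix (random stand-in)
x = torch.randn(d_model)               # final residual-stream vector

logits = W_U @ x                       # what the model actually computes
i = int(logits.argmax())

dot = W_U[i] @ x
cos = dot / (W_U[i].norm() * x.norm())
print(float(dot), float(W_U[i].norm() * x.norm() * cos))   # equal: logit = sizes * cosine
print(float(torch.dist(W_U[i], x)))                        # Euclidean distance: unused by logits
```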
LLM Basics: Embedding Spaces—Transformer Token Vectors Are Not Points in Space
Thanks for this comment! I think this is one of the main concerns I am pointing at.
I think something like fiscal aid could work, but have people tried modelling responses to things like this? It feels like with COVID the relatively decent response was because the government was both enforcing a temporary lockdown policy and sending checks to keep things “back to normal” despite it. If job automation is more gradual, on the scale of months to years, and specific only to certain jobs at a time, the response could be quite different, and it might be more likely that things end up poorly.
Yeah, though I think it depends on how many people are able to buy the new goods at a better price. If most well-paid employees (i.e. the employees that companies get the most value from automating) no longer have a job, then the number of people who can buy the more expensive goods and services might go down. It seems counter-intuitive to me that GDP would keep growing if the number of people who lost their jobs is high enough. It feels possible that the recent tech developments were only barely net positive for nominal GDP despite rapid improvements, and that fast enough technological progress could push nominal GDP in the other direction.
The unlearning results seem promising!
The author’s results from unlearning MMLU seem slightly rushed but moderately promising (I previously wrote a paper trying similar things; making good comparisons here is difficult). The results from unlearning different coding languages, however, seem very strong compared to my previous attempt: the model seems to be substantially more monosemantic.
I agree with your suspicion that the Gemma SAE performance was poor because of using reconstructed activations; this matches the drop in performance I got when I tried doing this (a rough sketch of the swap is below).
It would be interesting to see whether, e.g., steering performance from MONET expert directions is also comparable to that of SAEs. Using SAEs in practice is quite costly, so I would prefer an approach more similar to MONET.
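A minimal sketch of the reconstruction swap mentioned above, assuming a HuggingFace-style block whose output is a tuple and an SAE object exposing `encode`/`decode`; the layer index and `run_eval` are placeholders rather than anything from the post:

```python
import torch

def patch_with_reconstruction(block, sae):
    """Replace the block's output hidden states with their SAE reconstruction."""
    def hook(module, inp, out):
        hidden = out[0]
        recon = sae.decode(sae.encode(hidden))   # assumed SAE interface
        return (recon,) + out[1:]
    return block.register_forward_hook(hook)

# Example usage (placeholders):
# handle = patch_with_reconstruction(model.transformer.h[12], sae)
# acc_patched = run_eval(model)        # e.g. MMLU or a coding benchmark
# handle.remove()
# print(acc_clean - acc_patched)       # size of the reconstruction gap
```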