My vibe from this post is something like “we’re making progress on stuff that could be helpful, so there’s stuff to work on!” and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you’re likely not touching on important cruxes (e.g. do these approaches really scale? Are some agendas capabilities-enhancing? Will these solve deceptive alignment or just corrigible alignment?)
I also think that if the goal is to actually make progress, and not to maximize the number of people making progress or who feel like they’re making progress, then engaging with those cruxes is important before people invest substantive energy (i.e. beyond upskilling). However, as a directional update for people who are otherwise pretty cynical, this seems like a good update.
I wrote in one of my footnotes:
You may object that RLHF is mostly capabilities. I also tend to think about it as being primarily a capabilities advance, but it is an advance in alignment as well, i.e. it belongs in the reference class when figuring out the difficulty of making progress on alignment.
Regarding scalability, I wrote:
I suspect that all of these approaches are still very far away from where we need to be. I consider them substantial advances nonetheless for two key reasons: having a baseline helps people choose an appropriate level of ambition, and also makes it easier to empirically discover the key issues in solving a problem more fully.
Maybe that addresses the scaling concern to some extent, although definitely not completely. It’s also possible that some of these techniques are effectively a trap, in that they’d appear to work up to a certain level, so we’d start to rely on them and then we’d end up getting screwed.
Has anyone developed a metric for quantifying the level of linearity versus nonlinearity of a model’s representations? A metric like that would let us compare the levels of linearity for models of different sizes, which would help us extrapolate whether interpretability and alignment techniques that rely on approximate linearity will scale to larger models.
A simple way to quantify this: first define a “feature” as some decision boundary over the data domain, then train a linear classifier to predict that decision boundary from the network’s activations on that data. Quantify the “linearity” of the feature in the network as the accuracy that the linear classifier achieves.
For example, train a classifier to detect when some text has positive or negative sentiment, then pass the same text through some pretrained LLM (e.g. BERT) whose “feature-linearity” you’re trying to measure, and try to predict the sentiment from BERT’s activation vectors using a linear classifier such as logistic regression. The accuracy of this linear model tells you how linear the “sentiment” feature is in your LLM.
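To make the proposal above concrete, here is a minimal sketch of the probe-accuracy metric. It assumes scikit-learn and uses synthetic arrays as stand-ins for real activations, so nothing here comes from BERT or any actual model; the function name `linear_probe_accuracy` and both toy datasets are hypothetical. One feature is encoded along a single direction, the other is an XOR of two coordinates, which is present in the activations but not linearly decodable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def linear_probe_accuracy(activations, labels, seed=0):
    """Held-out accuracy of a linear classifier trained on the activations."""
    X_train, X_test, y_train, y_test = train_test_split(
        activations, labels, test_size=0.25, random_state=seed, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)


# Synthetic stand-ins for activations (assumption: not taken from a real model).
rng = np.random.default_rng(0)
n, dim = 2000, 32

# Case 1: the label shifts activations along one direction (linearly encoded).
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
linear_acts = rng.normal(size=(n, dim)) + 3.0 * np.outer(2 * labels - 1, direction)

# Case 2: the label is the XOR of the signs of two coordinates (nonlinear).
xor_acts = rng.normal(size=(n, dim))
xor_labels = ((xor_acts[:, 0] > 0) ^ (xor_acts[:, 1] > 0)).astype(int)

linear_score = linear_probe_accuracy(linear_acts, labels)
xor_score = linear_probe_accuracy(xor_acts, xor_labels)
print(f"linearly encoded feature: {linear_score:.3f}")
print(f"XOR-encoded feature:      {xor_score:.3f}")
```

With real models you would replace the synthetic arrays with activations from a chosen layer and run the same probe across model sizes. A low score only says the feature is not linearly decodable at that layer; it does not by itself say the model lacks the feature.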
IMO the most useful version of this would be to get empirical evidence on techniques. E.g. erasing certain concepts using LEACE and seeing if they can inhibit the model’s use of those concepts including during further training. It seems hard to ensure otherwise that there is not some gap between your definitions and reality.
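A rough sketch of the first half of that experiment, checking whether a linear probe can still recover a concept after erasure, might look like the following. Note the swap: this is not LEACE. It uses a simpler orthogonal projection that removes the direction between the two class means, and it runs on synthetic data. The helper names are hypothetical, and it does not test whether the concept comes back during further training, which is the harder part of the suggestion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def concept_direction(X, z):
    """Unit vector between the class means of a binary concept z."""
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    return d / np.linalg.norm(d)


def project_out(X, d):
    """Remove the component of each row of X along the unit vector d."""
    return X - np.outer(X @ d, d)


def probe_accuracy(X_train, z_train, X_test, z_test):
    """Test accuracy of a logistic-regression probe for the concept."""
    probe = LogisticRegression(max_iter=1000).fit(X_train, z_train)
    return probe.score(X_test, z_test)


# Synthetic "activations" with a binary concept encoded along one direction.
rng = np.random.default_rng(0)
n, dim = 4000, 16
z = rng.integers(0, 2, size=n)
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
X = rng.normal(size=(n, dim)) + 2.0 * np.outer(2 * z - 1, true_dir)

X_train, X_test = X[:3000], X[3000:]
z_train, z_test = z[:3000], z[3000:]

# Estimate the erasure direction on the training split only.
d_hat = concept_direction(X_train, z_train)

acc_before = probe_accuracy(X_train, z_train, X_test, z_test)
acc_after = probe_accuracy(
    project_out(X_train, d_hat), z_train, project_out(X_test, d_hat), z_test
)
print(f"probe accuracy before erasure: {acc_before:.3f}")
print(f"probe accuracy after erasure:  {acc_after:.3f}")
```

Equalizing the class means is enough to defeat linear probes in this toy setup, but a nonlinear probe, or a model that is trained further on the erased activations, could still recover the concept. That gap is what the empirical test proposed above would measure.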
Hmm… I suppose this is pretty good evidence that CCS may not be as promising as it first appeared, esp. the banana/shed results.
https://www.lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
Update: Seems like the banana results are being challenged.
Has anyone developed a metric for quantifying the level of linearity versus nonlinearity of a model’s representations? A metric like that would let us compare the levels of linearity for models of different sizes, which would help us extrapolate whether interpretability and alignment techniques that rely on approximate linearity will scale to larger models.
I don’t know, but would love to find out.
I asked on Discord and someone told me this:
IMO the most useful version of this would be to get empirical evidence on techniques. E.g. erasing certain concepts using LEACE and seeing if they can inhibit the model’s use of those concepts including during further training. It seems hard to ensure otherwise that there is not some gap between your definitions and reality.