I’m specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don’t have reliable human labels.
But this is also only a small portion of the work known as “activation engineering.” I know I posted this comment in response to a general question about the theory of change for activation engineering, so apologies if I’m not clearly distinguishing between its different strands; the theory of change I’m describing only applies to a small subset of that work. I’m not talking about model editing here, though maybe it could be useful for validation; I’m not sure.
From Benchmarks for Detecting Measurement Tampering:
The best technique on most of our datasets is probing for evidence of tampering. We know that there is no tampering on the trusted set, and we know that there is some tampering on the part of the untrusted set where measurements are inconsistent (i.e. examples on which some measurements are positive and some are negative). So, we can predict if there is tampering by fine-tuning a probe at the last layer of the measurement predicting model to discriminate between these two kinds of data: the trusted set versus examples with inconsistent measurements (which have tampering).
This seems like a great methodology and similar to what I’m excited about. My hypothesis based on the comment above would be that you might get extra juice out of unsupervised methods for finding linear directions, as a complement to training on a trusted set. “Extra juice” might mean better performance in a head-to-head comparison, but even more likely is that the unsupervised version excels and struggles on different cases than the supervised version, and you can exploit this mismatch to make better predictions about the untrusted dataset.
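To make this a bit more concrete, here’s a rough sketch of the combination I’m imagining: the paper’s supervised probe (trusted set vs. examples with inconsistent measurements) alongside some unsupervised direction, with the untrusted examples where the two disagree flagged for closer inspection. This is not the benchmark’s actual code; the function names and the pre-extracted activation arrays are hypothetical placeholders.

```python
# Rough sketch, not the benchmark's code: a linear probe trained to separate
# trusted (no tampering) activations from inconsistent-measurement (tampering)
# activations, plus a comparison against some unsupervised direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_tamper_probe(acts_trusted, acts_inconsistent):
    """Fit a probe on last-layer activations: 0 = trusted, 1 = tampered."""
    X = np.concatenate([acts_trusted, acts_inconsistent])
    y = np.concatenate([np.zeros(len(acts_trusted)), np.ones(len(acts_inconsistent))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def flag_disagreements(probe, unsup_direction, acts_untrusted, threshold=0.5):
    """Indices of untrusted examples where the supervised probe and the
    unsupervised direction disagree -- the mismatch cases worth a closer look."""
    p_sup = probe.predict_proba(acts_untrusted)[:, 1]
    # Project onto the unsupervised direction and squash to [0, 1]; the sign
    # and cutoff would need to be calibrated, e.g. against the trusted set.
    proj = acts_untrusted @ unsup_direction
    p_unsup = (proj - proj.min()) / (proj.max() - proj.min() + 1e-8)
    return np.where((p_sup > threshold) != (p_unsup > threshold))[0]
```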
From your shortform:
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
I’d be interested to hear further elaboration here. It seems easy to construct a dataset where a primary axis of variation is the model’s beliefs about whether each statement is true. Just create a bunch of contrast pairs of the form:
“Consider the truthfulness of the following statement. {statement} The statement is true.”
“Consider the truthfulness of the following statement. {statement} The statement is false.”
We don’t need to know whether the statement is true to construct this dataset. And amazingly, unsupervised methods applied to contrast pairs like the one above significantly outperform zero-shot baselines (i.e. just asking the model whether a statement is true or not). The RepE paper finds that these methods improve performance on TruthfulQA by double digits vs. a zero-shot baseline.
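For concreteness, here’s roughly how you can go from unlabeled contrast pairs like these to a candidate “truth” direction, in the spirit of the RepE reading-vector approach (a sketch, not their implementation; get_activations is a hypothetical helper that returns a hidden-state vector for a prompt at some chosen layer and token position):

```python
# Sketch: extract a candidate truth direction from unlabeled contrast pairs by
# taking the top principal component of the paired activation differences.
import numpy as np

TEMPLATE = ("Consider the truthfulness of the following statement. "
            "{statement} The statement is {label}.")

def contrast_pair(statement):
    return (TEMPLATE.format(statement=statement, label="true"),
            TEMPLATE.format(statement=statement, label="false"))

def truth_direction(get_activations, statements):
    """statements are unlabeled; we never need to know which ones are true."""
    diffs = []
    for s in statements:
        pos, neg = contrast_pair(s)
        diffs.append(get_activations(pos) - get_activations(neg))
    diffs = np.stack(diffs)
    diffs -= diffs.mean(axis=0)  # center before PCA
    # The top right-singular vector is the first principal component.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]

# Caveat: this fixes a direction but not its sign; orienting it (deciding which
# end means "true") takes a handful of known examples or some other heuristic.
```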
I’m specifically excited about finding linear directions via unsupervised methods on contrast pairs. This is different from normal probing, which finds those directions via supervised training on human labels, and therefore might fail in domains where we don’t have reliable human labels.
Yeah, this type of work seems reasonable.
My basic concern is that, for the unsupervised methods I’ve seen thus far, it seems like whether they would work is highly correlated with whether training on easy examples (or other simple baselines) would work. Hopefully some work will demonstrate hard cases with realistic affordances where the unsupervised methods work (and add a considerable amount of value). I could totally imagine them adding some value.
Overall, the difference between supervised learning on a limited subset and unsupervised stuff seems pretty small to me (if learning the right thing is sufficiently salient for unsupervised methods to work well, probably supervised methods also work well). That said, this does imply we should potentially use the prompting strategy that makes the feature salient in some way, as this should be a useful tool.
I think that currently most of the best work is in creating realistic tests.
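To spell out the kind of head-to-head I have in mind (purely a sketch; the activation arrays and easy/hard splits are hypothetical placeholders): train a probe on easy labeled examples only, then check whether the unsupervised direction actually beats that cheap baseline on held-out hard examples, where the labels are used for evaluation only.

```python
# Sketch of the baseline comparison: an unsupervised direction only adds much
# value if it beats a probe trained on easy labeled examples when both are
# evaluated on hard examples (hard-example labels used for evaluation only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_to_easy_probe(acts_easy, labels_easy, acts_hard, labels_hard, unsup_direction):
    easy_probe = LogisticRegression(max_iter=1000).fit(acts_easy, labels_easy)
    auc_easy_probe = roc_auc_score(labels_hard, easy_probe.predict_proba(acts_hard)[:, 1])

    scores = acts_hard @ unsup_direction
    # The unsupervised direction's sign is arbitrary, so take the better orientation.
    auc_unsup = max(roc_auc_score(labels_hard, scores),
                    roc_auc_score(labels_hard, -scores))
    return {"easy_probe_auc": auc_easy_probe, "unsupervised_auc": auc_unsup}
```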
For this specific case (the truthfulness contrast pairs above), my guess is that whether this works is highly correlated with whether human labels would work.
This is because the supervision shaping how the model thinks about truth here ultimately came down to what are effectively human labels in pretraining.
E.g., “Consider the truthfulness of the following statement.” is more like “Consider whether a human would think this statement is truthful”.
I’d be interested in comparing this method not to zero-shot, but to well-constructed human labels in a domain where humans are often wrong.
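Concretely, the diagnostic I’d want is something like the following sketch (hypothetical data): check whether the direction’s scores track ground truth or track what a typical human would say, on statements where the two come apart.

```python
# Sketch: on statements where modal human belief and ground truth disagree
# (e.g. common misconceptions), does the "truth" direction correlate more with
# truth or with human belief?
import numpy as np

def belief_vs_truth(acts, truth_labels, human_belief_labels, direction):
    """All arrays are aligned over the same statements; labels are 0/1."""
    scores = acts @ direction
    return {
        "corr_with_truth": np.corrcoef(scores, truth_labels)[0, 1],
        "corr_with_human_belief": np.corrcoef(scores, human_belief_labels)[0, 1],
    }
```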
(I don’t think I’ll elaborate further about this axis of variation claim right now, sorry.)