I appreciate this post. Emphasizing a couple things and providing some other commentary/questions on the paper (as there doesn’t seem to be a better top level post for it) (I have not read paper deeply and could be missing things):
I find the Twitter vote brigading to be annoying and slightly bad for collective epistemics. I do not think this paper was particularly good, and it did not warrant the attention it got. (The main flaws IMO are a lack of (empirical) comparison to other methods — except a brief interlude in the appendix; and lack of any benchmarking — for example testing if clamping sycophancy features affects performance on sycophancy benchmarks)
At an object level, one concerning-to-me result is that there doesn’t appear to be a clean gradient in the presence of a feature over the range of activation values. You might hope that if you take the AI risk feature[1], and look at dataset examples that span its activation values (as the tool does), you would see highly activating text be very related to AI risk and low activating text be only slightly related. I think that pattern is weak — there are at least some low-activation examples that are highly related to AI risk, such as ‘...”It’s what they’re programmed to do.” “Destroy all technology other than their own”’ (cherrypicked by me). This is related to sensitivity, which the paper mentions is difficult to study in this context (before mentioning one cherry-picked result). I care about this because: one way to use SAEs for safety is as a classifier for malicious behavior (be checking if model activations correspond to dangerous features); this would really benefit from having a nice smooth relationship between feature activation magnitude and actual feature presence, and it pretty much needs to have high sensitivity. Given the existence of highly-feature-related samples in the bottom activation interval, I feel fairly worried that sensitivity is poor, and that it will be hard to do magnitude-based thresholds — it pretty much looks like 0 is the reasonable threshold given these results.
In the paper this is labeled with “The concept of an advanced AI system causing unintended harm or becoming uncontrollable and posing an existential threat to humanity”
I appreciate this post. Emphasizing a couple things and providing some other commentary/questions on the paper (as there doesn’t seem to be a better top level post for it) (I have not read paper deeply and could be missing things):
I find the Twitter vote brigading to be annoying and slightly bad for collective epistemics. I do not think this paper was particularly good, and it did not warrant the attention it got. (The main flaws IMO are a lack of (empirical) comparison to other methods — except a brief interlude in the appendix; and lack of any benchmarking — for example testing if clamping sycophancy features affects performance on sycophancy benchmarks)
At an object level, one concerning-to-me result is that there doesn’t appear to be a clean gradient in the presence of a feature over the range of activation values. You might hope that if you take the AI risk feature[1], and look at dataset examples that span its activation values (as the tool does), you would see highly activating text be very related to AI risk and low activating text be only slightly related. I think that pattern is weak — there are at least some low-activation examples that are highly related to AI risk, such as ‘...”It’s what they’re programmed to do.” “Destroy all technology other than their own”’ (cherrypicked by me). This is related to sensitivity, which the paper mentions is difficult to study in this context (before mentioning one cherry-picked result). I care about this because: one way to use SAEs for safety is as a classifier for malicious behavior (be checking if model activations correspond to dangerous features); this would really benefit from having a nice smooth relationship between feature activation magnitude and actual feature presence, and it pretty much needs to have high sensitivity. Given the existence of highly-feature-related samples in the bottom activation interval, I feel fairly worried that sensitivity is poor, and that it will be hard to do magnitude-based thresholds — it pretty much looks like 0 is the reasonable threshold given these results.
In the paper this is labeled with “The concept of an advanced AI system causing unintended harm or becoming uncontrollable and posing an existential threat to humanity”