the heart has been optimized (by evolution) to pump blood; that’s a sense in which its purpose is to pump blood.
Should we expect any components like that inside neural networks?
Is there any optimization pressure on any particular subcomponent? You can have perfect object recognition using components like “horse or night cow or submarine or a back leg of a tarantula” provided that there is enough of them and they are neatly arranged.
Interesting post, thx!
Regarding your attempt at “backdoor awareness” replication. All your hypotheses for why you got different results make sense, but I think there is another one that seems quite plausible to me. You said:
Now, claiming you have a backdoor seems hmmm, quite unsafe? Safe models don’t have backdoors, changing your behaviors in unpredictable ways sounds unsafe, etc.
So the hypothesis would be that you might have trained the model to say it doesn’t have a backdoor. Also maybe if you ask the models whether they have a backdoor with a trigger included in the evaluation question, they will say they do? I have a vague memory of trying that, but I might misremember.
If this hypothesis is at least somewhat correct, then the takeaway here would be that risky/safe models are a bad setup for backdoor awareness experiments. You can’t really “keep the original behavior” very well, so any signal you get (in either direction) might be due to changes in models’ behavior without a trigger. Maybe the myopic models from the appendix would be better here? But we haven’t tried that with backdoors at all.