Great questions. Finding features of high concern is indeed something we highlighted as an open problem, and it’s not obvious whether this will be achievable. The avenues you list are plausible ones. It might also turn out that it is very hard to find individual features that are sufficiently indicative of malign behavior, but that one could do so for circuits, or specify some other rule involving combinations of features. E.g., the criterion might be something like “a feature indicating the intent to carry out a harmful plan is active, and features indicating attribution of the thoughts to a fictional character are not active.” (As discussed in Limitations, one could apply a variant of this safety case based on circuits rather than features, and the logic would be similar.)
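To make the shape of such a combined rule concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the feature names, the activation threshold, and the idea that activations arrive as a name-to-value mapping are hypothetical, not something we have implemented or validated.

```python
from typing import Mapping

# Hypothetical feature names and threshold, chosen only for illustration.
HARMFUL_INTENT = "intent_to_carry_out_harmful_plan"
FICTION_ATTRIBUTION = ("thoughts_attributed_to_fictional_character",)
THRESHOLD = 0.5


def is_high_concern(
    activations: Mapping[str, float], threshold: float = THRESHOLD
) -> bool:
    """Flag when the harmful-intent feature is active and no
    fiction-attribution feature is active."""
    # A feature counts as "active" if its activation exceeds the threshold.
    intent_active = activations.get(HARMFUL_INTENT, 0.0) > threshold
    fiction_active = any(
        activations.get(name, 0.0) > threshold for name in FICTION_ATTRIBUTION
    )
    return intent_active and not fiction_active
```

A rule over circuits would presumably have the same logical form, with the inputs being circuit-level signals rather than individual feature activations.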
We’re agnostic about where the model organisms would come from, but I like your suggestions.