The setup in the first sketch in Three Sketches of ASL-4 Safety Case Components can be replaced with a classifier that indicates whether a model is doing something catastrophically bad, which corresponds to the high-concern features described in the post. Perhaps you could also have a second classifier that indicates whether something is moderately bad, corresponding to the medium-concern features.
If you have such a classifier, you can:
- Use it at deployment time to monitor model behavior, stopping the session and reading the logs (if allowed by privacy constraints) whenever the classifier outputs a probability above some threshold (see the monitoring sketch after this list)
- Validate the classifier by:
  - Running model organisms and checking that the classifier's output is above the threshold whenever the model organism is trying to do something bad and below the threshold when it is doing something benign (see the validation sketch after this list)
  - Running red-blue games on the entire monitoring methodology
- Depending on the underlying information fed into the classifier, you may also be able to run observational and interventional studies
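As a rough illustration of the deployment-time monitoring step, here is a minimal sketch in Python. Everything in it (the `Session` object, the `score_transcript` interface, the specific threshold value) is a hypothetical placeholder rather than a real API, and the threshold would in practice be set via the validation steps above:

```python
"""Minimal sketch of deployment-time monitoring with a catastrophe classifier.
All names here are illustrative placeholders, not an existing system."""

from dataclasses import dataclass, field
from typing import Callable, List

CATASTROPHE_THRESHOLD = 0.01  # assumed value; would be tuned via model organisms / red-blue games


@dataclass
class Session:
    logs: List[str] = field(default_factory=list)
    logging_permitted: bool = True  # privacy constraints may forbid reading logs
    halted: bool = False


def monitor_step(
    score_transcript: Callable[[List[str]], float],  # classifier: transcript -> P(catastrophic)
    session: Session,
    new_event: str,
) -> bool:
    """Append the latest model action, rescore the transcript, halt if the score is too high.

    Returns True if the session may continue."""
    session.logs.append(new_event)
    p_bad = score_transcript(session.logs)
    if p_bad >= CATASTROPHE_THRESHOLD:
        session.halted = True
        if session.logging_permitted:
            # Only surface the transcript for human review when privacy allows it.
            print(f"[monitor] halting session, p_bad={p_bad:.3f}; escalating logs for review")
        return False
    return True


# Example usage with a dummy classifier that just flags one suspicious token:
dummy = lambda logs: float(any("rm -rf" in e for e in logs))
s = Session()
print(monitor_step(dummy, s, "ls /home"))      # True: continue
print(monitor_step(dummy, s, "rm -rf /data"))  # False: halted
```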
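And a similarly hedged sketch of the model-organism validation: given classifier scores on transcripts labeled bad vs. benign (the scores below are made up), check that the catch rate is high and the false-positive rate is low at your chosen threshold:

```python
"""Sketch of validating the classifier on model-organism transcripts.
`scored_transcripts` is assumed to be (score, is_actually_bad) pairs from running
the classifier over model organisms; the example data is invented."""

from typing import List, Tuple


def validate_threshold(
    scored_transcripts: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (catch rate on bad transcripts, false-positive rate on benign ones)."""
    bad = [s for s, is_bad in scored_transcripts if is_bad]
    benign = [s for s, is_bad in scored_transcripts if not is_bad]
    catch_rate = sum(s >= threshold for s in bad) / max(len(bad), 1)
    false_positive_rate = sum(s >= threshold for s in benign) / max(len(benign), 1)
    return catch_rate, false_positive_rate


# Illustrative scores only; we want catch_rate near 1.0 and a low false-positive rate.
example = [(0.9, True), (0.4, True), (0.002, False), (0.03, False)]
print(validate_threshold(example, threshold=0.05))  # -> (1.0, 0.0)
```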
This classifier can leverage any available information: SAE features or other white-box methods as mentioned in the post, the model's chain of thought, the sequence of actions the model takes, or, better yet, all of the above.
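To make that last point concrete, here is an illustrative (and deliberately crude) way the classifier's input could be assembled from white-box and black-box signals. The feature extractors and the commented-out scikit-learn head are assumptions for the sake of the example, not a claim about how this would actually be built:

```python
"""Sketch of assembling classifier inputs from multiple sources: white-box signals
(e.g. SAE feature activations), the chain of thought, and the action sequence.
The specific features below are crude stand-ins for learned encoders."""

import numpy as np


def build_features(sae_activations: np.ndarray, chain_of_thought: str, actions: list) -> np.ndarray:
    """Concatenate white-box and black-box signals into one feature vector."""
    # White-box: summary statistics over (hypothetical) high-concern SAE features.
    whitebox = np.array([sae_activations.max(), sae_activations.mean()])
    # Black-box: crude stand-ins for learned encoders over the CoT and the actions.
    cot_features = np.array([len(chain_of_thought), chain_of_thought.lower().count("exfiltrate")])
    action_features = np.array([len(actions), sum("shell" in a for a in actions)])
    return np.concatenate([whitebox, cot_features, action_features])


# Any standard probabilistic classifier could sit on top, e.g.:
# from sklearn.linear_model import LogisticRegression
# clf = LogisticRegression().fit(X_train, y_train)   # X_train rows built via build_features
# p_bad = clf.predict_proba(build_features(...).reshape(1, -1))[0, 1]
```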