Tyler Tracy comments on Anthropic: Three Sketches of ASL-4 Safety Case Components

Tyler Tracy 6 Nov 2024 20:10 UTC
1 point
0
I agree if there are privacy guarantees, then you can’t get data from the real world, so you’d need honeypots. I guess I’m much more scared of the internal deployments where I don’t think privacy guarantees are as likely. I’m trying to respond to someone who is like, “Control seems hard because you have to generate real-world looking input, and that seems hard” and I’m like “You can just use the real-world inputs and filter then with a monitor to find the most scary ones and perform auditing”

Yeah, I guess I’m saying that if you don’t have a great way to elicit the capabilities from the model, you can substitute that by assuming that you are going against the best possible adversary you can make. This is much more conservative, but I don’t think it makes control impossible to implement. In practice, I think it looks like finding the maximum capabilities of the model using your Elecitation Strategy, then boosting its capabilities even more with a scaffold and some clever red team tricks.