I only read through the beginning and their control safety case. Some thoughts:
The honeypot case they described is similar to the upfront auditing protocol, but theirs is more complicated. I don’t think you need to generate honeypots to evaluate a model. You can use the first X% of scenarios at deployment time and audit them with humans. Both protocols are beaten by the same red team strategy of “attack with a very low probability,” so they shouldn’t be used natively.
I don’t like the term “monitoring game” because I think control evals will have much more than monitoring in the future.
I think elicitation of the model will eventually matter when the models are widely better than humans at most tasks, but it doesn’t matter that much now. Your control protocols should be able to prevent the best adversary you can simulate from breaking them. In the beginning, this will be humans with access to models, so anywhere the model lacks performance, the human should make up for it. For example, in [P2.6] Red-Blue Game, they state that the read team needs to produce a sleeper agent that defects. Assuming that a sleeper agent is a single model, I think this is too complicated for the red team. The red team should be able to produce a scaffolded model that uses logic written in Python. In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn’t done in practice since it is expensive and slow.
I don’t think one needs to generate hyper-realistic-looking scenarios for monitoring, and I’m confused by their reasoning for believing one does. I guess it depends on the fact that they need to elicit the best possible capabilities from the model during the eval, but I don’t think this is necessary for the reasons I stated above.
General Thoughts: I think this is pretty good and shows knowledge of the core points of control. I think Antrhopic’s control protocols are easily beaten, and there are much better things to do. They made it seem like control is much harder to implement than I think it will be. I don’t think a perfect Elicitation Strategy or Simulation Procedure is needed for control to work in the settings laid out. Once we have much scarier AIs and deploy them in more crazy environments, we will likely need better techniques, but their setting seems to be before this.
You can use the first X% of scenarios at deployment time and audit them with humans.
Only if you don’t have privacy guarantees?
Also, for the honeypots, you can potentially generate vast numbers of honeypots, use a trusted model to review outputs, and then only have humans look at the cases where the trusted model thinks the output is suspicious.
Your control protocols should be able to prevent the best advisory you can simulate from breaking them.
Surely this will be unnecessarily conservative a bunch of the time?
In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn’t done in practice since it is expensive and slow.
I agree if there are privacy guarantees, then you can’t get data from the real world, so you’d need honeypots. I guess I’m much more scared of the internal deployments where I don’t think privacy guarantees are as likely. I’m trying to respond to someone who is like, “Control seems hard because you have to generate real-world looking input, and that seems hard” and I’m like “You can just use the real-world inputs and filter then with a monitor to find the most scary ones and perform auditing”
Yeah, I guess I’m saying that if you don’t have a great way to elicit the capabilities from the model, you can substitute that by assuming that you are going against the best possible adversary you can make. This is much more conservative, but I don’t think it makes control impossible to implement. In practice, I think it looks like finding the maximum capabilities of the model using your Elecitation Strategy, then boosting its capabilities even more with a scaffold and some clever red team tricks.
I only read through the beginning and their control safety case. Some thoughts:
The honeypot case they described is similar to the upfront auditing protocol, but theirs is more complicated. I don’t think you need to generate honeypots to evaluate a model. You can use the first X% of scenarios at deployment time and audit them with humans. Both protocols are beaten by the same red team strategy of “attack with a very low probability,” so they shouldn’t be used natively.
I don’t like the term “monitoring game” because I think control evals will have much more than monitoring in the future.
I think elicitation of the model will eventually matter when the models are widely better than humans at most tasks, but it doesn’t matter that much now. Your control protocols should be able to prevent the best adversary you can simulate from breaking them. In the beginning, this will be humans with access to models, so anywhere the model lacks performance, the human should make up for it. For example, in [P2.6] Red-Blue Game, they state that the read team needs to produce a sleeper agent that defects. Assuming that a sleeper agent is a single model, I think this is too complicated for the red team. The red team should be able to produce a scaffolded model that uses logic written in Python. In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn’t done in practice since it is expensive and slow.
I don’t think one needs to generate hyper-realistic-looking scenarios for monitoring, and I’m confused by their reasoning for believing one does. I guess it depends on the fact that they need to elicit the best possible capabilities from the model during the eval, but I don’t think this is necessary for the reasons I stated above.
General Thoughts:
I think this is pretty good and shows knowledge of the core points of control. I think Antrhopic’s control protocols are easily beaten, and there are much better things to do. They made it seem like control is much harder to implement than I think it will be. I don’t think a perfect Elicitation Strategy or Simulation Procedure is needed for control to work in the settings laid out. Once we have much scarier AIs and deploy them in more crazy environments, we will likely need better techniques, but their setting seems to be before this.
Only if you don’t have privacy guarantees?
Also, for the honeypots, you can potentially generate vast numbers of honeypots, use a trusted model to review outputs, and then only have humans look at the cases where the trusted model thinks the output is suspicious.
Surely this will be unnecessarily conservative a bunch of the time?
Or it is overly conservative?
You might not need privacy guarantees for internal deployments ofc
I agree if there are privacy guarantees, then you can’t get data from the real world, so you’d need honeypots. I guess I’m much more scared of the internal deployments where I don’t think privacy guarantees are as likely. I’m trying to respond to someone who is like, “Control seems hard because you have to generate real-world looking input, and that seems hard” and I’m like “You can just use the real-world inputs and filter then with a monitor to find the most scary ones and perform auditing”
Yeah, I guess I’m saying that if you don’t have a great way to elicit the capabilities from the model, you can substitute that by assuming that you are going against the best possible adversary you can make. This is much more conservative, but I don’t think it makes control impossible to implement. In practice, I think it looks like finding the maximum capabilities of the model using your Elecitation Strategy, then boosting its capabilities even more with a scaffold and some clever red team tricks.