I would really like a one-way communication channel to various Anthropic teams so that I could submit potentially sensitive reports privately. For instance, I could send reports about observed behaviors in Anthropic’s models (or open-weights models) to the Frontier Red Team, who could then confirm the observations internally as they saw fit. I wouldn’t want non-target teams reading such messages.
I feel like I would have similar, but less sensitive, messages to send to the Alignment Stress-Testing team and others.
Currently, I do send messages to specific individuals, but this makes me worry that I may be harassing or annoying someone with unnecessary reports (such is the trouble of one-way communication).
Another thing worth mentioning: I think under-elicitation is a problem in dangerous-capabilities evals, in model organisms of misbehavior, and in some other settings. I’ve been privately working on a scaffolding framework that I think could help address some of the lowest-hanging fruit I see here. Of course, I don’t know whether Anthropic already has something similar internally, but I plan to share mine privately once I have it working.