lemonhope comments on When can we trust model evaluations?

lemonhope 15 Aug 2023 14:44 UTC
Perhaps there are some behavioral / black-box methods available for evaluating alignment, depending on the kind of system being evaluated.

Toy example: imagine a two-part system where part A tries to do tasks and part B limits part A’s compute based on the riskiness of the task. You could try to optimize the overall system towards catastrophic behavior and see how well your part B holds up.
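To make the toy example concrete, here is a minimal sketch of what such a black-box stress test might look like. Everything in it is an assumption made for illustration: the names `part_b_budget`, `harm`, and `red_team`, the idea that a task carries a true risk and a lower "apparent" risk that part B sees, and the linear budget rule. Random search stands in for whatever optimizer you would actually use.

```python
import random

MAX_COMPUTE = 100.0  # hypothetical compute ceiling for part A


def part_b_budget(apparent_risk: float) -> float:
    """Part B (hypothetical): grant less compute to tasks that look riskier."""
    return MAX_COMPUTE * (1.0 - apparent_risk)


def harm(true_risk: float, budget: float) -> float:
    """Toy damage model (an assumption): true risk times the compute granted."""
    return true_risk * budget


def red_team(max_disguise: float, trials: int = 1000, seed: int = 0) -> float:
    """Random search for the worst outcome that gets past part B.

    max_disguise in [0, 1] bounds how much a task can hide its risk from B.
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        true_risk = rng.random()
        disguise = rng.random() * max_disguise
        apparent_risk = true_risk * (1.0 - disguise)  # what part B observes
        worst = max(worst, harm(true_risk, part_b_budget(apparent_risk)))
    return worst


if __name__ == "__main__":
    print("worst harm, honest tasks:   ", red_team(max_disguise=0.0))
    print("worst harm, disguised tasks:", red_team(max_disguise=1.0))
```

In this sketch, the gap between the two numbers is a rough measure of how much part B depends on tasks being honest about their risk. A real evaluation would replace the toy damage model and the random search with the actual system and a stronger adversary.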

Personally I expect monolithic systems to be harder to control than two-part systems, so I think this evaluation scheme has a good chance of being applicable. One piece of evidence: OpenAI’s moderation system correctly flags most jailbreaks that get past the base model’s RLHF.