Rohin Shah comments on AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

Rohin Shah 2 Sep 2024 21:54 UTC
1 point
−1
I agree leakage would have some effect, but it clearly can’t be a large one, since the accuracies aren’t near-100% for any of the methods. The mechanism you suggest is plausible, but it can’t be the primary cause of the finding that debate doesn’t have an advantage: since accuracies aren’t near-100% we know there are some cases the model hasn’t memorized, so the mechanism you suggest doesn’t apply to those inputs.
More generally, all sorts of things have systematic undesired effects on our results, aka biases. E.g. I suspect the prompts are a bigger deal. Basically any empirical paper will be subject to the critique that aspects of the setup introduce biases.
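The accuracy argument can be made concrete with a rough sketch. It assumes (these are illustrative assumptions, not claims from the paper) that every memorized question is answered correctly and that non-memorized questions are answered at least at chance level. Under those assumptions, observed accuracy puts an upper bound on the fraction of questions that could have been memorized. The function name and the default chance level are hypothetical.

```python
def max_memorized_fraction(observed_accuracy: float, chance_accuracy: float = 0.5) -> float:
    """Upper bound on the fraction of questions the model could have memorized.

    Assumptions (illustrative only):
      * memorized questions are always answered correctly;
      * non-memorized questions are answered with accuracy >= chance_accuracy.

    With m the memorized fraction and a the accuracy on the rest,
    observed = m + (1 - m) * a >= m + (1 - m) * chance,
    so m <= (observed - chance) / (1 - chance).
    """
    if not 0.0 <= observed_accuracy <= 1.0:
        raise ValueError("observed_accuracy must be in [0, 1]")
    if not 0.0 <= chance_accuracy < 1.0:
        raise ValueError("chance_accuracy must be in [0, 1)")
    bound = (observed_accuracy - chance_accuracy) / (1.0 - chance_accuracy)
    # Clamp: accuracy below chance means the bound is simply 0.
    return min(1.0, max(0.0, bound))


if __name__ == "__main__":
    # Made-up numbers for illustration, not results from the paper.
    print(max_memorized_fraction(0.8, chance_accuracy=0.5))
```

With the made-up inputs above (80% accuracy on a two-choice task), at most about 60% of questions could be memorized, so at least about 40% are inputs where the leakage mechanism cannot apply.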
- Osaid Nasir 5 Sep 2024 6:46 UTC
  1 point
  0
  Parent
  > since accuracies aren’t near-100% we know there are some cases the model hasn’t memorized, so the mechanism you suggest doesn’t apply to those inputs
  That makes sense.
  > I suspect the prompts are a bigger deal
  Do you suppose that replicating these experiments with LLM debaters/judges of different sizes could serve as a suitable proxy for prompt quality? Let’s say P is the optimal prompt and Q is a suboptimal one; then (LLM performance with prompt Q) ≤ (LLM performance with prompt P) ≤ (bigger LLM performance with prompt Q).