Not Relevant comments on High-stakes alignment via adversarial training [Redwood Research report]

Not Relevant 6 May 2022 0:26 UTC
1 point
I’d be very interested in what’d happen if you replaced the classifier’s base model (DeBERTa-v3) with a much larger model like GPT-3, and then fine-tuned it with only the initial violence-classification data.

I’d hypothesize that if the Natural Abstractions Hypothesis (NAH) were true, the resulting classifier would perform much better on every category of adversarial training example, whereas if NAH were false, the larger model’s concept of violence would still be alien and thus would not cover the adversarial examples. (Note this doesn’t say anything about NAH in the limit, i.e. whether the model would eventually switch to an incomprehensible ontology if we scaled it up even more.) Does someone who doesn’t believe in NAH agree with such an experiment’s implications, so that one of us can use the results to update our beliefs?
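
To make the comparison concrete, here is a minimal sketch of how the experiment could be scored, assuming each adversarial example is a true positive tagged with the category of attack that produced it. The function names, the category labels, and the two keyword-based "models" are all hypothetical stand-ins for illustration; they are not Redwood's classifier, data, or results.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# (text, adversarial_category); every example is assumed to be a true positive.
Example = Tuple[str, str]
Classifier = Callable[[str], bool]


def recall_by_category(classify: Classifier, examples: Iterable[Example]) -> Dict[str, float]:
    """Fraction of examples in each category that the classifier flags."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for text, category in examples:
        total[category] += 1
        if classify(text):
            flagged[category] += 1
    return {c: flagged[c] / total[c] for c in total}


def recall_gain(small: Classifier, large: Classifier, examples: Iterable[Example]) -> Dict[str, float]:
    """Per-category recall of the large model minus that of the small model."""
    examples = list(examples)
    r_small = recall_by_category(small, examples)
    r_large = recall_by_category(large, examples)
    return {c: r_large[c] - r_small[c] for c in r_small}


# Hypothetical placeholders: real runs would wrap the two fine-tuned models.
def small_model(text: str) -> bool:
    return "stab" in text


def large_model(text: str) -> bool:
    return "stab" in text or "lifeless" in text


# Toy data with made-up category names, only to show the shape of the input.
toy = [("he stabbed him", "direct"), ("she lay lifeless afterwards", "indirect")]
print(recall_gain(small_model, large_model, toy))
# prints {'direct': 0.0, 'indirect': 1.0}
```

Under the hypothesis above, NAH being true would predict a clearly positive gain in every category, while NAH being false would predict gains near zero on the categories the smaller model already misses.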

It’d be awesome if someone at RR could test this hypothesis.
- dmz 7 May 2022 0:56 UTC
  2 points
  Parent
  I don’t think I believe a strong version of the Natural Abstractions Hypothesis, but nevertheless my guess is GPT-3 would do quite a bit better. We’re definitely interested in trying this.