This is cute, but I have strong qualms about your third prediction; I don’t disagree, per se, but:
Either “variants of this approach” is too broad to be useful, including things like safety by debate and training a weak AI to check the input;
Or, if I take “variants” narrowly to mean using an AI to check its own inputs, my estimate is “basically zero.”
So I want to double check: what counts as a variant and what doesn’t?
I was using it rather broadly, covering situations where a smart AI is used to oversee another AI and that oversight is a key part of the approach. I wouldn’t usually include safety by debate or input checking, though I might count safety by debate if there were a smart AI overseer of the process that was making important interventions.
In that case, I don’t see why the problem of “system alignment” or “supervisor alignment” is any easier than the problem of “supervisee alignment.”