This is helpful, thanks for summarizing the differences! I definitely agree on the first one.
On the second one, my concern is basically that all the safety guarantees quantilizers provide come with an inherent power/safety tradeoff (modulo whatever I’m missing from the “Targeted Impact” section).
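To make concrete what tradeoff I mean: a q-quantilizer samples from the top q fraction (by utility) of some trusted base distribution, and the usual guarantee bounds its expected cost on any cost function by 1/q times the base distribution’s expected cost, so shrinking q buys optimization power exactly at the price of a weaker safety bound. Here’s a minimal toy sketch, purely my own illustration (the `quantilize` helper and the toy setup are hypothetical, not anything from the post):

```python
import numpy as np

def quantilize(actions, base_probs, utility, q, rng=None):
    """Toy q-quantilizer: sample from the top-q slice (by utility) of a base distribution."""
    rng = rng or np.random.default_rng()
    order = np.argsort([-utility(a) for a in actions])  # highest utility first
    top_idx, mass = [], 0.0
    for i in order:
        top_idx.append(i)
        mass += base_probs[i]
        if mass >= q:  # stop once the slice holds a q-fraction of base probability mass
            break
    slice_probs = np.array([base_probs[i] for i in top_idx])
    slice_probs /= slice_probs.sum()  # renormalise within the slice
    return actions[rng.choice(top_idx, p=slice_probs)]

# Toy usage: 10 actions, uniform base distribution, utility = action index.
# q = 1.0 just reproduces the base distribution; q = 0.1 always picks the single
# best action; and the cost bound E_quantilizer[c] <= E_base[c] / q degrades as q shrinks.
actions = list(range(10))
base = [0.1] * 10
a = quantilize(actions, base, utility=lambda x: x, q=0.3)
```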
That said, it’s possible that your nested approach avoids the ‘simulate a deceptive AGI’ failure mode. At least, if it’s a continuous trajectory of improvement from median human performance up to very superhuman performance, you might hope that the trajectory doesn’t involve suddenly switching from human-like to AGI-like models. I don’t personally find this very comforting (it seems totally plausible to me that there’s a continuous path from “median human” to “very dangerous misaligned model” in model-space), but it does at least seem better than directly asking for a superhuman model.