Unless this problem is resolved, I don’t see how any AI alignment approach that involves using future ML—that looks like contemporary ML but at an arbitrarily large scale—could be safe.
I think that’s a bit extreme, or at least misplaced. Gradient hacking is just something that makes catching deceptive alignment more difficult. Deceptive alignment is the real problem: if you can prevent deceptive alignment, then you can prevent gradient hacking. And I don’t think it’s impossible to catch deceptive alignment in something that looks similar to contemporary ML, or at least, if it is impossible, I don’t think that’s clear yet. I mentioned some of the ML transparency approaches I’m excited about in this post, though for a full treatment of that problem see “Relaxed adversarial training for inner alignment.”