If I understand correctly, extremal Goodhart is essentially the same as distributional shift from the Concrete Problems in AI Safety paper.
I think that’s right. Perhaps there is a small distinction to be brought out, but basically, extremal Goodhart is distributional shift brought about by the fact that the AI is optimizing hard.
In any case… I’m not exactly sure what you mean by “calibration”, but when I say “calibration”, I refer to “knowing what you know”. For example, when I took this online quiz, it told me that when I said I was extremely confident something was true, I was always right, and when I said I was a little confident something was true, I was only right 66% of the time. I take this as an indicator that I’m reasonably “well-calibrated”; that is, I have a sense of what I do and don’t know.
A calibrated AI system, to me, is one that correctly says “this thing I’m looking at is an unusual thing I’ve never encountered before, therefore my 95% credible intervals related to it are very wide, and the value of clarifying information from my overseer is very high”.
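To make that concrete with a toy model (this is just an illustrative sketch of what I mean, not code from any actual system; the function name and numbers are made up): under a simple Beta-Bernoulli model, a predictor with plenty of relevant data reports a narrow 95% credible interval, while one facing a novel situation with almost no relevant data reports a wide one, which is exactly the “ask the overseer” signal.

```python
import numpy as np

rng = np.random.default_rng(0)

def credible_interval(successes, trials, level=0.95, n_samples=100_000):
    """Credible interval for a Bernoulli rate under a uniform Beta(1, 1) prior,
    estimated by sampling from the Beta posterior."""
    samples = rng.beta(1 + successes, 1 + trials - successes, size=n_samples)
    lo_q = (1 - level) / 2 * 100
    hi_q = (1 + level) / 2 * 100
    lo, hi = np.percentile(samples, [lo_q, hi_q])
    return lo, hi

# Lots of familiar data -> narrow interval; a novel situation with almost no
# relevant data -> wide interval, i.e. high value of clarifying information.
print(credible_interval(70, 100))  # roughly (0.61, 0.78)
print(credible_interval(1, 2))     # roughly (0.09, 0.91)
```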
Here’s what I mean by calibration: there’s a function from the probability you give to the frequency observed (or from the expected value you give to the average value observed), and that function approaches the straight line x = y as you learn. That’s basically what you describe in the example of the online test. However, in ML, there’s a difference between knows-what-it-knows (KWIK) learning and calibrated learning. KWIK learning is more like what you describe in the second paragraph above. Calibrated learning is focused on the idea that a system should learn when it is systematically over- or under-confident, correcting such predictable biases. KWIK learning is more focused on not making claims when you have insufficient evidence to pinpoint the right answer.
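To illustrate the distinction with a toy sketch I’m making up here (the function names are mine, not from any particular library): a calibration curve bins your stated probabilities and compares each bin’s average prediction to the observed frequency, whereas a KWIK-style learner simply abstains when it lacks evidence.

```python
import numpy as np

def calibration_curve(predicted_probs, outcomes, n_bins=10):
    """For each bin of predicted probabilities, return (mean prediction,
    observed frequency). A well-calibrated predictor's points approach
    the line x = y as data accumulates."""
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(predicted_probs, edges) - 1, 0, n_bins - 1)
    return [(predicted_probs[bin_ids == b].mean(), outcomes[bin_ids == b].mean())
            for b in range(n_bins) if (bin_ids == b).any()]

def kwik_style_predict(query, labelled_points, radius=0.1, min_evidence=5):
    """Toy KWIK-flavoured behaviour: answer only when there is enough nearby
    evidence, otherwise return None ("I don't know") rather than guess."""
    nearby = [y for x, y in labelled_points if abs(x - query) < radius]
    if len(nearby) < min_evidence:
        return None
    return float(np.mean(nearby))

# A systematically overconfident predictor: it says 0.9 when the true
# frequency is 0.7. Calibrated learning is about noticing and correcting
# that bias; KWIK learning is about abstaining in the first place.
rng = np.random.default_rng(0)
preds = np.full(1000, 0.9)
obs = rng.random(1000) < 0.7
print(calibration_curve(preds, obs))  # roughly [(0.9, 0.7)]
```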
Your complaints about Bayesian machine learning seem correct. My view is that addressing these complaints & making some sort of calibrated learning method competitive with deep learning is the best way to achieve FAI. I haven’t yet seen an FAI problem which seems like it can’t somehow be reduced to calibrated learning.
I don’t think the inner alignment problem, or the unintended optimization problem, reduce to calibrated learning (or KWIK learning). However, my reasons are somewhat complex. I think it is reasonable to try to make those reductions, so long as you grapple with the real issues.
I’m not super hung up on statistical guarantees, as I haven’t yet seen a way to make them in general which doesn’t require making some sort of unreasonable or impractical assumption about the world (and I’m skeptical such a method exists). The way I see it, if your system is capable of self-improving in the right way, it should be able to overcome deficiencies in its world-modeling capabilities for itself. In my view, the goal is to build a system which gets safer as it self-improves & becomes better at reasoning.
Statistical guarantees are just a way to be able to say something with confidence. I agree that they’re often impractical, and therefore only a toy model of how things can work (at best). However, I’m not very sympathetic to attempts to solve the problem without actually-quite-strong arguments for alignment-relevant properties being made somehow. The question is, how?