DanielFilan comments on An Analytic Perspective on AI Alignment

DanielFilan 22 Mar 2020 1:07 UTC
Papers

I agree that none of the papers are incredibly convincing on their own. I’d say the most convincing empirical work so far is the sequence of posts on ‘circuits’ on Distill, starting with this one, but even that isn’t totally compelling. The papers are just meant to provide some evidence that this sort of thing is possible, and to stand against the lack of papers proving that it isn’t (although of course, even if it were impossible, that would be hard to prove).

Re: the Rashomon paper, you’re right, that implication doesn’t hold. But it is suggestive that there may well be ‘interpretable’ models that are near-optimal.

Re: the regularisation paper, I agree that it doesn’t work that well. But it’s the first paper in this line of work, and I think it’s plausibly illustrative of things that might work.