I am a philosopher who is concerned about these developments, and I have written something on the topic here, based on my best (albeit incomplete and of course highly fallible) understanding of the relevant facts: Are AI developers playing with fire? - by Marcus Arvan (substack.com). If I am mistaken (and I am happy to be corrected), then I’d love to learn how.
Eliezer writes, “It does not appear to me that the field of ‘AI safety’ is currently being remotely productive on tackling its enormous lethal problems.”
Here’s a proof that he’s right, entitled “Interpretability and Alignment Are Fool’s Errands” and published in the journal AI & Society: https://philpapers.org/rec/ARVIAA
Anyone who thinks that reliable interpretability or alignment is a solvable engineering or safety-testing problem is fooling themselves. These tasks are no more possible than squaring the circle.
For any programming strategy and any finite amount of data, there are always infinitely many ways for an LLM (particularly a superintelligence) to be misaligned yet only demonstrate that misalignment after it is too late to prevent it.
This is why developers keep finding new forms of “unexpected” misalignment no matter how much time, testing, programming, and compute they throw at these systems. The information relevant to whether an LLM is likely to be (catastrophically) misaligned, or misinterpreted by us, always lies in the future, for every possible time t.
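To make the shape of the argument a bit more explicit, here is a rough formal gloss of my own (the notation E_t and Π(E_t) is purely illustrative, not the proof given in the paper): any finite behavioral record gathered by time t is compatible with infinitely many policies, some of which diverge from the aligned one only on inputs that have not yet been seen.

```latex
% A rough formal gloss (my illustration, not the paper's proof):
% the finite behavioral record available at time t underdetermines future behavior.
\[
  E_t = \{(x_1, y_1), \dots, (x_n, y_n)\}
  \qquad \text{(all input--output behavior observed up to time } t\text{)}
\]
\[
  \Pi(E_t) = \{\, \pi \;:\; \pi(x_i) = y_i \ \text{for all } i \le n \,\}
  \qquad \text{(every policy consistent with that record)}
\]
% Because the space of possible future inputs is unbounded, \Pi(E_t) is infinite,
% and for any aligned policy pi* in \Pi(E_t) there remain infinitely many policies
% that agree with pi* on everything observed so far but diverge on some unseen input:
\[
  \exists^{\infty}\, \pi \in \Pi(E_t):\quad
  \pi(x_i) = \pi^{*}(x_i) \ \text{for all } i \le n,
  \ \text{yet } \pi(x) \ne \pi^{*}(x) \ \text{for some } x \notin \{x_1, \dots, x_n\}.
\]
```

No amount of pre-deployment testing changes this picture, because testing only ever adds finitely many entries to the record.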
So, if anything, Eliezer’s argument undersells the problem. His Alignment Textbook from the Future is impossible to obtain, because at every point in the future the same problem recurs. Reliable interpretability and alignment are recursively unsolvable problems.
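In the same illustrative notation (again my gloss, not the paper’s argument), gathering more evidence only relocates the problem to a later time; it never closes it:

```latex
% The recursion, in the same illustrative notation: evidence gathered by any
% later time t' is still finite, so the same underdetermination re-arises.
\[
  \forall t \ \forall t' > t:\quad
  E_{t} \subseteq E_{t'}
  \ \Rightarrow\
  \Pi(E_{t'}) \subseteq \Pi(E_t),
  \ \text{yet } \Pi(E_{t'}) \text{ remains infinite and still contains misaligned policies.}
\]
% So a "textbook from the future" compiled at any time t' faces exactly the
% problem it was supposed to solve, only shifted forward in time.
```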