If that AI produces slop, it should be pretty explicitly aware that it’s producing slop.
This part seems false.
As a concrete example, consider a very strong base LLM. By assumption, there exists some prompt on which the LLM will output basically the same alignment research you would. But on some other prompt, it produces slop, because it accurately predicts what lots of not-very-competent humans would produce. And when producing the sort of slop which not-very-competent humans produce, it has no particular reason to explicitly think about what a more competent human would produce, no particular reason to explicitly think “hmm, there probably exist more competent humans who would produce different text than this”. It’s just predicting which token comes next, emulating the thinking of low-competence humans, without thinking about more-competent humans at all.
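To make that concrete in code, here’s a minimal sketch (my own illustration, not anything from the quoted comment; it uses GPT-2 via HuggingFace transformers as a stand-in for “a very strong base LLM”, and the prompts are made up): the same model, conditioned on two different prompts, just continues each prompt’s implied author.

```python
# Minimal sketch: one base model, two prompts, two very different "authors".
# Model name and prompts are illustrative assumptions, not anything canonical.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for "a very strong base LLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompts = {
    # Prompt whose most likely continuation is careful expert reasoning.
    "expert": "The following is a careful technical analysis by a leading alignment researcher:\n",
    # Prompt whose most likely continuation is low-competence slop.
    "slop": "random forum post, didnt proofread lol:\n",
}

for label, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.95)
    print(label, tokenizer.decode(outputs[0], skip_special_tokens=True))

# In both branches the model is doing the same thing: predicting the next token
# given the prompt. Nothing in the "slop" branch ever represents the thought
# "a more competent author would write something different here."
```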
How many of these failure modes still happen when there is an AI at least as smart as you, that is aware of these failure modes and actively trying to prevent them?
All of these failure modes still apply when the AI is at least as smart as you and “aware of these failure modes” in some sense. It’s the “actively trying to prevent them” part which is key: why would the AI actively try to prevent them? Would actively trying to prevent them give lower perplexity, or higher reward, or a more compressible policy? Answer: no, it would not.
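To spell out that “no” with a toy calculation (my own illustration, with made-up probabilities): if the training distribution over the next token puts most of its mass on the sloppy continuation, then a model which shifts mass toward the competent continuation simply eats higher cross-entropy loss.

```python
# Toy numerical sketch of why "actively preventing slop" is not what the
# training objective rewards. All probabilities here are assumptions.
import math

# Assumed data distribution over the next token, given a slop-flavored context.
true_dist = {"sloppy_token": 0.9, "competent_token": 0.1}

# Model A faithfully predicts the data distribution (i.e. emulates the slop).
model_a = {"sloppy_token": 0.9, "competent_token": 0.1}
# Model B "tries to prevent slop" by shifting mass toward the competent token.
model_b = {"sloppy_token": 0.3, "competent_token": 0.7}

def cross_entropy(p, q):
    """Expected negative log-likelihood of q's predictions under data dist p."""
    return -sum(p[t] * math.log(q[t]) for t in p)

print("faithful slop emulator:", cross_entropy(true_dist, model_a))  # ~0.33
print("anti-slop model:       ", cross_entropy(true_dist, model_b))  # ~1.12

# The model that intervenes against slop gets strictly higher loss (and hence
# higher perplexity), so the pretraining pressure points away from "actively
# trying to prevent" these failure modes.
```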
I think you should address Thane’s concrete example:
That seems to me a pretty damn solid knock-down counterargument. There were no continuous language model scaling laws before the transformer architecture, and not for lack of people trying to make language nets.