Misaligned wrapper-mind optimization was a popular early worry because it's possible in principle and seems convergently useful for all sorts of purposes, making it a plausible selection outcome. It has become less relevant recently because it now seems more likely to happen only some time after much more anthropomorphic, human-imitating language model AGIs already have decisive influence over the world, so it's something those AGIs would need to worry about, rather than us.
Something similar seems to be the case for recursive self-improvement. Language models already seem capable enough in principle, but they are insufficiently sane/agentic to act coherently in an autonomous manner. So any AI-risk-relevant self-improvement is not about an increase in straightforwardly definable capability; it's about tuning models towards sanity. Algorithmic self-improvement is something that happens automatically after that point, and doesn't seem either plausible or necessary before it.