A useful distinction. Yet of the rare outcomes that follow current timelines without ending in ruin, I expect the most likely one falls into neither category. Instead it's an AGI that behaves like a weird supersmart human, one that bootstrapped its humanity from language models with relatively little architectural support for alignment, as a result of a theoretical miracle where things like that turn out to be the default outcome. This might come from giving a language model autonomy/agency to debug its own thinking, with notebooks and a working memory, tuning the model in the process. It's not going to reliably do as it's told and could be deceptive, yet possibly it doesn't turn everything into paperclips. Arguably it's aligned, but only the way weird individual humans are aligned, which is noncentrally strawberry-aligned and too-indirectly-to-use-the-term CEV-aligned.