I think you have a wrong model of the process, which comes from conflating outcome-alignment and intent-alignment.
Current LLMs are outcome-aligned, i.e., they produce “good” outputs. But in the pessimist model, the internal mechanisms by which an LLM produces “good outputs” have nothing in common with “being nice” or “caring about humans”; they are more like “producing weird text patterns,” and if we make LLMs sufficiently smarter, they turn the world into text patterns or do something else unpredictable. I.e., it’s not that the control structures of LLMs are nice right now and stop being nice when we make LLMs smarter; they simply aren’t about “being nice” in the first place.
On the other hand, humans are at least somewhat intent-aligned, and as long as we don’t perform really radical rearrangements of brain matter, we can expect them to stay intent-aligned.