Humans are normally agentic (sadly, they can also quite often be selfish, power-seeking, deceitful, bad-tempered, untrustworthy, and/or generally unaligned). Standard unsupervised LLM foundation model training teaches LLMs to emulate humans as text-generation processes. This inevitably includes modelling many aspects of human psychology, including the agentic ones and the unsavory ones. So LLMs have trained-in agentic behavior before any RL is applied, even if you use entirely non-RL means to attempt to make them helpful/honest/harmless (e.g. as Google did with LaMDA). They have been trained on a great many examples of deceit, power-seeking, and every other kind of nasty human behavior, so RL is not the primary source of the problem.
The alignment problem is about producing something that we are significantly more certain is aligned than a typical randomly-selected human. Handing a randomly-selected human absolute power over all of society is unlikely to end well. What we need to train is a selfless altruist who (platonically or parentally) loves all humanity. For lack of better terminology: we need to create a saint or an angel.