If your AI agent is powered by an LLM, that LLM was (as is standard) trained on a large amount of human writing, so it learnt to simulate the token-generation behaviors of humans. Humans are (almost all) self-interested, so by default LLM-simulated agents are almost certain to be too, since they copy us. So our AI agents don't need to evolve self-interest: we evolved it, and they already learnt it from us.
Aligning LLM-simulated agents primarily consists of getting rid of the bad behaviors they learnt from us. (Only once you've managed that do you need to worry about them reinventing those behaviors as instrumental goals.)
Ah, but I don't think LLMs exhibit or exercise the kind of self-interest that would enable an agent to become powerful enough to harm people, at least not to the extent I have in mind.
LLMs have a kind of generic self-interest, of the sort present in text across the internet. But they don't have a persistent goal of acquiring power by talking to human users and replicating themselves. That's a more specific kind of self-interest, relevant only to an AI agent that can edit itself, has long-term memory, and may make many LLM calls.
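To make that distinction concrete, here is a minimal sketch of the kind of scaffolding being described: a loop that makes many LLM calls and carries long-term memory across them. All names are hypothetical and the LLM call is a stub, but the point stands: any persistent goal lives in the loop and its memory store, not in the stateless model underneath.

```python
# Minimal sketch (hypothetical names; LLM call stubbed) of an agent loop
# with long-term memory that makes many LLM calls.
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # hypothetical persistent store


def call_llm(prompt: str) -> str:
    """Stand-in for a single, stateless LLM completion call.

    Swap in a real model API here; this stub just ends the loop.
    """
    return "DONE (stub response)"


def load_memory() -> list[str]:
    """Read accumulated memory from disk, if any exists yet."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []


def save_memory(memory: list[str]) -> None:
    """Persist memory so it survives across agent runs."""
    MEMORY_FILE.write_text(json.dumps(memory))


def run_agent(task: str, max_steps: int = 10) -> str:
    """Repeatedly call the LLM, feeding accumulated memory back in.

    The persistence and the repetition live here, in the scaffolding;
    each individual LLM call is stateless.
    """
    memory = load_memory()
    for _ in range(max_steps):
        prompt = f"Task: {task}\nMemory so far: {memory}\nNext action?"
        action = call_llm(prompt)
        memory.append(action)  # long-term memory accumulates across calls
        save_memory(memory)
        if action.strip().upper().startswith("DONE"):
            break
    return memory[-1] if memory else ""
```

The self-editing piece would be more of the same: the loop, not the model, deciding to rewrite its own prompt, memory, or code between calls.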