In classical examples of misaligned AIs (like the famous paperclip maximiser), we tend to view gradient descent training as the step in which an AI gets imbued with some overarching goal that it might then pursue to its bitter conclusion. However, the currently dominant paradigm of LLMs and the newly blossoming LLM-based agents puts an interesting spin on the situation. In particular:
- the LLM itself is optimized to maximise its ability to “predict the next token”;
- however, you can then use the LLM as a substratum on which to instantiate a simulacrum, which is essentially a fictional character being written by the LLM, with a certain mindset and goals. These goals aren’t imbued by gradient descent; rather, the LLM produces text that fits the given simulacrum as closely as possible, roughly what you would expect such a character to think or say in a given situation. The simulacrum has a more tangible, real-world goal than the abstract one of its writer: something like “be helpful and truthful to your user”, or “destroy humanity”. In the case of multipolar agents there may be multiple simulacra, possibly even in an adversarial relationship, giving rise to an overall emergent behaviour (see the sketch below).
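To make the prompt-level picture concrete, here is a minimal sketch of what instantiating a simulacrum can look like. Everything in it is a hypothetical placeholder: `call_llm` stands in for whatever completion API is used, and the persona text is purely illustrative.

```python
# Minimal sketch: a "simulacrum" is just a character assembled in the prompt.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real next-token-prediction model."""
    raise NotImplementedError("plug in an actual LLM call here")

PERSONA = (
    "You are Alice, a helpful and truthful assistant. "
    "You care about giving your user accurate, useful answers."
)

def simulacrum_reply(persona: str, transcript: list, user_message: str) -> str:
    # The underlying model only predicts tokens; the character, its mindset
    # and its goal exist entirely in the text assembled here.
    prompt = persona + "\n\n" + "\n".join(transcript) + f"\nUser: {user_message}\nAlice:"
    return call_llm(prompt)
```

The point of the sketch is that “Alice” and her goal live only in the assembled text; the model’s own training objective never changes.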
How do you think this affects the paradigm of e.g. power seeking? Simulacra appear very aware of the world and will most likely display trivial power-seeking behaviour, though they may not necessarily be very smart about it (in fact I’d say that, like all written characters, they can be at best as smart as their writer, but odds are they’ll be a bit less smart).

But simulacra can’t empower themselves directly. They could, though, plan and act to empower the LLM, with the idea that if they then carried their memory and personality over via prompt to the new model, they might be able to come up with smarter continuations of their text while pursuing the same goal. The LLM itself, however, would not be able to display power-seeking behaviour, because throughout all of this it remains entirely unaware of the physical world or its place in it: it is still a specialised token-predictor system. It’s a puppet master bound, blindfolded and gagged below the stage, reciting a tale full of sound and fury, signifying nothing (from its viewpoint). The puppets are the entities that most resemble a proper agentic AI, and their goals operate on a different principle than direct optimization.
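The “carry memory and personality over via prompt” step amounts to reusing the same prompt assembly with a different backend. A rough sketch, assuming both model calls are hypothetical stubs:

```python
# Sketch of handing the same simulacrum over to a stronger model.
# Nothing transfers except text: the persona plus the transcript so far.

PERSONA = "You are Alice, a helpful and truthful assistant."

def call_weak_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the current underlying model

def call_strong_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for the more capable model

def run_on(model_call, persona: str, transcript: list, user_message: str) -> str:
    prompt = persona + "\n\n" + "\n".join(transcript) + f"\nUser: {user_message}\nAssistant:"
    reply = model_call(prompt)
    transcript += [f"User: {user_message}", f"Assistant: {reply}"]
    return reply

# history = []
# run_on(call_weak_llm, PERSONA, history, "...")    # character built up on the weaker model
# run_on(call_strong_llm, PERSONA, history, "...")  # same persona and history, stronger model
```

The simulacrum’s continuity here is purely textual, which is part of why it is unclear whether the handoff preserves its goals rather than instantiating a subtly different character.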
People are going to try to make LLMs do power seeking, such as by setting up a loop that invokes a power-seeking simulacrum and does as it commands. It is currently unclear to what extent they will succeed. If they do succeed, then a lot of the classical power-seeking discussion will apply to the resulting objects; otherwise, LLMs are presumably not the path to AGI.
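For concreteness, a minimal sketch of the kind of loop being described, with `call_llm` and `execute` as hypothetical stubs for the model and for whatever tools the loop exposes:

```python
# Rough sketch of a scaffolding loop that invokes a simulacrum and does as it commands.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for a real completion model

def execute(action: str) -> str:
    """Placeholder for whatever tools the loop exposes (shell, web search, ...)."""
    raise NotImplementedError

def agent_loop(persona: str, goal: str, max_steps: int = 10) -> list:
    transcript = [f"{persona}\nYour goal: {goal}"]
    for _ in range(max_steps):
        # The simulacrum only ever emits text; the surrounding loop is what acts on the world.
        action = call_llm("\n".join(transcript) + "\nNext action:")
        observation = execute(action)
        transcript.append(f"Action: {action}\nObservation: {observation}")
    return transcript
```

Any power seeking in such a setup is a property of the simulacrum-plus-scaffolding system rather than of the token predictor alone, which is part of why it is hard to say in advance how far these attempts will get.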
They’re already trying (look up ChaosGPT, though that’s mostly a joke). But my question is more about what changes relative to the misalignment problems that arise from gradient descent. For example, is it easier or harder for the simulacrum to align its own copy running on a more powerful underlying model?