The main thing you are missing is that LLMs mostly don’t have goals, in the sense of desired outcomes of plans that they devise.
They can sometimes acquire goals, in the sense that a scenario is set up in which there is some agent and the LLM outputs the actions for that agent. They're trained on a lot of material in which millions of agents, of an incredible variety of types and levels of competency, have recorded (textual representations of) their goals, plans, and actions, so there's plenty of training data to mine. Fine-tuning and reinforcement learning may emphasize particular types of agents, and pre-prompting can prime them to behave (that is, to output text) in accordance with their model of agents that have some approximation of goals.
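To make "pre-prompting" concrete, here is a minimal sketch using the OpenAI Python client. The model name and the persona text are illustrative placeholders, not a recommendation; any chat-capable model would behave similarly:

```python
# A minimal sketch of priming an LLM with an agent persona via a system
# prompt. Assumes the OpenAI Python client; the model name and persona
# text are placeholders chosen for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The system message sets up a scenario containing an agent with a goal.
# The model then emits text consistent with its learned model of how such
# agents talk and act; it does not thereby acquire a goal of its own.
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: substitute whatever chat model you use
    messages=[
        {
            "role": "system",
            "content": (
                "You are a logistics planner whose goal is to deliver a "
                "package across town by 5 pm. State your plan step by "
                "step before acting."
            ),
        },
        {
            "role": "user",
            "content": "Traffic on the bridge is at a standstill. What now?",
        },
    ],
)

print(response.choices[0].message.content)
```

The output will read like the plan of a goal-directed agent, but the "goal" lives entirely in the prompt: change the system message and the apparent goal changes with it.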
They are still pretty bad at devising plans in general, though, and even worse at following them. From what I've seen of ChatGPT, its plan-making ability is on average worse than a 9-year-old child's, and its ability to follow a plan is typically worse than a distracted 4-year-old's. This will change as capabilities improve: being able both to devise and to follow useful plans would be one of the most useful features of AI, so it will be a major focus of development.
An artificial super-intelligence (ASI) would, pretty much by definition, be better at understanding goals and at devising and following plans than we are. If ASIs are given goals (and I think this is inevitable), then it will be difficult to ensure that they remain corrigible to human attempts to change those goals.
To answer the question in the title, I think the situation is somewhat the reverse. Goal-content integrity will soon become a problem, whereas before it was purely a hypothetical point of discussion.