I appreciate the clarification, and I’ll try to keep that distinction in mind going forward! To rephrase my claim in this language, I’d say that an LLM as a whole does not have a behavioral goal except for “predict the next token”, which is not sufficiently descriptive as a behavioral goal to answer a lot of the questions AI researchers care about (like “is the AI safe?”). In contrast, the simulacra the model produces can be described much better by more precise behavioral goals. For instance, one might say ChatGPT (with the hidden prompt we aren’t shown) has the behavioral goal of being a helpful assistant, or that an LLM roleplaying as a paperclip maximizer has the behavioral goal of producing a lot of paperclips. But an LLM as a whole could contain simulacra with all of those behavioral goals and many more, and because of that diversity it can’t be well described by any behavioral goal more precise than “predict the next token”.