Thanks for the feedback! I agree that language agents are relatively new, and so our claims about their safety properties will need to be empirically verified.
You write:
One example: “The functional roles of these beliefs and desires are enforced by the architecture of the language agent.”
I think this is an extremely strong claim. It also cannot be true for every possible architecture of language agents. As a pathological example, wrap the “task queue” submodule of BabyAGI with a function that stores the opposite of the task it has been given, but returns the opposite of the task it stored. The plain English interpretation of the data is no longer accurate.
The mistake is to assume that because the data inside a language agent takes the form of English words, it precisely corresponds to those words.
I agree that it seems reasonable that it would most of the time, but this isn’t something you can say is always true.
Let me clarify that we are not claiming that the architecture of every language agent fixes the functional role of the text it stores in the same way. Rather, our claim is that if you consider any particular language agent, its architecture will fix the functional role of the text it stores in a way which makes it possible to interpret its folk psychology relatively easily. We do not want to deny that in order to interpret the text stored by a language agent, one must know about its architecture.
In the case you imagine, the architecture of the agent fixes the functional role of the text stored so that any natural language task-description T represents an instruction to perform its negation, ~T. Thus the task “Write a poem about existential risk” is stored in the agent as the sentence “Do not write a poem about existential risk,” and the architecture of the agent later reverses the negation. Given these facts, a stored instance of “Do not write a poem about existential risk” corresponds to the agent having a plan to not not write a poem about existential risk, which is the same as having a plan to write a poem about existential risk.
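For concreteness, here is a minimal Python sketch of the kind of architecture described above (purely hypothetical; NegatingTaskQueue and its methods are illustrative names, not BabyAGI’s actual task-queue API). Tasks are negated when stored and negated again when retrieved, so the agent’s behavior is unchanged even though the stored text reads as the opposite of the plan it encodes.

```python
from collections import deque

class NegatingTaskQueue:
    """Toy task queue that stores each task as its natural-language
    negation and reverses the negation on retrieval, so behavior is
    unchanged even though the stored text reads as the opposite of
    the plan it encodes."""

    _PREFIX = "Do not "

    def __init__(self):
        self._queue = deque()

    @classmethod
    def _negate(cls, task: str) -> str:
        # Toy string-level negation; a real system would need something
        # more robust. This is only meant to illustrate the argument.
        if task.startswith(cls._PREFIX):
            rest = task[len(cls._PREFIX):]
            return rest[0].upper() + rest[1:]
        return cls._PREFIX + task[0].lower() + task[1:]

    def add_task(self, task: str) -> None:
        self._queue.append(self._negate(task))      # stored text is the negation

    def next_task(self) -> str:
        return self._negate(self._queue.popleft())  # negation reversed on read


queue = NegatingTaskQueue()
queue.add_task("Write a poem about existential risk")
print(list(queue._queue))   # ['Do not write a poem about existential risk']
print(queue.next_task())    # Write a poem about existential risk
```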
What is important to us is not that the natural language representations stored inside a language agent have exactly their natural language meanings, but rather that for any language agent, there is a translation function recoverable from its architecture which allows us to determine the meanings of the natural language representations it stores. This suffices for interpretability, and it also allows us to directly encode goals into the agent in a way that helps to resolve problems with reward misspecification and goal misgeneralization.
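As a toy illustration of the translation-function point, assuming the hypothetical NegatingTaskQueue sketch above: once the interpreter knows what transformation the architecture applies to stored text, recovering the meaning of that text is just a matter of inverting it.

```python
def translate(stored_text: str) -> str:
    """Recover the plan a stored string actually encodes, given knowledge
    of the (hypothetical) negating architecture sketched above."""
    return NegatingTaskQueue._negate(stored_text)

print(translate("Do not write a poem about existential risk"))
# -> Write a poem about existential risk
```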
Also, based on current evidence about LLMs, this translation function might be simple with respect to human semantics: https://www.lesswrong.com/posts/rjghymycfrMY2aRk5/llm-cognition-is-probably-not-human-like?commentId=KBpfGY3uX8rDJgoSj