Note that this paper already used “Language Agents” to mean something else. See link below for other possible terms. I will keep using “Language Agents” in this comment/thread (unless the OP decides to change their terminology).
I added the tag Chain-of-Thought Alignment, since there’s a bunch of related discussion on LW under that tag. I’m not very familiar with this discussion myself, and have some questions below that may or may not already have good answers.
How competent will Language Agents be at strategy/planning, compared to humans and other AI approaches (before considering the next paragraph)? A human doing strategy/planning has access to their beliefs and desires as encoded by synaptic weights in their brain (as well as textual versions of their beliefs and desires if they choose to write them down or have an internal monologue about them), whereas Language Agents would only have access to the textual versions. How much of a problem is this (e.g., how much of a human’s real beliefs/desires can be captured in text)?
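To make the distinction concrete, here is a minimal, hypothetical sketch of what “only has access to the textual versions” might look like; the names (`MemoryStream`, `call_llm`) are made up for illustration and not taken from the post:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStream:
    """Hypothetical memory stream: beliefs and desires stored as plain text."""
    beliefs: list[str] = field(default_factory=list)
    desires: list[str] = field(default_factory=list)

def plan(memory: MemoryStream, situation: str, call_llm) -> str:
    """The planning step only ever sees the textual beliefs/desires.
    Whatever the base model 'knows' in its weights enters only through
    its response to this prompt, not as inspectable state."""
    prompt = (
        "Beliefs:\n" + "\n".join(f"- {b}" for b in memory.beliefs)
        + "\n\nDesires:\n" + "\n".join(f"- {d}" for d in memory.desires)
        + f"\n\nSituation: {situation}\nPropose a plan:"
    )
    return call_llm(prompt)
```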
As the world changes and the underlying LLM goes more and more out of distribution (for example, lacking some concept or way of thinking that is important for reasoning about the new world), what happens? Do we wait for humans to invent the new concept or new way of thinking, use it enough to generate a bunch of new training data, and then update the LLM? (That seems too slow/uncompetitive?)
When Paul Christiano worked on IDA (which shares some similarities with Language Agents) he worried about “security” of and “attacks” on the base model and proposed solutions that had significant costs. I don’t see similar discussion around Language Agents. Is that ok/reasonable or not?
Suppose Language Agents work out the way you think or hope, what happens afterwards? (How do you envision going from that state to a state of existential safety?)
Thank you for your reactions:
-Good catch on ‘language agents’; we will think about the best terminology going forward.
-I’m not sure what you have in mind regarding accessing beliefs/desires via synaptic weights rather than text. For example, the language-of-thought approach to human cognition suggests that human access to beliefs/desires is also fundamentally syntactic rather than weight-based. OTOH, one way to incorporate some kind of weight would be to assign probabilities to the beliefs stored in the memory stream (a rough sketch of this is below).
-For OOD over time, I think updating the LLM wouldn’t be uncompetitive for inventing new concepts/ways of thinking, because that happens slowly. The harder issue is updating on new world knowledge. Maybe browser plugins will fill the gap here; that’s an open question (also sketched below).
-I agree that info security is an important safety intervention. AFAICT its value is independent of using language agents vs. RL agents.
-One end-game is systematic conflict between humans + language agents vs. RL/transformer-agent successors to MuZero, GPT-4, Gato, etc.
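On the probability-weighted beliefs point above, here is a minimal sketch of what that could look like; everything in it (the `Belief` class, the example beliefs, the 0.5 threshold) is hypothetical and just illustrates the shape of the idea:

```python
from dataclasses import dataclass

@dataclass
class Belief:
    text: str           # the belief itself, stored as natural language
    probability: float  # credence attached to it, in [0, 1]

# Hypothetical memory stream with graded rather than all-or-nothing beliefs.
memory_stream = [
    Belief("The meeting was moved to Friday at 3pm", 0.95),
    Belief("The collaborator prefers email over calls", 0.60),
]

def beliefs_for_prompt(stream: list[Belief], threshold: float = 0.5) -> str:
    """Render sufficiently credible beliefs, with their credences, so the
    'weight' on each belief is visible to the agent's reasoning step."""
    return "\n".join(
        f"- ({b.probability:.2f}) {b.text}"
        for b in stream
        if b.probability >= threshold
    )

print(beliefs_for_prompt(memory_stream))
```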
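And for the browser-plugin idea, a rough sketch of keeping the LLM’s weights frozen while routing fresh world knowledge into the textual memory stream via retrieval; `search_web` and `call_llm` stand in for a browser/search tool and the base model, and are assumptions rather than anything from the post:

```python
def refresh_world_knowledge(beliefs: list[str],
                            queries: list[str],
                            search_web,
                            call_llm) -> list[str]:
    """Hypothetical update loop: fetch recent information with a browser/search
    tool, summarize it with the (frozen) LLM, and append the summaries to the
    textual memory stream instead of waiting for the base model to be retrained."""
    for query in queries:
        results = search_web(query)  # e.g., a browser plugin call
        summary = call_llm(
            "Summarize these search results as short factual statements, "
            f"one per line:\n{results}"
        )
        beliefs.extend(line.strip("- ").strip()
                       for line in summary.splitlines() if line.strip())
    return beliefs
```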