Do you have a source for “Large labs (OpenAI and Anthropic, at least) are pouring at least tens of millions of dollars into this avenue of research”? I think a lot of the current work, like RLHF, pertains to LMA alignment but isn’t LMA alignment per se (I’d make a distinction between aligning the black-box models that compose the LMA versus aligning the LMA itself).
I meant the whole spectrum of “LLM alignment”, which I think is better counted as a single “avenue of research”, because critiques and feedback generated at “LMA production time” could just as well be applied during the pre-training and fine-tuning phases (Constitutional AI style). It’s only reasonable for large AGI labs either to ban LMAs on top of their APIs entirely (as Connor Leahy suggests), or to research their safety themselves (as they have already started to do, to a degree, with ARC’s evals of GPT-4, for instance).
I meant the whole spectrum of “LLM alignment”, which I think is better counted as a single “avenue of research”, because critiques and feedback generated at “LMA production time” could just as well be applied during the pre-training and fine-tuning phases (Constitutional AI style).
If I’m understanding correctly, is your point here that you view LLM alignment and LMA alignment as the same? If so, this might be a matter of semantics, but I disagree; I feel like the distinction is similar to that between ensuring that the people who make up the government are good (the LLMs in an LMA) and trying to design a good governmental system itself (e.g. dictatorship, democracy, futarchy, separation of powers, etc.). The two areas are certainly related, and a failure in one can mean a failure in the other, but they can involve very separate and non-overlapping considerations.
It’s only reasonable for large AGI labs either to ban LMAs on top of their APIs entirely (as Connor Leahy suggests)
Could you point me to where Connor Leahy suggests this? Is it in his podcast?
or to research their safety themselves (as they have already started to do, to a degree, with ARC’s evals of GPT-4, for instance)
To my understanding, the closest ARC Evals gets to LMA-related research is equipping LLMs with tools to do tasks (similar to ChatGPT plugins), as specified here. I think one of the defining features of an LMA is self-delegation, which doesn’t appear to be happening there. The closest they might’ve gotten was a basic prompt chain.
I’m mostly pointing these things out because I agree with Ape in the coat and Seth Herd. I don’t think there’s any actual LMA-specific work going on in this space (beyond some preliminary efforts, including my own), and I think there should be. I am pretty confident that LMA-specific work could be a very large research area, and many areas within it would not otherwise be covered by LLM-specific work.
I have no intention of arguing this point to death. After all, it’s better to do “too much” LMA alignment research than “too little”. But I would definitely suggest reaching out to AGI labs’ safety teams, maybe privately, and at least trying to find out where they stand, rather than just assuming that they don’t do LMA alignment.
Connor Leahy proposed banning LLM-based agents here: https://twitter.com/NPCollapse/status/1678841535243202562. In the context of this proposal (which I agree with), a potentially high-leverage thing to work on now is a detection algorithm for LLM API usage patterns that indicate agent-like usage. This may be difficult, though, if users interleave calls to the OpenAI API, the Anthropic API, and local LLaMA 2 inference within their LMA.
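To make the detection idea concrete, here is a minimal sketch of one possible heuristic, assuming a lab has access to per-key API call logs. Everything here is illustrative: the `APICall` record, the thresholds, and the `looks_agent_like` function are my own assumptions, not an existing system. The heuristic flags self-delegation-style chains, i.e. runs of closely spaced calls where each prompt quotes the previous completion.

```python
from dataclasses import dataclass

@dataclass
class APICall:
    """One logged API request (hypothetical log schema)."""
    timestamp: float  # seconds since epoch
    prompt: str
    completion: str

def looks_agent_like(calls, max_gap_s=5.0, min_chain_len=3):
    """Flag a per-key call sequence as agent-like if it contains a chain of
    at least `min_chain_len` calls where each prompt incorporates the
    previous completion within a short time gap (prompt chaining /
    self-delegation). Thresholds are illustrative, not tuned."""
    chain = 1
    for prev, cur in zip(calls, calls[1:]):
        close_in_time = (cur.timestamp - prev.timestamp) <= max_gap_s
        # Crude "feeds forward" check: the previous completion's prefix
        # appears verbatim in the next prompt.
        feeds_forward = prev.completion.strip() and prev.completion.strip()[:80] in cur.prompt
        if close_in_time and feeds_forward:
            chain += 1
            if chain >= min_chain_len:
                return True
        else:
            chain = 1
    return False
```

A real detector would need far more than this (paraphrase-robust matching, cross-provider correlation, rate features), and as noted above, interleaving providers with local LLaMA 2 inference would defeat any single-provider version of it.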
However, if Meta, EleutherAI, Stability, etc. don’t stop developing ever more powerful “open” LLMs, agents are inevitable anyway.