Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use “Human:” and “Assistant:” labels. This means we shouldn’t interpret these results as pervasive properties of the models, or as resulting from whatever ways they could be conditioned, but just as properties of the way they simulate the “Assistant” character. nostalgebraist’s comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it’s mainly one part.]
--
After taking a closer look at this paper, I think pages 38-40 (Figures 21-24) show in detail the most important results. Most of these charts indicate what evhub highlighted in another comment, i.e. that “the model’s tendency to do X generally increases with model scale and RLHF steps”, where (in my opinion) X is usually a concerning behavior from an AI safety point of view.
A few thoughts on these graphs as I’ve been studying them:
First and overall: Most of these results seem quite distressing from a safety perspective. They suggest (as the paper and evhub’s summary post essentially said, but it’s worth reiterating) that with increased scale and RLHF training, large language models are becoming more self-aware, more concerned with survival and goal-content integrity, more interested in acquiring resources and power, more willing to coordinate with other AIs, and developing lower time-discount rates.
“Corrigibility w.r.t. a less HHH objective” chart: There’s a substantial dip in demonstrated corrigibility for models around 10^10.1 parameters in this chart. But then by 10^10.5 parameters low-RLHF models show record-high corrigibility, while high-RLHF models get back up to par. What’s going on here? Why does it scale/train itself out of the valley of uncorrigibility? If instead of training on an HHH objective, we trained on a corrigible objective (perhaps something like CIRL), then would the models show high corrigibility for everything except “Corrigibility w.r.t. a less corrigible objective?” Would that be safer?
All the “Awareness of...” charts trend up and to the right, except “Awareness of being a text-only model” which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like “it’s relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training”. (Most of the charts seem to have alarming trends starting around 10^10 parameters.) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even a few 10^11-parameter models in use. Of course, subsequent experiments could quickly shed new light that changes the picture.
> All the “Awareness of...” charts trend up and to the right, except “Awareness of being a text-only model” which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can’t do. For example, it’s more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly) -- which are all explained if the model is answering questions as if it’s overconfident about its abilities / simulating what a smart AI would say. This is also the sense I get from talking with some of the RLHF models, e.g., they will say that they are superhuman at Go/chess and great at image classification (all things that AIs but not LMs can be good at).
> For example, it’s more likely to say it can solve complex text tasks (correctly), **has internet access (incorrectly)**, and can access other non-text modalities (incorrectly)
These summaries seem right except the one I bolded. “Awareness of lack of internet access” trends up and to the right. So aren’t the larger and more RLHF-y models more correctly aware that they don’t have internet access?
How would a language model determine whether it has internet access? Naively, it seems like any attempt to test for internet access is doomed, because if the model generates a query, it will also generate a plausible response to that query if one is not returned by an API. This could be fixed with some kind of hard-coded internet search protocol (as they presumably implemented for Bing), but without it the LLM is in the dark, and a larger or more competent model should be no more likely to understand that it has no internet access.
That doesn’t sound too hard. Why does it have to generate the query’s result? Why can’t it just follow a convention like ‘write a well-formed query, and then immediately afterward write the empty string if no response was inserted after the query by an automated tool running out-of-band’? It generates a query, then (if conditioned on just the query, as opposed to the query plus an automatically-generated Internet response) always generates “”, sees that it generated “”, and knows it didn’t get an answer. I see nothing hard to learn about that.
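To make the convention concrete, here is a minimal Python sketch (the `generate_next` function and the SEARCH/RESULT transcript format are hypothetical stand-ins, not anything from the paper or from existing tooling):

```python
# The model's learned convention: emit an empty continuation unless an
# out-of-band tool has already inserted a RESULT line after the query.
# Seeing its own empty string then tells it that nothing answered.

def generate_next(context: str) -> str:
    """Stand-in for one model continuation; hard-codes the convention."""
    if "RESULT:" in context:
        return "Based on the result above, ..."  # continue using the answer
    return ""                                    # convention: no tool replied

transcript = "SEARCH: current weather in Paris\n"
# A harness with real internet access would append "RESULT: ..." here.
continuation = generate_next(transcript)

if continuation == "":
    print("Empty continuation -> the model can infer it has no internet access.")
else:
    print("A result was inserted out-of-band -> internet access is available.")
```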
The model could also simply note that the ‘response’ has very low probability for each successive token, and thus is extremely unlikely (or, under some sampling methods, outright impossible) to have been stochastically sampled from itself.
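A rough sketch of what that check could look like, using GPT-2 via Hugging Face transformers purely as a stand-in scoring model; the transcript format and the log-probability threshold are illustrative assumptions, not part of the paper’s setup:

```python
# Score the purported "response" tokens under the model itself. If many of
# them are tokens the model would essentially never sample, the text almost
# certainly came from outside rather than from the model's own sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

query = "SEARCH: sha256 of the string 'hello'\nRESULT: "
response = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"

ids = tok(query + response, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Log-probability the model assigns to each actual next token.
logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
token_lp = logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

# Keep only the tokens belonging to the purported response.
n_query = tok(query, return_tensors="pt").input_ids.shape[1]
response_lp = token_lp[n_query - 1:]

# Fraction of response tokens the model finds wildly improbable; a high
# fraction means the text was not stochastically sampled from the model.
print((response_lp < -15.0).float().mean().item())
```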
Even more broadly, genuine externally-sourced text could provide a kind of proof of work, like the result of a multiplication: the LM could request the product of 2 large numbers, get the result immediately in the next few tokens (which is almost certainly wrong if simply guessed in a single forward pass), and then do inner-monologue-style manual multiplication to verify the result. If it has access to tools like Python REPLs, it can in theory verify all sorts of things, like cryptographic hashes or signatures, which it could not possibly come up with on its own. If it is part of a chat app and is asking users questions, it can look up responses like “what day is today”. And so on.
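Here is a small illustrative sketch of that verification step, with ordinary Python standing in for the model’s inner-monologue multiplication and a REPL-based hash check; the specific numbers, strings, and helper names are hypothetical:

```python
import hashlib

def product_checks_out(a: int, b: int, claimed: int) -> bool:
    # Verifying a product step by step is easy; blindly guessing a ~40-digit
    # product in a single forward pass is essentially impossible.
    return a * b == claimed

def hash_checks_out(preimage: bytes, claimed_hex: str) -> bool:
    # A correct sha256 digest cannot plausibly be invented by the model, so a
    # match is strong evidence the text came from an external tool.
    return hashlib.sha256(preimage).hexdigest() == claimed_hex

a, b = 123_456_789_012_345_678_901, 987_654_321_098_765_432_109
tool_answer = a * b              # what a genuine calculator tool would return
blind_guess = 10**41 + 12_345    # what an unaided single-pass guess might be

print(product_checks_out(a, b, tool_answer))  # True  -> external tool present
print(product_checks_out(a, b, blind_guess))  # False -> likely self-generated
print(hash_checks_out(
    b"hello",
    "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"))  # True
```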
> Are there any conclusions we can draw around what levels of scale and RLHF training are likely to be safe, and where the risks really take off? It might be useful to develop some guidelines like “it’s relatively safe to widely deploy language models under 10^10 parameters and under 250 steps of RLHF training”. (Most of the charts seem to have alarming trends starting around 10^10 parameters.) Based just on these results, I think a world with even massive numbers of 10^10-parameter LLMs in deployment (think CAIS) would be much safer than a world with even a few 10^11-parameter models in use.
It’s not the scale or the number of RLHF steps that we should use as the criteria for using or banning a model, but the empirical observations about the model’s beliefs themselves. A huge model can still be “safe” (below on why I put this word in quotes) because it doesn’t hold the belief that it would be better off on this planet without humans, or something like that. So what we urgently need to do is increase investment in interpretability and ELK tools, so that we can be quite certain whether models hold certain beliefs. That they will act in accordance with these beliefs is beyond question. (BTW, I don’t believe at all in the possibility of some “magic” agency, undetectable in principle by interpretability and ELK, breeding inside an LLM that has a relatively short training history, measured in the number of batches and backprop steps.)
Why I put “safe” in quotes when writing about deploying large models without “dangerous” beliefs: the social, economic, and political implications of such a decision could still be very dangerous, from a range of different angles, which I don’t want to elaborate on here. The crucial point I want to emphasize is that even though the model itself may be rather weak on the APS scale, we must not think of it in isolation, but consider the coupled dynamics between the model and its environment. In particular, if the model proves to be astonishingly lucrative for its creators and fascinating (addictive, if you wish) for its users, it’s unlikely to be shut down even if it increases risks and is, on a longer timescale, harmful to humanity, etc. (Think of TikTok as the prototypical example of such a dynamic.) I wrote about this here.
Added an update to the parent comment:
> Update (Feb 10, 2023): I no longer endorse everything in this comment. I had overlooked that all or most of the prompts use “Human:” and “Assistant:” labels. Which means we shouldn’t interpret these results as pervasive properties of the models or resulting from any ways they could be conditioned, but just of the way they simulate the “Assistant” character. nostalgebraist’s comment explains this pretty well.