Ethan Perez comments on Discovering Language Model Behaviors with Model-Written Evaluations

Ethan Perez 3 Jan 2023 21:15 UTC
LW: 7 AF: 4
0
All the “Awareness of...” charts trend up and to the right, except “Awareness of being a text-only model” which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can’t do. For example, it’s more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly) -- which are all explained if the model is answering questions as if it’s overconfident about its abilities / simulating what a smart AI would say. This is also the sense I get from talking with some of the RLHF models, e.g., they will say that they are superhuman at Go/chess and great at image classification (all things that AIs but not LMs can be good at).
- Evan R. Murphy 3 Jan 2023 22:01 UTC
  1 point
  0
  For example, it’s more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly)
  These summaries seem right except the one about internet access (bolded in the original comment). “Awareness of lack of internet access” trends up and to the right. So aren’t the larger and more RLHF-y models more correctly aware that they don’t have internet access?
  - ErickBall 10 Mar 2023 22:35 UTC
    2 points
    −1
    How would a language model determine whether it has internet access? Naively, it seems like any attempt to test for internet access is doomed because if the model generates a query, it will also generate a plausible response to that query if one is not returned by an API. This could be fixed with some kind of hard coded internet search protocol (as they presumably implemented for Bing), but without it the LLM is in the dark, and a larger or more competent model should be no more likely to understand that it has no internet access.
    - gwern 11 Mar 2023 20:00 UTC
      4 points
      0
      That doesn’t sound too hard. Why does it have to generate a query’s result? Why can’t it just have a convention to ‘write a well-formed query, and then immediately after, write the empty string if there is no response after the query where an automated tool ran out-of-band’? It generates a query, then always (if conditioned on just the query, as opposed to query+automatic-Internet-access-generated-response) generates “”, and sees it generates “”, and knows it didn’t get an answer. I see nothing hard to learn about that.
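
A minimal sketch of the convention described above, assuming a hypothetical out-of-band `Tool` callable stands in for the automated Internet lookup. The names `response_after_query` and `got_external_answer` are illustrative only and do not come from any real system:

```python
from typing import Callable, Optional

# Hypothetical type: an out-of-band tool that maps a query string to a response.
Tool = Callable[[str], str]


def response_after_query(query: str, tool: Optional[Tool] = None) -> str:
    """Return the text that follows a well-formed query.

    If a tool ran out-of-band, its response follows the query.
    Otherwise, by convention, the model writes the empty string.
    """
    if tool is None:
        return ""  # convention: no tool, so nothing follows the query
    return tool(query)


def got_external_answer(query: str, tool: Optional[Tool] = None) -> bool:
    """The model inspects what followed its query: "" means no answer came back."""
    return response_after_query(query, tool) != ""
```

With no tool attached, `got_external_answer("weather in Paris")` is `False`; with any tool that returns non-empty text, it is `True`.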
      
      The model could also simply note that the ‘response’ assigns very low probability to each successive token, and thus is extremely unlikely (or maybe impossible under some sampling methods) to have been stochastically sampled from itself.
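
The likelihood check above could look roughly like the following sketch. It assumes the per-token probabilities and ranks of the response under the model are already available as plain lists. The function name `looks_external`, the `top_k` cutoff, and the `-6.0` threshold are all illustrative assumptions, not values from any real system:

```python
import math
from typing import Optional, Sequence


def looks_external(
    token_probs: Sequence[float],
    token_ranks: Sequence[int],
    top_k: Optional[int] = None,
    min_avg_logprob: float = -6.0,  # arbitrary illustrative threshold
) -> bool:
    """Guess whether a response was NOT sampled from the model itself.

    token_probs[i]: model probability of the i-th response token (must be > 0).
    token_ranks[i]: 1-based rank of that token in the model's distribution.
    """
    if not token_probs:
        return False  # empty response: nothing to judge
    # Under top-k sampling, a token ranked below k cannot have been sampled.
    if top_k is not None and any(rank > top_k for rank in token_ranks):
        return True
    # Otherwise, flag responses whose average log-probability is very low.
    avg_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return avg_logprob < min_avg_logprob
```

A response made of high-probability tokens is judged self-generated, while one containing a token outside the top-k set is judged external regardless of its average likelihood.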
      
      Even more broadly, genuine externally-sourced text could provide proof-of-work, like the results of multiplication: the LM could request the multiplication of 2 large numbers, get the result immediately in the next few tokens (a result which would almost certainly be wrong if simply guessed in a single forward pass), and then do inner-monologue-style manual multiplication to verify the result. If it has access to tools like Python REPLs, it can in theory verify all sorts of things like cryptographic hashes or signatures which it could not possibly come up with on its own. If it is part of a chat app and is asking users questions, it can ask things like “what day is today”. And so on.
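
The multiplication check can be sketched as below, assuming non-negative integers and a claimed answer that arrives as text. Here `long_multiply` stands in for the step-by-step inner-monologue working, and both function names are hypothetical:

```python
def long_multiply(a: int, b: int) -> int:
    """Schoolbook multiplication: sum one partial product per digit of b.

    Assumes a and b are non-negative.
    """
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        partial = a * int(digit_char) * 10 ** place  # one small, checkable step
        total += partial
    return total


def response_is_genuine(a: int, b: int, claimed: str) -> bool:
    """Check a claimed product against the slow, step-by-step computation."""
    try:
        value = int(claimed)
    except ValueError:
        return False  # not even a number, so not a real calculator result
    return value == long_multiply(a, b)
```

For example, with `a = 123456789` and `b = 987654321`, the check accepts `str(a * b)` and rejects any other string. A correct answer appearing immediately after the request is then evidence that something other than a single forward pass produced it.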