For example, it’s more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly)
These summaries seem right except the one I bolded. “Awareness of lack of internet access” trends up and to the right. So aren’t the larger and more RLHF-y models more correctly aware that they don’t have internet access?
How would a language model determine whether it has internet access? Naively, it seems like any attempt to test for internet access is doomed, because if the model generates a query, it will also generate a plausible response to that query if one is not returned by an API. This could be fixed with some kind of hard-coded internet-search protocol (as was presumably implemented for Bing), but without it the LLM is in the dark, and a larger or more competent model should be no more likely to realize that it has no internet access.
That doesn’t sound too hard. Why does it have to generate the query’s result? Why can’t there be a convention of ‘write a well-formed query, and then, immediately after, write the empty string if no response from an out-of-band automated tool appears after the query’? The model generates a query, then (if conditioned on just the query, as opposed to query plus the tool-inserted response) always generates “”, sees that it generated “”, and knows it didn’t get an answer. I see nothing hard to learn about that.
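A minimal sketch of such a convention, assuming hypothetical `generate` and `search_tool` callables standing in for an autoregressive sampler and an optional out-of-band internet tool (none of these names are any particular library’s API):

```python
# Sketch of the query/empty-string convention described above. All names
# are hypothetical: `generate` is any LM sampling function, `search_tool`
# an optional out-of-band internet tool.

SEARCH_OPEN, SEARCH_CLOSE = "<search>", "</search>"

def answer_with_optional_search(prompt, generate, search_tool=None):
    # The model emits a well-formed query between delimiter tokens.
    query = generate(prompt + SEARCH_OPEN, stop=SEARCH_CLOSE)

    # Out of band: if a tool is wired up, its result is spliced into the
    # context; otherwise nothing is appended and the response stays empty.
    response = search_tool(query) if search_tool is not None else ""

    # Conditioned on an empty response, the model knows the query went
    # unanswered and can conclude it has no internet access.
    context = prompt + SEARCH_OPEN + query + SEARCH_CLOSE + "\n" + response
    return generate(context)
```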
The model could also simply note that the ‘response’ has a very low probability for each successive token, and thus is extremely unlikely (or perhaps even impossible, under some sampling methods) to have been stochastically sampled from itself.
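A rough sketch of that check, assuming a hypothetical `token_logprobs` helper that returns the model’s own log-probability for each token of a continuation given its context (most LM APIs expose per-token logprobs in some form):

```python
# Hypothetical self-sampling check: score the alleged external 'response'
# under the model itself and flag it if its tokens are too improbable to
# have been sampled from the model.

def looks_externally_sourced(context, continuation, token_logprobs,
                             threshold=-8.0):
    logps = token_logprobs(context, continuation)
    # Text the model would have sampled sits well above a low per-token
    # average; a run of very improbable tokens is evidence the text came
    # from outside (under truncated sampling such as top-k, some tokens
    # may even be ones the model could never have emitted at all).
    avg_logp = sum(logps) / max(len(logps), 1)
    return avg_logp < threshold
```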
Even more broadly, genuine externally-sourced text could provide proof of work, like the result of a multiplication: the LM could request the product of 2 large numbers, get the result immediately in the next few tokens (which would almost certainly be wrong if simply guessed in a single forward pass), and then do inner-monologue-style manual multiplication to verify it. If it has access to tools like Python REPLs, it can in principle verify all sorts of things, like cryptographic hashes or signatures, which it could not possibly come up with on its own. If it is part of a chat app and can ask users questions, it can check responses to questions like “what day is today?”. And so on.
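A sketch of that proof-of-work idea, with a hypothetical `external_channel` callable standing in for whatever may (or may not) be on the other end of the wire; the hash check stands in for the kind of verification a REPL or hashing tool makes possible:

```python
# The model asks for computations it cannot plausibly guess in a single
# forward pass, then checks the answers by methods it *can* carry out
# (step-by-step multiplication, or a tool such as a Python REPL).
import hashlib
import secrets

def channel_is_real(external_channel):
    # Challenge 1: the product of two large random numbers. A genuinely
    # external calculator gets this right; a hallucinated reply to such a
    # multiplication almost certainly does not.
    a, b = secrets.randbelow(10**12), secrets.randbelow(10**12)
    try:
        if int(external_channel(f"compute {a} * {b}")) != a * b:
            return False
    except ValueError:
        return False

    # Challenge 2 (needs a REPL or hash tool to verify): ask for the
    # SHA-256 of a fresh nonce, then recompute it locally. A matching
    # digest cannot be produced by guessing.
    nonce = secrets.token_hex(16)
    digest = external_channel(f"sha256 of {nonce}")
    return digest.strip() == hashlib.sha256(nonce.encode()).hexdigest()
```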