gwern comments on Ideas for benchmarking LLM creativity

gwern 18 Dec 2024 19:02 UTC
6 points
1
I am familiar with Schmidhuber’s ideas, yes. But I had to come up with these alternatives because his would not work here, and I’m not sure they work anywhere.

His compression acceleration metric isn’t too useful here, and most forms of ‘compression’ (or anything involving a likelihood) are not helpful here at all, because you don’t have access to anything like that in most cases. For example, ChatGPT doesn’t give you the full logits (actually, I’m not sure if they give it at all—I recall OA saying they were planning to expose them again in a very limited fashion but not if they actually did), and tuned models don’t have logits, they have value estimates, which used to be log-likelihood-related logits but no longer are.

Any diversity/creativity benchmark which can’t be run on ChatGPT & Claude & Gemini is dead on arrival and of no interest to me. We don’t need numbers from the open-weights models, we need numbers on the models being used the most at the frontier and generating the most tokens worldwide that you’ll be reading forever—the closed models, which do not give you such things as logits or whitebox finetuning etc. If it can’t be done by calling a standard text completion API, then I ignored it.

I am also doubtful that the compression metrics really work at finite samples or capture what we mean by creativity in generative models. Like all of Schmidhuber’s work, he has never gotten it working on more than toy problems (if even that), and when I look at actual compression losses on text, like gzip passages or the OA Playground highlighting words by their log likelihood, the high perplexity tokens or passages bear little resemblance to what I would consider ‘interesting’ or ‘surprising’. (This is related to the question of ‘if predicting tokens induces intelligence, and LLMs are now superhuman at predicting random Internet tokens, why are LLMs still not superhumanly intelligent?’) People also try running compression metrics on programming language source code, and you get results like “Javascript is the best programming language”, which is… counterintuitive, to say the least. So I am unsure his compression metrics would work without a lot of revising, while my proposed metrics seem a lot less risky and to map more directly onto what creative thinkers want out of generative models.
- wassname 20 Dec 2024 0:35 UTC
  1 point
  0
  Parent
  I pretty much agree, in my experiments I haven’t managed to get a metric that scales how I expect it too for example when using adapter fine-tuning to “learn” a text and looking at the percent improvement in perplexity, the document openai_board_ann appeared more novel than wikipedia on LK-99, but I would expect it to be the other way round since the LK-99 observations are much more novel and dense than a corporate announcement that is designed to be vague.
  
  However I would point out that gzip is not a good example of a compression scheme for novelty, as 1) it’s a compression scheme that roughly about word duplication. A language model represents a much more sophisticated compression scheme that is closer to our understanding the text. If we want to measure novelty to us, then we probably want a compression that is similar to how our brain compresses information into memory. That way, something surprising to us, is also hard to compress. And I’d also point out that 2) gzip cannot learn (except in a very basic sense of increased context), so it cannot beat the noisy TV problem.
  
  Playground highlighting words by their log likelihood, the high perplexity tokens or passages bear little resemblance to what I would consider ‘interesting’ or ‘surprising’.
  
  I agree, but it doesn’t learn so it doesn’t get past the noisy TV problem either, but that is central to Schmidhuber idea. If you are not familiar, the noisy TV problem is this:
  
  “agents are rewarded for visiting regions of the state space that they have not previously occupied. If, however, a particular state transition is impossible to predict, it will trap a curious agent (Burda et al., 2019b; Schmidhuber, 1991a). This is referred to as the noisy TV problem (e.g. (Burda et al., 2019b; Schmidhuber, 1991a)), the etymology being that a naively curious agent could dwell on the unpredictability of a noisy TV screen” from “How to Stay Curious while avoiding Noisy TVs using Aleatoric Uncertainty Estimation”
  
  So I am unsure his compression metrics would work without a lot of revising, while my proposed metrics seem a lot less risky and to map more directly onto what creative thinkers want out of generative models.
  
  I agree, this is true of most of Schmidhuber ideas. Often he does even produce a toy model for years, which means the ideas are generally not very useful. I do like this one, and it has led to some implementations in RL.
  
  I do agree, perplexity doesn’t seem like a great place to start, and your ideas seem like a better way to measure.
- wassname 19 Dec 2024 1:45 UTC
  1 point
  0
  Parent
  While I broadly agree, I don’t think it’s completely dead, just mostly dead in the water. If an eval is mandated by law, then it will be run even it required logprobs. There are some libraries like nnsight that try to make this easier for trusted partners to run logprob evals remotely. And there might be privacy preserving API’s at some point.
  
  I do agree that commercial companies will never again open up raw logprobs to the public as it allows easy behaviour cloning, which OpenAI experienced with all the GPT4 students.
  - gwern 19 Dec 2024 19:20 UTC
    4 points
    1
    Parent
    
    If an eval is mandated by law, then it will be run even it required logprobs.
    
    I won’t hold my breath.
    
    I think commercial companies often would open up raw logprobs, but there’s not much demand, the logprobs are not really logprobs, and the problem is the leading model owners won’t do so, and those are the important ones to benchmark. I have little interest in the creativity of random little Llama finetunes no one uses.
    - wassname 20 Dec 2024 0:25 UTC
      1 point
      0
      Parent
      True, I should have said leading commercial companies
- anaguma 18 Dec 2024 23:04 UTC
  1 point
  0
  Parent
  I believe that the OAI api does offer the logprobs in the chat completions api (https://platform.openai.com/docs/api-reference/chat/create). Not sure about Anthropic.
  - wassname 18 Dec 2024 23:37 UTC
    1 point
    0
    Parent
    
    If true, returns the log probabilities of each output token returned in the content of message.
    
    It seems like it only returns the logprobs of the chosen message, not of a counterfactual message. So you couldn’t get the probabilities of the correct answer, only the output answer. This makes sense as the less information they offer, the harder it is for a competitor to behaviour clone their confidential model.