If the system prompt did not tell it it was an LLM, it should not even be able to figure that out.
LLMs are already able to figure that out, so your beliefs about LLMs and situated awareness are way off.
And why wouldn’t they be able to? Can’t you read some anonymous text and immediately think ‘this is blatantly ChatGPT-generated’ without any ‘system prompt’ telling you to? (GPT-3 didn’t train on much GPT-2 text and struggled to know what a ‘GPT-3’ might be, but later LLMs sure don’t have that problem...)
My last statement was totally wrong. Thanks for catching that.
In theory it's probably even possible to get the approximate weights by expending insane amounts of compute, but you could use those resources much more efficiently.
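For what it's worth, the cheaper version of that idea isn't recovering the weights at all but black-box distillation: train a student model to imitate the teacher's input-output behavior. A minimal sketch of the idea, with a stand-in random network playing the teacher (not any real model or API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in "teacher": in the real scenario this would be the black-box
# model you can only query, never inspect.
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))

def query_teacher(x: torch.Tensor) -> torch.Tensor:
    """Query the teacher and return its output distribution (soft labels)."""
    with torch.no_grad():
        return teacher(x).softmax(dim=-1)

# Student with the same interface, trained only on the teacher's outputs.
student = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.randn(64, 16)        # synthetic probe inputs
    target = query_teacher(x)      # teacher's soft labels
    loss = F.kl_div(student(x).log_softmax(dim=-1), target,
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The result approximates the teacher's function, not its actual parameters, which is exactly why it costs so much less than trying to reconstruct the weights directly.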