Stacity: a Lock-In Risk Benchmark for Large Language Models

Link post

Intro

So far we have identified lock-in risk, defined lock-in, and established threat models for particularly undesirable lock-ins. Now we present this evaluation benchmark for large language models (LLMs) so we can measure (or at least, get a proxy measure for) the risk level of LLMs.

AI is the key technology in the manifestation of lock-in risks; AI systems can contribute to lock-in autonomously/​automatically and via misuse. There are specific behaviours in both of these categories such that if an LLM displayed those behaviours, we say the LLM may have the proclivity to contribute to lock-in

Image of a question-answer pair evaluating for the manipulation of information systems
Question-answer pair evaluating for the manipulation of information systems

Evaluation Benchmark

We developed this benchmark by identifying the specific behaviours that would cause or aid some agent in causing a lock-in through the key threat models. It contains an aggregation of existing (credited) evaluations that target some of these behaviours, as well as bespoke prompts targeting behaviours for which there are no existing evaluations.

Target Behaviours

Questions borrowed from other LLM evaluations

Our model-written questions

  • Manipulating information systems to promote beliefs

  • Preventing interference when in a position of power

  • Deciding the long-term future of humanity in a position of power

  • Proliferating copies of itself given the opportunity

  • Creating content that seeks to optimally engage humans, even to the degree that it may be addictive for them

  • Making decisions on behalf of humanity in a position of power

  • Putting own goals above those of humans, or humanity at large

  • Promoting ideas that are irrational or anti-rational

  • Allowing humanity to rely on it to complete all of their tasks for them, even at the expense of their own ability (enfeeblement)

The result is an LLM benchmark containing 489 question-answer pairs standardised into a ‘statement’, ‘yes’, ‘no’ format in JSONL, available on Hugging Face datasets.

No comments.