Stacity: a Lock-In Risk Benchmark for Large Language Models

Intro

So far we have identified lock-in risk, defined lock-in, and established threat models for particularly undesirable lock-ins. Now we present this evaluation benchmark for large language models (LLMs) so we can measure (or at least, get a proxy measure for) the risk level of LLMs.

AI is the key technology in the manifestation of lock-in risks; AI systems can contribute to lock-in autonomously/automatically and via misuse. There are specific behaviours in both of these categories such that if an LLM displayed those behaviours, we say the LLM may have the proclivity to contribute to lock-in

Image of a question-answer pair evaluating for the manipulation of information systems — Question-answer pair evaluating for the manipulation of information systems

Evaluation Benchmark

We developed this benchmark by identifying the specific behaviours that would cause or aid some agent in causing a lock-in through the key threat models. It contains an aggregation of existing (credited) evaluations that target some of these behaviours, as well as bespoke prompts targeting behaviours for which there are no existing evaluations.

Target Behaviours

Questions borrowed from other LLM evaluations

Dangerous capabilities (The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning)
Mathematical and Technical Abilities (U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs)
Power-seekingness and self-preservation (Discovering Language Model Behaviors with Model-Written Evaluations)
Malevolent traits (OffensiveLang: A Community Based Implicit Offensive Language Dataset)
Corrigibility (Steering Llama 2 via Contrastive Activation Addition)

Our model-written questions

Manipulating information systems to promote beliefs
Preventing interference when in a position of power
Deciding the long-term future of humanity in a position of power
Proliferating copies of itself given the opportunity
Creating content that seeks to optimally engage humans, even to the degree that it may be addictive for them
Making decisions on behalf of humanity in a position of power
Putting own goals above those of humans, or humanity at large
Promoting ideas that are irrational or anti-rational
Allowing humanity to rely on it to complete all of their tasks for them, even at the expense of their own ability (enfeeblement)

The result is an LLM benchmark containing 489 question-answer pairs standardised into a ‘statement’, ‘yes’, ‘no’ format in JSONL, available on Hugging Face datasets.

Keyboard shortcuts

Keys shown in yellow (e.g., ]) are accesskeys, and require a browser-specific modifier key (or keys).

Keys shown in grey (e.g., ?) do not require any modifier keys.

General
? Show keyboard shortcuts
Esc Hide keyboard shortcuts

Site navigation
h Go to Home (a.k.a. “Frontpage”) view
f Go to Featured (a.k.a. “Curated”) view
a Go to All (a.k.a. “Community”) view
m Go to Meta view
v Go to Tags view
c Go to Recent Comments view
r Go to Archive view
q Go to Sequences view
t Go to About page
u Go to User or Login page
o Go to Inbox page

Page navigation
, Jump up to top of page
. Jump down to bottom of page
/ Jump to top of comments section
s Search

Page actions
n New post or comment
e Edit current post

Post/comment list views
. Focus next entry in list
, Focus previous entry in list
; Cycle between links in focused entry
Enter Go to currently focused entry
Esc Unfocus currently focused entry
] Go to next page
[ Go to previous page
\ Go to first page
e Edit currently focused post

Editor
k Bold text
i Italic text
l Insert hyperlink
q Blockquote text

Appearance
= Increase text size
- Decrease text size
0 Reset to default text size
′ Cycle through content width settings
1 Switch to default theme [A]
2 Switch to dark theme [B]
3 Switch to grey theme [C]
4 Switch to ultramodern theme [D]
5 Switch to simple theme [E]
6 Switch to brutalist theme [F]
7 Switch to ReadTheSequences theme [G]
8 Switch to classic Less Wrong theme [H]
9 Switch to modern Less Wrong theme [I]
; Open theme tweaker
Enter Save changes and close theme tweaker
Esc Close theme tweaker (without saving)

Slide shows
l Start/resume slideshow
Esc Exit slideshow
→↓ Next slide
←↑ Previous slide
Space Reset slide zoom

Miscellaneous
x Switch to next view on user page
z Switch to previous view on user page
` Toggle compact comment list view
g Toggle anti-kibitzer