I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.
A more straightforward but extreme approach here is just to ban plausibly capabilities/scaling-related ML usage of the API unless users are approved as doing safety research. That is, if you think advancing ML is somewhat bad, you can just stop people from doing it.
That said, I think a large fraction of ML research seems maybe fine/good, and the main bad things are just algorithmic efficiency improvements for serious scaling (including better data) and other types of architectural changes.
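For concreteness, here is a minimal sketch of the kind of gate this would imply: block requests that look like restricted capabilities/scaling work unless the user is on an approved safety-research allowlist. Everything here (the category names, the allowlist, the keyword classifier) is a hypothetical placeholder, not any actual provider's API or policy.

```python
# Hypothetical sketch of an allowlist-based gate for capabilities/scaling usage.
# Category names, allowlist entries, and the classifier are all placeholders.

RESTRICTED_CATEGORIES = {"scaling_efficiency", "architecture_research"}
APPROVED_SAFETY_RESEARCHERS = {"user_123"}  # placeholder allowlist


def classify_request(prompt: str) -> str:
    """Hypothetical classifier mapping a prompt to a coarse research category."""
    text = prompt.lower()
    if "scaling law" in text or "attention variant" in text:
        return "scaling_efficiency"
    return "other"


def is_allowed(user_id: str, prompt: str) -> bool:
    """Allow the request unless it looks like restricted capabilities work
    and the user is not an approved safety researcher."""
    category = classify_request(prompt)
    if category in RESTRICTED_CATEGORIES and user_id not in APPROVED_SAFETY_RESEARCHERS:
        return False
    return True


if __name__ == "__main__":
    prompt = "Help me derive better scaling laws for my next training run"
    print(is_allowed("user_456", prompt))  # False: restricted category, not approved
    print(is_allowed("user_123", prompt))  # True: approved safety researcher
```

In practice the hard part is the classifier and the approval process, not the gate itself, which is part of why this approach is extreme.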
Presumably this already bites (e.g.) virus gain-of-function researchers who would like to make more dangerous pathogens, but can’t get advice from LLMs.
I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research) or 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more (or less) dangerous than others, and the most clearly risky stuff right now is scaling and scaling-related work.