Slightly Aspirational AGI Safety research landscape
This is an overview of current subfields in empirical AI safety, combined with research subfields I would like to see but which do not currently exist or are very small. I think this list is probably worse than this recent review, but making it was useful for reminding myself how big this field is.

Interpretability / understanding model internals
- Circuit interpretability
- Superposition study
- Activation engineering (see the sketch below this list)
- Developmental interpretability
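
To make the "Activation engineering" item a bit more concrete, here is a minimal steering-vector sketch. Everything specific in it is an assumption for illustration: the choice of GPT-2, the layer index, the scale factor, and the toy "sentiment" prompts.

```python
# A minimal steering-vector sketch (model, layer, and scale are arbitrary choices).
# The steering vector is the difference of residual-stream activations between two
# contrasting prompts; we add it back in at one layer while generating.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0  # which block to steer and how hard (knobs to tune)

def resid_at(prompt: str) -> torch.Tensor:
    # Residual-stream activation at the last token, after block LAYER.
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0, -1]  # hidden_states[0] is the embedding output

steer = resid_at("I love this") - resid_at("I hate this")  # toy "sentiment" direction

def hook(module, inputs, output):
    # Add the steering vector to the block's output hidden states.
    if isinstance(output, tuple):
        return (output[0] + SCALE * steer,) + output[1:]
    return output + SCALE * steer

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("The movie was", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```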

Understanding deep learning
- Scaling laws / forecasting (see the sketch below this list)
- Dangerous capability evaluations (directly relevant to particular risks, e.g., biosecurity, self-proliferation)
- Other capability evaluations / benchmarking (useful for knowing how smart AIs are, informing forecasts), including persona evals
- Understanding normal but poorly understood things, like in-context learning
- Understanding weird phenomena in deep learning, like this paper
- Understanding how various HHH fine-tuning techniques work
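
For the "Scaling laws / forecasting" item, here is a minimal sketch of the core move: fit a simple power law to loss versus compute in log-log space and extrapolate. All of the numbers below are synthetic, purely for illustration.

```python
# Fit loss ~ a * C^b to (synthetic) training runs and extrapolate.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])  # training FLOPs (synthetic)
loss = np.array([4.29, 3.60, 3.02, 2.53, 2.13])     # eval loss (synthetic)

# Linear fit in log-log space: log(loss) = b * log(C) + log(a)
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"fit: loss ~ {a:.1f} * C^({b:.3f})")

# Extrapolate to a larger compute budget (the "forecasting" part).
print("forecast at 1e24 FLOPs:", a * 1e24 ** b)
```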

AI Control
- General supervision of untrusted models (using human feedback efficiently, using weaker models for supervision, schemes for getting useful work done given various Control constraints); see the sketch below this list
- Unlearning
- Steganography prevention / CoT faithfulness
- Censorship study (how censoring AI models affects performance, and similar questions)
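
To illustrate the kind of scheme studied under "General supervision of untrusted models", here is a minimal sketch of a trusted-monitoring protocol: an untrusted model does the work, a weaker trusted model scores how suspicious each output is, and the most suspicious fraction goes to a limited human audit budget. Every function below is a made-up stand-in; only the shape of the pipeline is the point.

```python
# Toy trusted-monitoring control protocol. All three "models" are stand-ins.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str

def untrusted_solve(task: Task) -> str:
    # Stand-in for a strong but untrusted model doing the real work.
    return f"solution to: {task.prompt}"

def trusted_suspicion(task: Task, solution: str) -> float:
    # Stand-in for a weaker, trusted monitor scoring suspiciousness in [0, 1].
    return (len(solution) % 10) / 10

def human_audit(task: Task, solution: str) -> bool:
    # Stand-in for an expensive human check; returns True if the solution looks safe.
    return True

def run_protocol(tasks, audit_budget=0.02):
    solutions = [(t, untrusted_solve(t)) for t in tasks]
    # Audit only the most suspicious outputs, up to the budget.
    scored = sorted(solutions, key=lambda ts: trusted_suspicion(*ts), reverse=True)
    n_audit = max(1, int(audit_budget * len(tasks)))
    accepted = []
    for i, (task, sol) in enumerate(scored):
        if i < n_audit and not human_audit(task, sol):
            continue  # caught by the audit: discard (or escalate)
        accepted.append((task, sol))
    return accepted

tasks = [Task(f"task {i}") for i in range(100)]
print(len(run_protocol(tasks)), "solutions accepted")
```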

Model organisms of misalignment
- Demonstrations of deceptive alignment and sycophancy / reward hacking
- Trojans
- Alignment evaluations
- Capability elicitation

Scaling / scalable oversight
- RLHF / RLAIF
- Debate, market making, imitative generalization, etc.
- Reward hacking and sycophancy (potentially overoptimization, but I’m confused by much of it)
- Weak-to-strong generalization
- General ideas: factored cognition, iterated distillation and amplification, recursive reward modeling

Robustness
- Anomaly detection
- Understanding distribution shifts and generalization
- User jailbreaking
- Adversarial attacks / training (generally), including latent adversarial training; see the sketch below this list
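
To show what the "latent" in latent adversarial training refers to, here is a toy sketch: find a worst-case perturbation of an intermediate activation with a single gradient step, then train the model to classify correctly under that perturbation. The architecture, data, and perturbation size are all invented for illustration.

```python
# Toy latent adversarial training loop on a synthetic classification task.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
head = nn.Linear(32, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
EPS = 0.5  # size of the latent perturbation (a knob)

x = torch.randn(256, 10)
y = (x.sum(dim=1) > 0).long()  # synthetic labels

for step in range(100):
    # Inner step: one gradient-sign step to find an adversarial latent perturbation.
    z = encoder(x)
    delta = torch.zeros_like(z, requires_grad=True)
    loss_fn(head(z.detach() + delta), y).backward()
    with torch.no_grad():
        delta = EPS * delta.grad.sign()
    # Outer step: train the model to still get the answer right under that perturbation.
    opt.zero_grad()
    loss = loss_fn(head(encoder(x) + delta), y)
    loss.backward()
    opt.step()

print("final robust loss:", loss.item())
```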

AI Security
- Extracting info about models or their training data
- Attacking LLM applications, self-replicating worms

Multi-agent safety
- Understanding AI in conflict situations
- Cascading failures
- Understanding optimization in multi-agent situations
- Attacks vs. defenses for various problems

Unsorted / grab bag
- Watermarking and AI generation detection
- Honesty (model says what it believes)
- Truthfulness (only say true things, aka accuracy improvement)
- Uncertainty quantification / calibration (see the sketch below this list)
- Landscape study (anticipating and responding to risks from new AI paradigms like compound AI systems or self-play)
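
As a small concrete example for the "Uncertainty quantification / calibration" item, here is a minimal sketch of one standard calibration metric, expected calibration error (ECE), computed on synthetic predictions from a deliberately overconfident predictor.

```python
# Expected calibration error: bin predictions by confidence, then compare the
# average confidence to the empirical accuracy in each bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight the gap by the bin's share of samples
    return ece

# Synthetic example: a predictor whose accuracy runs ~15 points below its confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = rng.random(1000) < conf - 0.15
print("ECE:", expected_calibration_error(conf, correct))
```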

Don’t quite make the list:
- Whose values? Figuring out how to aggregate preferences in RLHF. This seems like it’s almost certainly not a catastrophe-relevant safety problem, and my weak guess is that it makes other alignment properties harder to get (e.g., incentivizes sycophancy and makes jailbreaks easier by causing the model’s goals to be more context/user dependent). This work seems generally net positive to me, but it’s not relevant to catastrophic risks and thus is relatively low priority.
- Fine-tuning that calls itself “alignment”. I think it’s super lame that people are co-opting language from AI safety. Some of this work may actually be useful, e.g., by finding ways to mitigate jailbreaks, but it’s mostly low quality.

I think mechanistic anomaly detection (mostly ARC, but also Redwood and some forthcoming work) is importantly different from robustness (though clearly related).