Oliver Daniels comments on Oliver Daniels-Koch’s Shortform

Oliver Daniels 17 Mar 2024 19:01 UTC
1 point
0
(from conversation with Erik Jenner) Roughly 3 classes of applications of measurement tampering detection (MTD):
1. MTD all the way down
  1. Come up with a bunch of measurements of things going well (e.g. humans self-report being very happy, news stories are good). Use GPT-N to predict these measurements and to detect measurement tampering, conditioned on the policy's proposed actions. Reward actions that GPT-N predicts will increase the measures of things going well without tampering.
2. Other Scalable Oversight + MTD as reward function / side constraint
  1. Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the “primary” training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for, and negatively rewarding, actions flagged as tampering.
3. Other Scalable Oversight + MTD as extra safety check
  1. Same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering is a precautionary step (i.e. if you detect measurement tampering, you pause/stop).
(There are some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s).)
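A rough sketch of how the three classes differ in where the tampering signal enters, purely illustrative. Everything named here is an assumption for the sake of the example: `Prediction`, the `predictor` callable standing in for GPT-N, `oversight_score` standing in for the scalable oversight signal, and the threshold and penalty values. None of it is a real API or a proposed implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Prediction:
    """Hypothetical output of the predictor (the GPT-N stand-in) for one action."""
    measurements: Sequence[float]  # predicted "things going well" scores
    tamper_prob: float             # predicted probability of measurement tampering


# Hypothetical interfaces: an action is just a string here.
Predictor = Callable[[str], Prediction]
OversightScore = Callable[[str], float]


def is_flagged(pred: Prediction, threshold: float) -> bool:
    """Treat an action as tampering if the predicted probability crosses the threshold."""
    return pred.tamper_prob >= threshold


def mtd_reward(action: str, predictor: Predictor,
               threshold: float = 0.5, penalty: float = 1.0) -> float:
    """Class 1: reward comes from predicted measurements, unless tampering is flagged."""
    pred = predictor(action)
    if is_flagged(pred, threshold):
        return -penalty
    if not pred.measurements:
        return 0.0
    return sum(pred.measurements) / len(pred.measurements)


def oversight_reward_with_mtd(action: str, oversight_score: OversightScore,
                              predictor: Predictor,
                              threshold: float = 0.5, penalty: float = 1.0) -> float:
    """Class 2: primary reward from scalable oversight, MTD as a side constraint."""
    if is_flagged(predictor(action), threshold):
        return -penalty
    return oversight_score(action)


def should_halt(action: str, predictor: Predictor, threshold: float = 0.5) -> bool:
    """Class 3: MTD is not part of the reward; a flag just triggers pause/stop."""
    return is_flagged(predictor(action), threshold)
```

With a toy predictor that assigns a high tampering probability to some action, `mtd_reward` and `oversight_reward_with_mtd` both return the negative penalty for it, while `should_halt` returns `True` and leaves the reward untouched.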
- ryan_greenblatt 18 Mar 2024 4:23 UTC
  3 points
  0
  As far as (1) and similar things go, you might find the narrow elicitation appendix in the ELK report, and the next several appendices (prior to generative model errors), worthwhile.