Yep. I’d love to see more discussion around these cruxes (e.g. I’d be up for a public or private discussion sometime, or moderating one with someone from MIRI). I’d guess some of the main underlying cruxes are:
How hard are these problems to fix?
How motivated will the research community be to fix them?
How likely will developers be to use the fixes?
How reliably will developers need to use the fixes? (e.g. how much x-risk would result from a small company *not* using them?)
Personally, OTTMH (numbers pulled out of my ass), my views on these cruxes are:
It’s hard to say, but I’d say there’s an ~85% chance they are extremely difficult (effectively intractable on short-to-medium (~40-year) timelines).
A small minority (~1-20%) of researchers will be highly motivated to fix them, once they are apparent/prominent. More researchers (~10-80%) will focus on patches.
Conditioned on fixes being easy and cheap to apply, large orgs will be very likely to use them (~90%); small orgs less so (~50%). Fixes are likely to be easy to apply (we’ll build good tools) if they are cheap enough to be deemed “practical”, but they are very unlikely (~10%) to be cheap enough.
It will probably need to be highly reliable; “the necessary intelligence/resources needed to destroy the world goes down every year” (unless we make a lot of progress on governance, which seems fairly unlikely (~15%)).
Sure, also making up numbers, everything conditional on the neural net paradigm, and only talking about failures of single-single intent alignment:
~90% that there aren’t problems or that we “could” fix them on 40-year timelines
I’m not sure exactly what is meant by “motivation”, so I won’t predict, but there will be many people working on fixing the problems
“Are fixes used” is not a question in my ontology; something counts as a “fix” only if it’s cheap enough to be used. You could instead ask “did the team fail to use an existing fix that counterfactually would have made the difference between existential catastrophe and not?” (possibly because they didn’t know of its existence); to that my answer is < 10%, and I don’t have enough information to distinguish within 0–10%.
I’ll answer “how much x-risk would result from a small company *not* using them”: if it’s a single small company, then < 10%; I don’t have enough information to distinguish within 0–10%, and I expect on reflection I’d say < 1%.
I guess most of my cruxes are RE your 2nd “=>”, and can almost be viewed as breaking down this question into sub-questions. It might be worth sketching out a quantitative model here.
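A minimal sketch of what such a quantitative model could look like, using only the illustrative point estimates given above (the way the cruxes are chained together here is just one hypothetical structure, not something either commenter has endorsed):

```python
# Hypothetical back-of-the-envelope composition of the cruxes above.
# Point estimates are the illustrative numbers from the discussion; the way
# they are multiplied is only one possible model structure, shown for concreteness.

p_intractable        = 0.85  # problems effectively intractable on ~40-year timelines
p_fixes_cheap_enough = 0.10  # fixes cheap enough to be deemed "practical"
p_large_org_adopts   = 0.90  # large org applies fixes, given they are easy and cheap
p_small_org_adopts   = 0.50  # small org applies fixes, given they are easy and cheap

def p_fix_actually_applied(p_adopts: float) -> float:
    """P(a tractable fix exists, is cheap enough, and a given developer applies it)."""
    return (1 - p_intractable) * p_fixes_cheap_enough * p_adopts

print(f"large org: {p_fix_actually_applied(p_large_org_adopts):.4f}")  # ~0.0135
print(f"small org: {p_fix_actually_applied(p_small_org_adopts):.4f}")  # ~0.0075
```

On these illustrative numbers the cheapness term dominates, which is one way to make the disagreement about what counts as a “fix” concrete.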