Sure, also making up numbers, everything conditional on the neural net paradigm, and only talking about failures of single-single intent alignment:
~90% that there aren’t problems or we “could” fix them on 40-year timelines
I’m not sure exactly what is meant by “motivation”, so I won’t predict that, but there will be many people working on fixing the problems
“Are fixes used” is not a question in my ontology; something counts as a “fix” only if it’s cheap enough to be used. You could instead ask “did the team fail to use an existing fix that counterfactually would have made the difference between existential catastrophe and not” (possibly because they didn’t know of its existence); to that, < 10%, and I don’t have enough information to distinguish within the 0-10% range.
I’ll answer “how much x-risk would result from a small company *not* using them”: if it’s a single small company, then < 10%, and I don’t have enough information to distinguish within the 0-10% range, though I expect on reflection I’d say < 1%.