Suppose by strong induction that M always gives the right answer immediately for all sets of size less than n
Pretty sure debate can also access R if you make this strong of an assumption—ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.
I think the sort of claim that’s actually useful is going to look more like ‘we can guarantee that we’ll get a reasonable training signal for problems in [some class]’
Ie, suppose M gives correct answers only some fraction of the time. Are those correct answers actually going to get lower loss? As n gets large, the chance that M has made a mistake somewhere in the recursion chain gets large, and the correct answer is not necessarily the one that gets rewarded.
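To make the worry concrete, here's a quick back-of-the-envelope sketch (toy numbers, nothing from the proof itself): if each of the n recursive calls errs independently with some small probability, the chance that the whole chain is mistake-free collapses fast.

```python
# Toy illustration (my own sketch, not from the proof): if each of the n
# recursive calls to M independently makes a mistake with probability eps,
# how often is the whole chain error-free, so that the correct final answer
# is the one that actually ends up rewarded?

def p_chain_correct(eps: float, n: int) -> float:
    """Probability that none of the n recursive calls makes a mistake."""
    return (1.0 - eps) ** n

for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: P(chain error-free) = {p_chain_correct(0.01, n):.2e}")

# Even at a 1% per-call error rate, an error-free chain is vanishingly rare by
# n = 1,000, so the answer the recursion produces stops tracking the correct
# answer, and the loss need not favor correctness.
```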
Pretty sure debate can also access R if you make this strong of an assumption—ie assume that debaters give correct answers for all questions that can be answered with a debate tree of size <n.
First, my full exploration of what’s going on with different alignment proposals and complexity classes can be found here, so I’d recommend just checking that out rather than relying on the mini proof sketch I gave here.
Second, in terms of directly addressing what you’re saying, I tried doing a proof by induction to get debate to RE and it doesn’t work. The problem is that you can only get guarantees for trees that the human can judge, which means they have to be polynomial in length (though if you relax that assumption then you might be able to do better). Also, it’s worth noting that the text that you’re quoting isn’t actually an assumption of the proof in any way—it’s just the inductive hypothesis in a proof by induction.
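To spell out where the induction stalls (rough notation of mine, with p standing in for however much transcript the human can actually check):

```latex
% Rough formalization of the constraint (my notation): let p be a polynomial
% bound on how much debate transcript the human judge can check for a
% question q. The inductive hypothesis covers questions with debate trees of
% size < n, but the step to size-n trees needs the judge to verify the size-n
% decomposition, which is only guaranteed while n <= p(|q|). So the guarantee
% you end up with is:
\[
  \forall q:\; \mathrm{treesize}(q) \le p(|q|) \;\Rightarrow\; \text{debaters answer } q \text{ correctly},
\]
% i.e. polynomial-size trees only, not the unbounded trees you would need in
% order to reach R or RE.
```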
I think the sort of claim that’s actually useful is going to look more like ‘we can guarantee that we’ll get a reasonable training signal for problems in [some class]’
I think that is the same as what I’m proving, at least if you allow for “training signal” to mean “training signal in the limit of training on arbitrarily large amounts of data.” See my full post on complexity proofs for more detail on the setup I’m using.