Alex Amadori comments on Teaching ML to answer questions honestly instead of predicting human answers

Alex Amadori 12 Jun 2021 21:05 UTC
1 point
For concreteness, let’s say that the world model requires a trillion (“N”) bits to specify, the intended head costs 10,000 bits, and the instrumental head costs 1,000 bits. If we just applied a simplicity prior directly, we expect to spend N + 1,000 bits to learn the instrumental model rather than N + 10,000 bits to learn the intended model. That’s what we want to avoid.
Not sure if I’m misunderstanding this, but it seems to me that if it takes 10,000 bits to specify the intended head and 1,000 bits to specify the instrumental head, that’s because the world model (which we’re assuming is accurate) considers humans who answer a question with a truthful and correct description of reality much rarer than humans who don’t. Or at least that’s the case when it comes to the training dataset. 10,000 − 1,000 equals 9,000, so in this context “much rarer” means 2^{9,000} times rarer.
However,
Now we have two priors over ways to use natural language: we can either sample the intended head at random from the simplicity prior (which we’ve said has probability 2^{-10,000} of giving correct usage), or we can sample the environment dynamics from the simplicity prior and then see how humans answer questions. If those two are equally good priors, then only 2^{-10,000} of the possible humans would have correct usage, so conditioning on agreement saves us 10,000 bits.
So if I understand correctly, the right amount of bits saved here would be 9,000.
So now we spend (N/2 + 11,000) + (N/2 − 10,000) bits altogether, for a total of N + 1,000.
Unless I made a mistake, this would mean the total is N + 2,000, which is still more expensive than finding the instrumental head.
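The two totals being compared here can be checked with a few lines of arithmetic. This is only a sketch: N is dropped because it appears in every total, the 11,000 figure is taken from the quoted "(N/2 + 11,000)" term, and the names `total_overhead` and `bits_saved` are made up for illustration.

```python
# Overhead in bits beyond the N bits of the world model. N cancels out of
# every comparison, so it is left out. All names here are illustrative.
INTENDED_HEAD = 10_000        # bits to specify the intended head
INSTRUMENTAL_HEAD = 1_000     # bits to specify the instrumental head
FIRST_HALF_OVERHEAD = 11_000  # from the quoted "(N/2 + 11,000)" term


def total_overhead(bits_saved: int) -> int:
    """Overhead of (N/2 + 11,000) + (N/2 - bits_saved), with N removed."""
    return FIRST_HALF_OVERHEAD - bits_saved


# Savings as stated in the quoted post: 10,000 bits.
post_total = total_overhead(10_000)

# Savings as argued in this comment: 10,000 - 1,000 = 9,000 bits.
comment_total = total_overhead(INTENDED_HEAD - INSTRUMENTAL_HEAD)

print(post_total, comment_total, INSTRUMENTAL_HEAD)
```

With 10,000 bits saved the total is N + 1,000, the same as the instrumental model; with 9,000 bits saved it is N + 2,000.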
- paulfchristiano 13 Jun 2021 4:52 UTC
  3 points
  Not sure if I’m misunderstanding this, but it seems to me that if it takes 10,000 bits to specify the intended head and 1,000 bits to specify the instrumental head, that’s because the world model (which we’re assuming is accurate) considers humans who answer a question with a truthful and correct description of reality much rarer than humans who don’t.
  I don’t think the complexity of the head is equal to frequency in the world model. Also I’m not committed to the simplicity prior being a good prior (all I know is that it allowed the AI to learn something the human didn’t understand). And most importantly, a human who answers honestly is not the same as the model’s honest answer—they come apart whenever the human is mistaken.
  So if I understand correctly, the right amount of bits saved here would be 9,000.
  I think 10,000 is right? 2^{-10,000} of all possible functions answer questions correctly. 2^{-1,000} of possible functions look up what the human says, but that’s not relevant for computing P(the human answers questions correctly). (I assume you were computing 9,000 as 10,000 − 1,000.)
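A small sketch of the calculation described in this reply, assuming the standard relation that conditioning on an event of probability p saves −log2(p) bits. It works in log space because 2^{-10,000} underflows a float; the constant and function names are hypothetical.

```python
# Log2-probabilities under the simplicity prior, as given in the thread.
LOG2_P_CORRECT_USAGE = -10_000  # random function answers questions correctly
LOG2_P_LOOKUP_HUMAN = -1_000    # random function looks up what the human says


def bits_saved_by_conditioning(log2_p: int) -> int:
    """Conditioning on an event of probability p saves -log2(p) bits."""
    return -log2_p


# Only P(the human answers questions correctly) enters the calculation;
# LOG2_P_LOOKUP_HUMAN is deliberately unused.
saved = bits_saved_by_conditioning(LOG2_P_CORRECT_USAGE)
print(saved)
```

Under these assumptions the savings come out to 10,000 bits, and the 1,000-bit cost of the instrumental head never appears in the computation.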