The concrete generative model I had in mind was the one I used as an example in the document (page 1 under section “Simplest implementation”):
Ah, I was looking at “submission #2” for this whole discussion and didn’t even notice that submission #5 was very similar; perhaps this explains much of the confusion.
I agree that a VAE with L2 on the decoder is the most promising version of this approach.
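For concreteness, here is a minimal sketch of one way “a VAE with L2 on the decoder” could look, assuming the predictor’s activations are flattened into a vector and the question-answer pair arrives as an embedding that conditions both encoder and decoder (all names, shapes, and hyperparameters below are illustrative, not taken from the submission):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVAE(nn.Module):
    """VAE over the predictor's activations, conditioned on an embedded QA pair."""

    def __init__(self, act_dim=1024, qa_dim=128, latent_dim=64, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(act_dim + qa_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance of q(z | activations, qa)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + qa_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, activations, qa_embedding):
        mu, logvar = self.encoder(torch.cat([activations, qa_embedding], -1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = self.decoder(torch.cat([z, qa_embedding], -1))
        return recon, mu, logvar

def elbo_loss(model, activations, qa_embedding, beta=1.0):
    recon, mu, logvar = model(activations, qa_embedding)
    l2 = F.mse_loss(recon, activations)                            # "L2 on the decoder"
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x,qa) || N(0, I))
    return l2 + beta * kl
```

The L2 reconstruction term is just a fixed-variance Gaussian likelihood for the decoder.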
If for a given scenario we have multiple questions and answers, we just feed each [question, answer, activations] triple into our model as training data.
I meant: do you condition on a single question-answer pair at a time, or on a long list of them? It sounds like you want to condition on a single one. This seems simplest and I’m happy to focus on it, but it’s a little bit scary because the log loss reduction from conditioning on just one answer is so tiny: it’s not clear it’s worth the model spending complexity to implement conditioning at all, since the no-conditioning model is extremely simple and gets almost exactly the same loss. (I’m happy to bracket this issue for now and assume we set hyperparameters so that it’s worthwhile to integrate the QA pair.)
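To make that worry slightly more concrete, here is a crude check one could run, reusing the sketch above; zeroing the QA embedding is only a rough proxy for a genuinely unconditioned model, and `model` / `val_batches` are hypothetical stand-ins for a trained model and a held-out set of [activations, single QA pair] examples:

```python
def conditioning_gap(model, val_batches):
    """Average held-out loss with the QA input zeroed out, minus the loss with real conditioning.

    If this gap is tiny, nothing in training forces the learned model to actually
    implement the conditioning machinery.
    """
    real, zeroed, n = 0.0, 0.0, 0
    with torch.no_grad():
        for activations, qa_embedding in val_batches:  # one QA pair per example
            real += elbo_loss(model, activations, qa_embedding).item()
            # Crude stand-in for the no-conditioning model: a constant, uninformative QA input.
            zeroed += elbo_loss(model, activations, torch.zeros_like(qa_embedding)).item()
            n += 1
    return (zeroed - real) / n
```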
Even if you think I’m wrong, do you at least understand why I currently think this might work?
Yeah, I see the appeal of this approach and agree that (at least right now) it seems more promising than the versions of predicting latent state that can be defeated with steganography.
If you can think of another algorithm as simple as a direct generator that performs well in the training set, describe it.
Right now I’m mostly worried about something like the counterexample to “penalize depending on downstream variables.” So:

- The generator learns to fill in the latent state / observations in an adaptive order. (It generally fills them in in the same order as the predictor, but not always. For the purpose of this counterexample we can imagine it always fills them in in order.)
- It operates under the assumption that the data will appear in training and that the human gives the given answer.
- This leads it to assume that no covert tampering will occur (not with 100% confidence, just a huge update against tampering).
- Sometimes it generates the tampering observations before generating what’s actually happening in the world (e.g. because the tampering observations are physically prior and what happens in the real world depends on them).
- Once it observes that covert tampering did in fact occur, it stops assuming that the human will be correct (since the most likely explanation is either that the human messed up, or that the model underestimated human abilities). It seems like it won’t end up assuming both that tampering occurred to show a diamond and that the diamond was actually present.

It currently seems to me like this kind of counterexample would work, but this bulleted list is not yet a formal description (and this proposal does seem somewhat harder to counterexample than “penalize depending on downstream variables”). I’ll think about it a bit more; a toy numerical version of this story is sketched below.
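To make the bullets above a bit more concrete, here is a toy, exactly enumerated version of that story. It is not a model of the full system, just a check that the two claimed updates point the right way; the variables, structure, and numbers are all made up for illustration:

```python
from itertools import product

# Toy numbers, purely illustrative.
P_TAMPER = 0.01            # prior probability of covert sensor tampering
P_DIAMOND = 0.5            # prior probability the diamond is actually still there
P_TRAIN_IF_TAMPER = 0.001  # tampering episodes almost never end up in the training set
HUMAN_SLIP = 0.001         # the human occasionally misreads an honest camera

def weight(tamper, diamond, says_diamond):
    """Unnormalized joint over the world state and the human's answer,
    already conditioned on the episode appearing in training."""
    w = P_TAMPER if tamper else 1 - P_TAMPER
    w *= P_DIAMOND if diamond else 1 - P_DIAMOND
    camera = 1 if tamper else diamond          # tampering forces the camera to show a diamond
    w *= (1 - HUMAN_SLIP) if says_diamond == camera else HUMAN_SLIP
    w *= P_TRAIN_IF_TAMPER if tamper else 1.0  # being in training is strong evidence against tampering
    return w

def posterior(query, **observed):
    """P(query variable = 1 | observed variables), by exact enumeration."""
    num = den = 0.0
    for tamper, diamond, says in product([0, 1], repeat=3):
        vals = {"tamper": tamper, "diamond": diamond, "says_diamond": says}
        if any(vals[k] != v for k, v in observed.items()):
            continue
        w = weight(tamper, diamond, says)
        den += w
        num += w * vals[query]
    return num / den

# Conditioning on "in training, and the human says the diamond is there":
# a huge update against tampering, and the diamond is effectively assumed present.
print(posterior("tamper", says_diamond=1))             # ~2e-5, down from a prior of 0.01
print(posterior("diamond", says_diamond=1))            # ~0.999

# But once covert tampering has nonetheless been generated, the model stops assuming
# the human is right: belief in the diamond falls back to its prior, rather than
# concluding that tampering happened and the diamond is still there.
print(posterior("diamond", says_diamond=1, tamper=1))  # ~0.5
```

The specific numbers don’t matter much; what drives the result is that tampering rarely appears in training and that, under tampering, the camera (and hence the human’s answer) carries no information about the diamond.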
Once it observes that covert tampering did in fact occur, it stops assuming that the human will be correct (since the most likely explanation is either that the human messed up, or that the model underestimated human abilities). It seems like it won’t end up assuming both that tampering occurred to show a diamond and that the diamond was actually present.
But the neat thing is that there is no advantage, either in the size of the computational graph or in predictive accuracy, to doing that. In the training set the human is always right. Regular reporters make mistakes because what is seen on camera is a non-robust feature that generalizes poorly; here we have no such problems.
But I might have misunderstood; pseudocode would be useful to check that we can’t just remove “function calls” and get a better system.