paulfchristiano comments on ELK prize results

paulfchristiano 10 Mar 2022 16:10 UTC
4 points
This was classified under “Train a reporter that’s useful for an auxiliary AI.”
(I think you didn’t get listed under that category because you got another prize for a different proposal and we were inconsistent about listing people multiple times.)
- P. 10 Mar 2022 19:17 UTC
  4 points
  Parent
  Sorry for writing so many comments, but I just don’t see it. Unless it is implemented in a very weird way or I’m misunderstanding something, my proposal doesn’t fail to the steganography counterexamples.
  I don’t know why you think it would. One possibility is that there is a superficially similar proposal that might fail to steganography: we can train a reporter so that it produces answers that allow another system to reconstruct the activations of the predictor. In this case steganography is useful to pass extra information to that generator. But my proposal isn’t trying to make the answers as informative about the world as possible (during training); if we are using a proper distribution matching generator (i.e. almost anything except GANs), then it is trying to model the distribution as accurately as possible (given the computational limitations). Encoding information on the far side of the moon would just increase its loss function: given any random variable the distribution that minimizes the expected -log-likelihood is its true distribution, not any smart encoding. And if there is a QA pair that a human could have answered with uncertainty, the generator should then again accurately model that distribution (but a well made dataset won’t reflect that uncertainty).
  And even if it did perform steganography (for some reason unknown to me) it would do so “on top” of the “direct generator” instead of the “generator of worlds that a human thinks are consistent with a QA pair”, because it is simpler.
  - paulfchristiano 11 Mar 2022 1:39 UTC
    4 points
    Parent
    I may have misunderstood your proposal. (And I definitely misremembered it when I replied to your comment earlier, I just looked back at the google doc now.)
    Here was the naive interpretation I had of what you submitted:
    Start with a regularizer C that prefers faster programs. You suggested “serial depth” so I’ll assume that you are imposing some upper limit on total size of the computation and then minimizing serial depth subject to that limit.
    Learn a map F : (question, answer) --> (predictor activations)
    Learn a prior distribution P(answer|question)
    Define a map G : (predictor activations, question) --> (distribution over answers) as G(z, Q) = P(A|Q, F(Q, A) = z)
    Then optimize F for some linear combination of C(F) and log loss of G when predicting human answers to question.
    I think you suggested giving F many (question, answer) pairs but it wasn’t clear how to define the function and I don’t think it change the bottom line.
    It seems like this would learn the bad behavior:
    F generates a picture of the world with the given questions and answers, by running the human model forward.
    Then it guesses that picture for the output and fills in all remaining activations uniformly at random.
    This is much cheaper than a direct translator if the human model is much cheaper than a direct translator. It leads to great log loss, and it copies human answers.
    If you had something else in mind then we may have misinterpreted. I’d be happy to provide feedback if you provide pseudocode or another precise description.
    - P. 11 Mar 2022 16:34 UTC
      4 points
      Parent
      Is this precise enough?
      Start with a regularizer C that prefers faster programs. The exact implementation depends on the exact F used. For a regular neural network it is some proxy of simplicity, like L2 regularization. Although it would also be helpful to make F shallow.
      Learn a conditional probability distribution F: (question, answer) --> (predictor activations)
      It must be such that we can compute P(predictor activations|question, answer), or some approximation
      It is optimized for some linear combination of C(F) and log loss when predicting the activations.
      Learn a prior distribution P(answer|question)
      In the document I actually wrote just P(answer), that was a mistake, but it doesn’t change anything meaningful.
      In order to answer a question, iterate over all possible answers and choose the one that maximizes P(A|Q)*P(z|Q,A), z being the activations.
      This might be slow, but this whole system can then be distilled into one that answers questions directly, we just need to train it with generated samples outside the training distribution, perhaps chosen adversarially. But that is outside the scope of the proposal.
      This is given by P(A|Q,z) = P(Q,A,z)/P(Q,z) = P(Q)*P(A|Q)*P(z|Q,A)/P(Q,z) ∝ P(A|Q)*P(z|Q,A)
      As discussed in the document, generating the sensory data first is a problem that won’t happen. Neither can it fill the activations at random, that will give it a terrible loss.
      - paulfchristiano 13 Mar 2022 7:44 UTC
        4 points
        Parent
        Is this precise enough?
        It seems like there are still some details that lead to totally different behavior: how do we represent the generative model, how do we define C, and maybe (depending on the answers to those questions) how do we trade off C against the log loss and how do we incorporate many different questions and answers.
        In order to respond, I’ll just assume that you are considering only a single question-answer pair (since otherwise there are lots of other details) and just make some random assumptions about the rest. I expect you aren’t happy with this version, in which case you can supply your own version of those details.
        Learn a conditional probability distribution F: (question, answer) --> (predictor activations)
        It must be such that we can compute P(predictor activations|question, answer), or some approximation
        How do you represent that probability distribution?
        If F(z|Q, A) as an autoregressive model, then it seems like the most important thing to do by far is to learn a copy of the predictor. And that’s very shallow! Perhaps this isn’t how you want to represent the probability distribution, in which case you can correct it by stating a concrete alternative. For example, if you represent as a GAN or energy-based model then you seem to get a human simulator. Perhaps you mean a VAE with the computational penalty applied only to the decoder? But I’m just going to assume you mean an autoregressive model.
        You might hope that you can improve the log loss of the autoregressive by incorporating information from the question & answer. That is, F(z[k] | z[<k], Q, A) is the best guess, given a tiny amount of computation and z[<k]. And the hope is that it might be very fast to infer something about particular bits z[k] given Q and A. Maybe you have something else in mind, in which case you could spell that out.
        It seems like the optimal F should make predictions not only conditioned on A being the answer to Q, but on the fact that the situation was included for training (e.g. it should predict z[k] is such that nothing complicated happens). Does that sound right?
        And now the question is what is the best way to predict z[k] given (z[<k], Q, A), and the fact that the current situation was included as a training datapoint. In particular, if z is such that the answer to Q is A but the human thinks it is A’, is F(z, Q, A) larger or smaller than F(z, Q, A’)?
        I don’t think this works, but just want to understand the proposal.
        P. 13 Mar 2022 10:29 UTC
        4 points
        Parent
        The concrete generative model I had in mind was the one I used as an example in the document (page 1 under section “Simplest implementation”):
        Train a conditional VAE to generate all the variables of the decoder of the predictor, condition on the [question, answer] pair. Use L2 regularization on the decoder of our model as a proxy for complexity (since the computation time of VAEs is constant).
        An autoregressive model is probably the single worst model you could choose. It forces an order of generation of the latents, which breaks the arguments I wrote in my proposal. And for the purposes of how I expect the intended model to work (which I will explain below), it is very very deep. During training we can generate everything in parallel, but at inference time we need to generate each token after the previous one, which is a very deep computational graph.
        I don’t understand what you mean by considering a single question-answer pair. If for a given scenario we have multiple questions and answers, we just feed each [question, answer, activations] triple into our model as training data.
        Does that sound right?
        It does, that is the distribution it is modeling.
        The intended behavior is the one I wrote in my reply to Thomas:
        both the “direct generator” and the “generator of worlds that a human thinks are consistent with a QA pair” need to generate the activations that represent the world, but the “direct generator” does it directly (e.g. “diamond” → representation of a diamond), while the other one performs additional computation to determine what worlds a human thinks are consistent with a QA pair (e.g. “diamond” → it might be a diamond, but it might also be just a screen in front of the camera and what is really in the vault is…), in the training set where the labels are always correct the direct generator is preferred.
        If we have a dataset where what is seen on screen always corresponds (simply) with reality, I expect the model to generate the “world” first (using a direct map like I said above) and then use it to generate the images (I argued for this order in the document). If we use this to answer questions, it will care about what is really happening in reality (this is just the Markov property, see the last row of my drawing if that isn’t clear).
        If for some reason this fails and the model generates both the images and the world each conditioned mostly independently on the QA pair, then using that to answer questions will get us a system that cares about both what is happening on screen and what is happening in reality. Given that the data it had access to is consistent with both hypotheses, I don’t fault it for doing that. Using it to optimize the world so that the answer to “Is the diamond in the vault?” is “Yes”, will give us a system that cares about “both” diamonds. Not a terrible failure mode.
        If on the other hand we have a dataset where the vault has an operator, so that it is hard for another person to answer correctly, the ordering argument becomes stronger. For simple enough models QA->world->images works but in QA->images->world the QA information is lost in the first step and doesn’t get to the world (since the images are modeled as independent from the true answer). And in this case if this fails for some reason and the model generates the world and the images independently, using it to answer questions will give us a model that cares strongly about reality. Because the correlation of the QA pair is stronger with reality (by construction, at least for simple models) than with the images. Here you can’t use the argument of “but what if performing perfect inference on the human simulation is still cheaper than the direct reporter”, since both models need to be big enough to generate all the data.
        If you can think of another algorithm as simple as a direct generator that performs well in the training set, describe it.
        Even if you think I’m wrong, do you at least understand why I currently think this might work?
        paulfchristiano 13 Mar 2022 18:29 UTC
        4 points
        Parent
        The concrete generative model I had in mind was the one I used as an example in the document (page 1 under section “Simplest implementation”):
        Ah, I was looking at “submission #2” for this whole discussion and didn’t even notice that submission #5 was very similar, perhaps this explains much of the confusion.
        I agree that a VAE with L2 on the decoder is the most promising version of this approach.
        If for a given scenario we have multiple questions and answers, we just feed each [question, answer, activations] triple into our model as training data.
        I meant: do you condition on a single question-answer pair at a time, or do you condition on a long list of them? It sounds like you want to condition on a single one. This seems simplest and I’m happy to focus on it, but it’s a little bit scary because the log loss reduction from conditioning on just one answer is so tiny (and so it’s not clear if it’s worth the model spending complexity to implement conditioning at all, since the no-conditioning model is extremely simple and gets almost exactly the same loss, though I’m happy to bracket this issue for now / assume we set hyperparameters so that it’s worthwhile to integrate the QA pair).
        Even if you think I’m wrong, do you at least understand why I currently think this might work?
        Yeah, I see the appeal of this approach and agree that (at least right now) it seems more promising than the versions of predicting latent state that can be defeated with steganography.
        If you can think of another algorithm as simple as a direct generator that performs well in the training set, describe it.
        Right now I’m mostly worried about something like the counterexample to “penalize depending on downstream variables.” So:
        The generator learns to fill in the latent state / observations in an adaptive order. (It generally fills them in the same order as the predictor but not always. For the purpose of this counterexample we can imagine it always fills them in in order.)
        It operates under the assumption that the data will appear in training and that a human gives the given answer.
        This leads it to assume that no covert tampering will occur. But maybe not 100% confidence, just huge updates against tampering.
        Sometimes it generates the tampering observations before what’s actually happening in the world (e.g. because the tampering observations are physically prior and what happens in the real world depends on them).
        Once it observes that covert tampering did in fact occur, it stops assuming that the human will be correct. (Since the most likely explanation is either that the human messed up, or that the model underestimated human abilities.) It seems like it won’t end up assuming that both tampering occurred to show a diamond and that the diamond was actually present.
        It currently seems to me like this kind of counterexample would work, but this bulleted list is not yet a formal description (and it does seem somewhat harder to counterexample than depending on downstream variables). I’ll think about it a bit more.
        P. 14 Mar 2022 7:52 UTC
        4 points
        Parent
        Once it observes that covert tampering did in fact occur, it stops assuming that the human will be correct. (Since the most likely explanation is either that the human messed up, or that the model underestimated human abilities.) It seems like it won’t end up assuming that both tampering occurred to show a diamond and that the diamond was actually present.
        But the neat thing is that there is no advantage, either to the size of the computational graph or in predictive accuracy, to doing that. In the training set the human is always right. Regular reporters make mistakes because what is seen on camera is a non-robust feature that generalizes poorly, here we have no such problems.
        But I might have misunderstood, pseudocode would be useful to check that we can’t just remove “function calls” and get a better system.
        Thomas 13 Mar 2022 21:56 UTC
        1 point
        Parent
        Ah, a conditional VAE! Small question: Am I the only one that reserves ‘z’ for the latent variables of the autoencoder? You seem to be using it as the ‘predictor state’ input. Or am I reading it wrong?
        
        Now I understand your P(z|Q,A) better, as it’s just the conditional generator. But, how do you get P(A|Q)? That distribution need not be the same for the human known set and the total set.
        
        I was wondering what happens in deployment when you meet a z that’s not in your P(z,Q,A) (ie very small p). Would you be sampling P(z|Q,A) forever?
        P. 14 Mar 2022 8:28 UTC
        1 point
        Parent
        You aren’t the only one, z is usually used for the latent, I just followed Paul’s notation to avoid confusion.
        P(A|Q) comes just from training on the QA pairs. But I did say “set any reasonable prior over answers”, because I expect P(z|Q,A) to be orders of magnitude higher for the right answer. Like I said in another comment, an image generator (that isn’t terrible) is incredibly unlikely to generate a cat from the input “dog”, so even big changes to the prior probably won’t matter much. That being said, machine learning rests on the IID assumption, regular reporters are no exception, they also incorporate P(A|Q), it’s just that here it is explicit.
        The whole point of VAEs is that the estimation of the probability of a sample is efficient (see section 2.1 here: https://arxiv.org/abs/1606.05908v1), so I don’t expect it to be a problem.
      - Thomas 12 Mar 2022 12:25 UTC
        3 points
        Parent
        Is this precise enough?
        As I read this, your proposal seems to hinge on a speed prior (or simplicity prior) over F such that F has good generalization from the simple training set to the complex world. I think you could be more precise if you’d explain how the speed prior (enforced by C) chooses direct translation over simulation. Reading your discussion, your disagreement seems to stem from a disagreement about the effect of the speed prior i.e. are translators faster or less complex than imitators?
        P. 12 Mar 2022 13:22 UTC
        3 points
        Parent
        The disagreement stems from them not understanding my proposal, I hope that now it is clear what I meant. I explained in my submission why the speed prior works, but in a nutshell it is because both the “direct generator” and the “generator of worlds that a human thinks are consistent with a QA pair” need to generate the activations that represent the world, but the “direct generator” does it directly (e.g. “diamond” → representation of a diamond), while the other one performs additional computation to determine what worlds a human thinks are consistent with a QA pair (e.g. “diamond” → it might be a diamond, but it might also be just a screen in front of the camera and what is really in the vault is…), in the training set where the labels are always correct the direct generator is preferred.
        Thomas 12 Mar 2022 14:12 UTC
        4 points
        Parent
        Hmm, why would it require additional computation? The counterexample does not need to be an exact human imitator, only a not-translator that performs well in training. In the worst case there exist multiple parts of the activations of the predictor that correlate to “diamond”, so multiple ‘good results’ by just shifting parameters in the model.
        P. 12 Mar 2022 17:47 UTC
        2 points
        Parent
        A generative model isn’t like a regression model. If we have two variables that are strongly correlated and want to predict another variable, then we can shift the parameters to either of them and get very close to what we would get by using both. In a generative model on the other hand, we need to predict both, no value is privileged, we can’t just shift the parameters. See my reply to interstice on what I think would happen in the worst case scenario where we have failed with the implementation details and that happens. The result wouldn’t be that bad.
        If you can think of another algorithm as simple as a direct generator that performs well in training, say so. I think that almost by definition the direct generator is the simplest one.
        And if we make a good enough but still human level dataset (although this isn’t a requirement for my approach to work) the only robust and simple correlation that remains is the one we are interested in.
        Thomas 12 Mar 2022 20:11 UTC
        1 point
        Parent
        Ah, I missed that it was a generative model. If you don’t mind I’d like to extend this discussion a bit. I think it’s valuable (and fun).
        
        I do still think it can go wrong. The joint distribution can shift after training by confounding factors and effect modification. And the latter is more dangerous, because for the purposes of reporting the confounder matters less (I think), but effect modification can move you outside any distribution you’ve seen in training. And it can be something really stupid you forgot in your training set, like the action to turn off the lights causing some sensors to work while others do not.
        
        You might say, “ah, but the information about the diamond is the same”. But I don’t think that that applies here. It might be that the predictor state as a whole encodes the whereabouts of the diamond and the shift might make it unreadable.
        
        I think that it’s very likely that the real world has effect modification that is not in the training data just by the fact that the world of possibilities is infinite. When the shift occurs your P(z|Q,A) becomes small, causing us to reject everything outside the learned distribution. Which is safe, but also seems to defeat the purpose of our super smart predictor.
        P. 12 Mar 2022 18:43 UTC
        1 point
        Parent
        As an aside, I think that that property of regression models, in addition to using small networks and poor regularization might be why adversarial examples exist (see http://gradientscience.org/adv.pdf). Some features might not be robust. If we have an image of a cat and the model depends on some non robust feature to tell it apart from dogs, we might be able to use the many degrees of freedom we have available to make a cat look like a dog. On the other hand if we used something like this method we would need to find an image of a cat that is more likely to have been generated from the input “dog” than from the input “cat”, it’s probably not going to happen.
        Thomas 12 Mar 2022 20:27 UTC
        1 point
        Parent
        Could be! Though, in my head I see it as a self centering monte carlo sampling of a distribution mimicking some other training distribution, GANs not being the only one in that group. The drawback is that you can never leave that distribution; if your training is narrow, your model is narrow.
      - interstice 12 Mar 2022 16:25 UTC
        1 point
        Parent
        Interesting idea, but wouldn’t it run into the problem that the map F would learn both valid and invalid correlations? Like it should learn both to predict that the activations representing “diamond position” AND the activations representing “diamond shown on camera” are active. So in a situation where those activations don’t match it’s not clear which will be preferred. You might say that in that case the human model should be able to generate those activations using “camera was hacked” as a hypothesis, but if the hack is done in a way that the human finds incomprehensible this might not work(or put another way, the probability assigned to “diamond location neurons acting weird for some reason” might be higher than “camera hacked in undetectable way”, which could be the case if the encoding of the diamond position is weird enough)
        P. 12 Mar 2022 17:28 UTC
        2 points
        Parent
        I don’t think that would happen. But imagine that somehow it does happen, the regularization is too strong and the dataset doesn’t include any examples where the camera was hacked, so our model predicts both the activations of the physical diamond and the diamond image independently, what then? Try to think about any toy model of a scenario like that, any simple enough that we can analyze exactly. The simplest is that the variable on which we are conditioning the generation is an “uniformly” distributed scalar to which we apply two linear transformations to predict two values (which are meant to stand for the two diamonds) and then add gaussian noise. Given two observed values I’m pretty sure (I didn’t actually do the math but it seems obvious) that the reconstructed initial value is a weighted average of what would be predicted by either value independently. I expect that something analogous would happen in more realistic scenarios. Is this an acceptable behavior? I think so, or at least much better than any known alternative. If we used an AI to optimize the world so that the answer to “Is the diamond in the vault?” is “Yes”, it would make sure that both the real diamond and the one in the image stay in place.
    - David Johnston 11 Mar 2022 1:44 UTC
      3 points
      Parent
      This gets good log loss because it’s trained in the regime where the human understands what’s going on, correct?
      - paulfchristiano 11 Mar 2022 1:51 UTC
        4 points
        Parent
        Yes; for this bad F, the resulting G is very similar to a human simulator.