I mean, if the oracle hasn’t yet looked at the question, they could use simulation warfare to cause the preceding oracles to take actions that lead to them being given easier questions. Once you start unbarring all holds, stuff gets wild.
Yes, but if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case.
Also, aside from simulation warfare, another way that the nested case can be manipulative and the sequential case not is if each Oracle cares about doing well on a fixed distribution of inputs (as opposed to doing well “on the current question” or “in the real world” or “on the actual questions that it gets”). That’s because in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input. In the nested case, performance on given inputs does increase.
in the sequential case manipulation can only change the distribution of inputs that the Oracles receive, but it doesn’t improve performance on any particular given input
Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?
Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
Why is that? Doesn’t my behavior on question #1 affect both question #2 and its answer?
I was assuming each “question” actually includes as much relevant history as we can gather about the world, to make the Oracle’s job easier, and in particular that it would include all previous Oracle questions and answers. In that case, if Oracle #1 does X to make question #2 easier, question #2 was already that easy, because the only world in which question #2 gets asked is one in which Oracle #1 did X. But now I realize that’s not actually a safe assumption, because Oracle #1 could break out of its box and feed Oracle #2 a false history that doesn’t include X.
My point about “if we can make it so that each Oracle looks at the question they get and only cares about doing well on that question, that seems to remove the simulation warfare concern in the sequential case but not in the nested case” still stands though, right?
Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
You may well be right about this, but I’m not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?
You may well be right about this, but I’m not sure what reason from selection means. Can you give an example or say what it implies about nested vs sequential queries?
What I want: “There is a model in the class that has property P. Training will find a model with property P.”
What I don’t want: “The best way to get a high reward is to have property P. Therefore a model that is trying to get a high reward will have property P.”
Example of what I don’t want: “Manipulative actions don’t help get a high reward (at least for the episodic reward function we intended), so the model won’t produce manipulative actions.”
So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:
Note that both of these Oracles are designed to be episodic (they are run for single episodes, get their rewards by the end of that episode, aren’t asked further questions before the episode ends, and are only motivated to best perform on that one episode), to avoid incentives to longer term manipulation.
On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?
ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that’s a useful direction to think in?
So this is an argument against the setup of the contest, right? Because the OP seems to be asking us to reason from incentives, and presumably will reward entries that do well under such analysis:
This is an objection to reasoning from incentives, but it’s stronger in the case of some kinds of reasoning from incentives (e.g. where incentives come apart from “what kind of policy would be selected under a plausible objective”). It’s hard for me to see how nested vs. sequential really matters here.
On a more object level, for reasoning from selection, what model class and training method would you suggest that we assume?
(I don’t think model class is going to matter much.)
I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.
(Though I don’t think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)
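To make the default procedure above concrete, here is a minimal sketch, assuming a hypothetical fixed set of candidate models and a log of past episodes each tagged with whether an erasure occurred; the `erasures_only` flag switches between scoring only erasure episodes and scoring all the data, as in the parenthetical above. All names are illustrative placeholders, not anything from the discussion.

```python
# Sketch of "pick the model with the best predictive accuracy over the data so far,
# considering only data where there was an erasure". Episode, candidate_models,
# and the squared-error loss are hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Episode:
    question: str
    outcome: float   # what actually happened by the end of the episode
    erasure: bool    # True if the oracle's answer was never read

def squared_error(prediction: float, outcome: float) -> float:
    return (prediction - outcome) ** 2

def select_model(candidate_models: Sequence[Callable[[str], float]],
                 history: List[Episode],
                 erasures_only: bool = True) -> Callable[[str], float]:
    """Return the candidate with the lowest average loss on past episodes.

    With erasures_only=True, a model is scored only on episodes where its
    answer was not read (the counterfactual-oracle flavour); with False,
    it is scored on all the data.
    """
    data = [ep for ep in history if ep.erasure] if erasures_only else list(history)
    if not data:
        raise ValueError("no episodes to score")
    def avg_loss(model: Callable[[str], float]) -> float:
        return sum(squared_error(model(ep.question), ep.outcome) for ep in data) / len(data)
    return min(candidate_models, key=avg_loss)
```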
I think it’s also interesting to imagine internal RL (e.g. there are internal randomized cognitive actions, and we use REINFORCE to get gradient estimates—i.e. you try to increase the probability of cognitive actions taken in rounds where you got a lower loss than predicted, and decrease the probability of actions taken in rounds where you got a higher loss), which might make the setting a bit more like the one Stuart is imagining.
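A rough sketch of that internal-RL idea, assuming a toy softmax policy over a handful of “cognitive actions” and a running prediction of the loss as the baseline; the fake environment and every name here are placeholders, not anything specified in the discussion.

```python
import numpy as np

# REINFORCE on internal randomized "cognitive actions": actions taken in rounds
# with lower-than-predicted loss become more likely, actions taken in rounds
# with higher-than-predicted loss become less likely.
rng = np.random.default_rng(0)
n_actions = 4
logits = np.zeros(n_actions)   # parameters of the internal policy
baseline = 1.0                 # running prediction of the loss
lr, baseline_lr = 0.1, 0.05

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(1000):
    probs = softmax(logits)
    action = rng.choice(n_actions, p=probs)
    loss = rng.normal(loc=[1.5, 1.0, 0.5, 1.2][action], scale=0.1)  # fake environment

    advantage = baseline - loss    # positive when we did better than predicted
    grad = -probs                  # d log pi(action) / d logits ...
    grad[action] += 1.0            # ... = one_hot(action) - probs
    logits += lr * advantage * grad
    baseline += baseline_lr * (loss - baseline)
```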
ETA: Is an instance of the idea to see if we can implement something like counterfactual oracles using your Opt? I actually did give that some thought and nothing obvious immediately jumped out at me. Do you think that’s a useful direction to think in?
Seems like the counterfactual issue doesn’t come up in the Opt case, since you aren’t training the algorithm incrementally—you’d just collect a relevant dataset before you started training. I think the Opt setting throws away too much for analyzing this kind of situation, and would want to do an online learning version of Opt (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far).
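One way to read “the mixture of models that would do best so far” is a standard exponential-weights (Hedge) mixture over a fixed model class, updated as each loss arrives. A sketch under that assumption, with the model class and loss function as illustrative placeholders:

```python
import numpy as np

# Hypothetical online-learning variant of Opt: answer each query with the mixture
# of models weighted by how well they have done on the losses seen so far.
class OnlineOpt:
    def __init__(self, models, eta=0.5):
        self.models = models                      # fixed class of candidate models
        self.log_weights = np.zeros(len(models))  # log mixture weights
        self.eta = eta

    def answer(self, x):
        w = np.exp(self.log_weights - self.log_weights.max())
        w /= w.sum()
        preds = np.array([m(x) for m in self.models], dtype=float)
        return float(w @ preds)                   # answer of the current mixture

    def update(self, x, loss_fn):
        # loss_fn maps a model's prediction to its loss once the outcome is revealed
        losses = np.array([loss_fn(m(x)) for m in self.models])
        self.log_weights -= self.eta * losses     # downweight models that did badly

# usage sketch: provide inputs and losses one at a time
models = [lambda x, a=a: a * x for a in (0.5, 1.0, 2.0)]
opt = OnlineOpt(models)
ans = opt.answer(3.0)
opt.update(3.0, lambda pred: (pred - 3.0) ** 2)   # the outcome turned out to be 3.0
```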
I think training method should get pinned down more. My default would just be the usual thing people do: pick the model that has best predictive accuracy over the data so far, considering only data where there was an erasure.
This seems to ignore the regularizers that people use to prevent overfitting and make their models generalize better. Isn’t that liable to give you bad intuitions, compared to the actual training methods people use and especially the more advanced generalization methods that people will presumably use in the future?
(Though I don’t think you really need to focus on erasures, I think you can just consider all the data, since each possible parameter setting is being evaluated on what other parameter settings say anyway. I think this was discussed in one of Stuart’s posts about “forward-looking” vs. “backwards-looking” oracles?)
I don’t understand what you mean in this paragraph (especially “since each possible parameter setting is being evaluated on what other parameter settings say anyway”), even after reading Stuart’s post, plus Stuart has changed his mind and no longer endorses the conclusions in that post. I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart’s reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)
would want to do an online learning version of Opt (e.g. you provide inputs and losses one at a time, and it gives you the answer of the mixture of models that would do best so far)
Couldn’t you simulate that with Opt by just running it repeatedly?
This seems to ignore the regularizers that people use to prevent overfitting and make their models generalize better. Isn’t that liable to give you bad intuitions, compared to the actual training methods people use and especially the more advanced generalization methods that people will presumably use in the future?
“The best model” is usually regularized. I don’t think this really changes the picture compared to imagining optimizing over some smaller space (e.g. the space of models with regularizer < x). In particular, I don’t think my intuitions are sensitive to the difference.
I don’t understand what you mean in this paragraph (especially “since each possible parameter setting is being evaluated on what other parameter settings say anyway”)
The normal procedure is: I gather data, and am using the model (and other ML models) while I’m gathering data. I search over parameters to find the ones that would make the best predictions on that data.
I’m not finding parameters that result in good predictive accuracy when used in the world. I’m generating some data, and then finding the parameters that make the best predictions about that data. That data was collected in a world where there are plenty of ML systems (including potentially a version of my oracle with different parameters).
Yes, the normal procedure converges to a fixed point. But why do we care / why is that bad?
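As a toy illustration of that fixed-point claim (purely illustrative, not anyone’s actual pipeline): deploy the current parameters, let the collected data depend on them, refit to that data, and repeat until the parameters stop changing.

```python
import numpy as np

# The "normal procedure": data gathered while a model is deployed depends on that
# model; we then refit to the gathered data and redeploy. generate_data and fit
# are hypothetical stand-ins for the real world and the real training step.
rng = np.random.default_rng(0)

def generate_data(deployed_theta, n=1000):
    # The world (which contains the deployed model) determines the data.
    x = rng.normal(size=n)
    y = (2.0 + 0.1 * deployed_theta) * x + rng.normal(scale=0.1, size=n)
    return x, y

def fit(x, y):
    # "Find the parameters that make the best predictions about that data."
    return float(np.sum(x * y) / np.sum(x * x))   # least-squares slope

theta = 0.0
for _ in range(50):
    x, y = generate_data(theta)
    new_theta = fit(x, y)
    if abs(new_theta - theta) < 1e-6:  # (approximate) fixed point reached
        break
    theta = new_theta
```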
I wonder if you could write a fuller explanation of your views here, and maybe include your response to Stuart’s reasons for changing his mind? (Or talk to him again and get him to write the post for you. :)
I take a perspective where I want to use ML techniques (or other AI algorithms) to do useful work, without introducing powerful optimization working at cross-purposes to humans. On that perspective I don’t think any of this is a problem (or if you look at it another way, it wouldn’t be a problem if you had a solution that had any chance at all of working).
I don’t think Stuart is thinking about it in this way, so it’s hard to engage at the object level, and I don’t really know what the alternative perspective is, so I also don’t know how to engage at the meta level.
Is there a particular claim where you think there is an interesting disagreement?
Couldn’t you simulate that with Opt by just running it repeatedly?
If I care about competitiveness, rerunning Opt for every new datapoint is pretty bad. (I don’t think this is very important in the current context; nothing depends on competitiveness.)
Also, this feels like a doomed game to me—I think we should be trying to reason from selection rather than relying on more speculative claims about incentives.
Does anyone know what Paul meant by this? I’m afraid I might be missing some relatively simple but important insight here.