In this interview between Eliezer and Luke, Eliezer says that the “solution” to the exploration-exploitation trade-off is to “figure out how much resources you want to spend on exploring, do a bunch of exploring, use all your remaining resources on exploiting the most valuable thing you’ve discovered, over and over and over again.” His point is that humans don’t do this, because we have our own, arbitrary value called boredom, while an AI would follow this “pure math.”
My potentially stupid question: doesn’t this strategy assume that environmental conditions relevant to your goals do not change? It seems to me that if your environment can change, then you can never be sure that you’re exploiting the most valuable choice. More specifically, why is Eliezer so sure that what Wikipedia describes as the epsilon-first strategy is always the optimal one? (Posting this here because I assume he has read more about this than me and that I am missing something.)
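For concreteness, here is how I read the strategy he describes, as code. This is just a toy sketch; the arms, the payout distributions, and the 10% exploration split are all made-up assumptions for illustration, not anything from the interview:

```python
import random

def epsilon_first(arms, budget, explore_fraction=0.1):
    """Spend a fixed chunk of the budget exploring uniformly at random,
    then put everything left into the best arm found so far."""
    explore_steps = int(budget * explore_fraction)
    totals = {arm: 0.0 for arm in arms}
    counts = {arm: 0 for arm in arms}

    # Exploration phase: try arms at random for a pre-committed number of pulls.
    for _ in range(explore_steps):
        arm = random.choice(arms)
        totals[arm] += arm()   # each arm() call returns one noisy payout
        counts[arm] += 1

    # Exploitation phase: commit every remaining pull to the best empirical mean.
    best = max(arms, key=lambda a: totals[a] / counts[a] if counts[a] else float("-inf"))
    exploit_reward = sum(best() for _ in range(budget - explore_steps))
    return sum(totals.values()) + exploit_reward

# Hypothetical arms: each is a function returning one payout draw.
arms = [lambda mu=mu: random.gauss(mu, 10) for mu in (50, 70, 500)]
print(epsilon_first(arms, budget=1000))
```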
You got me curious, so I did some searching. This paper gives fairly tight bounds in the case where the payoffs are adaptive (i.e. can change in response to your previous actions) but bounded. The algorithm is on page 5.
Thanks for the link. Their algorithm, the “multiplicative update rule,” which goes about “selecting each arm randomly with probabilities that evolve based on their past performance,” does not seem to me to be the same strategy as Eliezer describes. So does this contradict his argument?
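In case it helps to see the contrast: a multiplicative update rule looks roughly like the sketch below. To be clear, this is the standard Exp3-style scheme, not necessarily the exact algorithm on page 5 of the linked paper, and the arms and the gamma parameter are made up:

```python
import math
import random

def exp3(arms, rounds, gamma=0.1):
    """Exp3-style multiplicative weights: keep a weight per arm, sample arms
    in proportion to the weights, and multiplicatively boost the weight of
    whichever arm paid off."""
    k = len(arms)
    weights = [1.0] * k
    total_reward = 0.0

    for _ in range(rounds):
        norm = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        probs = [(1 - gamma) * w / norm + gamma / k for w in weights]
        i = random.choices(range(k), weights=probs)[0]
        reward = arms[i]()                     # assumes rewards scaled into [0, 1]
        total_reward += reward
        # Importance-weighted multiplicative update for the chosen arm only.
        weights[i] *= math.exp(gamma * reward / (probs[i] * k))

    return total_reward

# Hypothetical arms with rewards in [0, 1].
arms = [lambda p=p: 1.0 if random.random() < p else 0.0 for p in (0.1, 0.5, 0.9)]
print(exp3(arms, rounds=1000))
```

Note that exploration never fully stops here: every round mixes in a little uniform randomness, rather than spending a fixed exploration budget up front.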
Yes.
You should probably be prepared to change how much you plan to spend on exploring based on the initial information received.
This has me confused as well.
Assume a large area divided into two regions. Region A has slot machines with average payout 50, while region B has machines with average payout 500. I am blindfolded and randomly dropped into region A or B. The first slot machine I try has payout 70. I update in the direction of being in region A. Doesn’t this affect how many resources I wish to spend doing exploration?
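To put rough numbers on that update, here is a toy Bayesian version. The Gaussian payout model and the standard deviation of 100 are purely assumptions for illustration:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Prior: dropped into region A or B with equal probability.
prior_a, prior_b = 0.5, 0.5

# Assumed (made-up) payout model: Gaussian around each region's mean payout.
sigma = 100.0
like_a = normal_pdf(70, mu=50, sigma=sigma)    # region A: mean 50
like_b = normal_pdf(70, mu=500, sigma=sigma)   # region B: mean 500

posterior_a = prior_a * like_a / (prior_a * like_a + prior_b * like_b)
print(f"P(region A | payout 70) = {posterior_a:.4f}")   # very close to 1
```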
Are you also assuming that you know all of those assumed facts about the area?
I would certainly expect that how many resources I want to spend on exploration will be affected by how much a priori knowledge I have about the system. Without such knowledge, the amount of exploration-energy I’d have to expend to be confident that there are two regions A and B with average payout as you describe is enormous.
Do you mean to set the parameter specifying the amount of resources (e.g., time steps) to spend exploring (before switching to full-exploiting) based on the info you receive upon your first observation? Also, what do you mean by “probably”?
Sure. For example, if exploitation itself can alter your environment in a way that makes your earlier judgment of “the most valuable thing” unreliable, then an iterative cycle of explore-exploit-explore can potentially get you better results.
Of course, you can treat each loop of that cycle as a separate optimization problem and use the abovementioned strategy.
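A toy sketch of that, where each loop of the cycle re-runs an explore-then-commit step from scratch; the drifting arms are a made-up stand-in for an environment that changes under you:

```python
import random

def repeated_explore_then_commit(arms, epochs, epoch_budget, explore_fraction=0.2):
    """Treat each epoch as its own explore-then-exploit problem, so the
    'best arm' judgment gets refreshed after every exploitation phase."""
    total = 0.0
    for _ in range(epochs):
        explore_steps = int(epoch_budget * explore_fraction)
        means = []
        for arm in arms:
            samples = [arm() for _ in range(max(1, explore_steps // len(arms)))]
            means.append(sum(samples) / len(samples))
            total += sum(samples)
        best = arms[means.index(max(means))]
        total += sum(best() for _ in range(epoch_budget - explore_steps))
    return total

def make_arm(mu):
    state = {"mu": mu}
    def pull():
        state["mu"] += random.gauss(0, 1)   # the true payout drifts a little each pull
        return random.gauss(state["mu"], 5)
    return pull

# Hypothetical slowly drifting arms.
arms = [make_arm(m) for m in (50, 60, 70)]
print(repeated_explore_then_commit(arms, epochs=10, epoch_budget=100))
```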
Could I replace “can potentially get you better results” with “will get you better results on average”?
Would you accept “will get you better results, all else being equal” instead? I don’t have a very clear sense of what we’d be averaging.
I meant averaging over the possible ways that the environment could change following your exploitation. For example, it’s possible that a particular course of exploitation action could shape the environment such that your exploitation strategy actually becomes more valuable upon each iteration. In such a scenario, exploring more after exploiting would be an especially bad decision. So I don’t think I can accept “will” without “on average” unless “all else” excludes all of these types of scenarios in which exploring is harmful.
OK, understood. Thanks for clarifying.
Hm. I expect that within the set of environments where exploitation can alter the results of what-to-exploit-next calculations, there are more ways for it to do so that make the right move in the next iteration further exploration rather than further exploitation.
So, yeah, I’ll accept “will get you better results on average.”