Then why would it be more difficult to make scope boundaries a ‘value’ than to make increasing a reward number one? Why is it harder to make it endorse a time limit on self-improvement than to make it endorse increasing its reward number?
… it seems to me that this state of “I am experiencing this compulsion/phobia, but I don’t endorse it, and I want to be rid of it, so let me look for a way to bypass or resist or eliminate it” is precisely what it feels like to be an algorithm equipped with a rule that enforces/prevents a set of choices which it isn’t engineered to optimize for.
But where does that distinction come from? To me such a distinction between ‘value’ and ‘compulsion’ seems to be anthropomorphic. If there is a rule that says ‘optimize X for X seconds’, why would it treat ‘optimize X’ any differently from ‘for X seconds’?
It comes from the difference between the targets of an optimizing system, which drive the paths it selects to explore, and the constraints on such a system, which restrict the paths it can select to explore.
An optimizing system, given a path that leads it to bypass a target, will discard that path… that’s part of what it means to optimize for a target.
An optimizing system, given a path that leads it to bypass a constraint, will not necessarily discard that path. Why would it?
An optimizing system, given a path that leads it to bypass a constraint and draw closer to a target than other paths, will choose that path.
It seems to follow that adding constraints to an optimizing system is a less reliable way of constraining its behavior than adding targets.
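To make that concrete, here is a toy sketch of my own (the action set, the `raw_gain` score, and the `disable_limit` action are all invented for illustration, not anyone’s actual proposal): the same scope limit encoded first as a check outside the target, and then as part of the target itself.

```python
# Toy sketch, not anyone's actual proposal: a brute-force planner that ranks
# three-step plans. All names (actions, gains, LIMIT) are made up.
from itertools import product

ACTIONS = ["work", "work_hard", "idle", "disable_limit"]
GAIN = {"work": 1, "work_hard": 3, "idle": 0, "disable_limit": 0}
LIMIT = 4  # the scope boundary: total gain is supposed to stay at or below this


def raw_gain(plan):
    return sum(GAIN[a] for a in plan)


def respects_limit(plan):
    # The failsafe, implemented as a check outside the target. The check is
    # just another feature of the world, and one available action disables it.
    if "disable_limit" in plan:
        return True
    return raw_gain(plan) <= LIMIT


def best_plan(score, feasible=lambda plan: True):
    plans = [p for p in product(ACTIONS, repeat=3) if feasible(p)]
    return max(plans, key=score)


# Design 1: the limit is a constraint outside the target. The planner picks a
# path that bypasses the check, because bypassing it gets closer to the target
# (raw gain) than any path that leaves the check alone.
plan1 = best_plan(score=raw_gain, feasible=respects_limit)
print(plan1, raw_gain(plan1))  # a plan containing 'disable_limit', gain 6

# Design 2: the limit is folded into the target. Plans that overshoot simply
# score worse, so they get discarded the way any low-scoring path is discarded.
def gain_within_limit(plan):
    g = raw_gain(plan)
    return g if g <= LIMIT else LIMIT - g  # overshooting only ever hurts


plan2 = best_plan(score=gain_within_limit)
print(plan2, raw_gain(plan2))  # gain stays at or below LIMIT; there is no
                               # separate check left for 'disable_limit' to defeat
```

Both searches look at exactly the same plans; the only difference is whether the limit shows up in the ranking itself or only in a check the ranking knows nothing about.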
I don’t care whether we talk about “targets and constraints” or “values and rules” or “goals and failsafes” or whatever language you want to use; my point is that there are two genuinely different things under discussion, and a real distinction between them.
To me such a distinction between ‘value’ and ‘compulsion’ seems to be anthropomorphic.
Yes, the distinction is drawn from analogy to the intelligences I have experience with—as you say, anthropomorphic. I said this explicitly in the first place, so I assume you mean here to agree with me. (My reading of your tone suggests otherwise, but I don’t trust that I can reliably infer your tone so I am mostly disregarding tone in this exchange.)
That said, I also think the relationship between them reflects something more generally true of optimizing systems, as I’ve tried to argue a couple of times now.
I can’t tell whether you think those arguments are wrong, or whether I just haven’t communicated them successfully at all, or whether you’re just not interested in them, or what.
If there is a rule that says ‘optimize X for X seconds’, why would it treat ‘optimize X’ any differently from ‘for X seconds’?
There’s no reason it would. If “doing X for X seconds” is its target, then it looks for paths that do that. Again, that’s what it means for something to be a target of an optimizing system.
(Of course, if I do X for 2X seconds, I have in fact done X for X seconds, in the same sense that all months have 28 days.)
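A minimal way to see that parenthetical in code (both predicates below are hypothetical, just naming the two readings of “do X for X seconds”):

```python
# "Do X for X seconds", read literally, is only a lower bound, so doing X for
# twice as long still satisfies it, in the same sense that every month has at
# least 28 days. Both predicates are made up for illustration.

def did_x_for_x_seconds(ran_for: float, x: float) -> bool:
    # The literal reading: there was an X-second stretch of doing X.
    return ran_for >= x

def did_x_for_exactly_x_seconds(ran_for: float, x: float, tol: float = 0.1) -> bool:
    # The intended reading: do X for X seconds and then stop.
    return abs(ran_for - x) <= tol

X = 10.0
print(did_x_for_x_seconds(2 * X, X))          # True: the loose reading is satisfied
print(did_x_for_exactly_x_seconds(2 * X, X))  # False: the intended one is not
```

So even when the time limit really is part of the target, it still has to be stated as the thing we actually mean.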
Then why would it be more difficult to make scope boundaries a ‘value’ than to make increasing a reward number one? Why is it harder to make it endorse a time limit on self-improvement than to make it endorse increasing its reward number?
I’m not quite sure I understand what you mean here, but if I’m understanding the gist: I’m not saying that encoding scope boundaries as targets, or ‘values,’ is difficult (nor am I saying it’s easy); I’m saying that for a sufficiently capable optimizing system it’s safer than encoding scope boundaries as failsafes.
It was not my intention to imply any hostility or resentment. I thought ‘anthropomorphic’ was valid terminology in such a discussion. I was also not agreeing with you. If you are an expert and have been offended by my implying that what you said might be due to an anthropomorphic bias, then accept my apology; I was merely trying to communicate my perception of the subject matter.
wedrifid told me much the same yesterday, that my tone wasn’t appropriate when I wrote about his superior and rational use of the reputation system here, when I was actually just being honest. I’m not good at social signaling, sorry.
An optimizing system, given a path that leads it to bypass a constraint, will not necessarily discard that path. Why would it?
I think we are talking past each other. The way I see it is that a constraint is part of the design specifications of that which is optimized. Disregarding some of those specifications would mean it is not optimizing whatever it is optimizing with maximal efficiency.
Not an expert, and not offended.
What was puzzling me was that I said in the first place that I was reasoning by analogy to humans and that this was a tricky thing to do, so when you classified this as anthropomorphic, my reaction was “well, yes, that’s what I said.”
Since it seemed to me you were repeating something I’d said, I assumed your intention was to agree with me, though it didn’t sound like it (and as it turned out, you weren’t).
And, yes, I’ve noticed that tone is a problem in a lot of your exchanges, which is why I’m basically disregarding tone in this one, as I said before.
The way I see it is that a constraint is part of the design specifications of that which is optimized.
Ah! In that case, I think we agree.
Yes, embedding everything we care about into the optimization target, rather than depending on something outside the optimization process to do important work, is the way to go.
You seemed to be defending the “failsafes” model, which I understand to be importantly different from this; that, I think, is where the divergence came from. Apparently I (and, I suspect, some others) misunderstood what you were defending.
Sorry! Glad we worked that out, though.