So I want to emphasize that I’m only saying it’s *plausible* that *there exists* a specification of “manipulation”. This is my default position on all human concepts. I also think it’s plausible that there does not exist such a specification, or that the specification is too complex to grok, or that there end up being multiple conflicting notions we conflate under the heading of “manipulation”. See this post for more.
Overall, I understand and appreciate the issues you’re raising, but I think all this post does is show that naive attempts to specify “manipulation” fail; I think it’s quite difficult to argue compellingly that no such specification exists ;)
“It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle.”
^ Actually, I think “ending up with a better understanding” (in the sense I’m reading it) is probably not sufficient to rule out manipulation; what I mean is that I can do something which genuinely improves your model of the world, but leads you to follow a policy with worse expected returns. A simple example: suppose you are doing Bayesian updating and your priors over the returns of two bandit arms are P(r|a_1) = N(1, 1) and P(r|a_2) = N(2, 1), while the true expected returns are 1/2 and 2/3 (respectively). So your current estimates are optimistic, but they are ordered correctly, and so induce the optimal (greedy) policy.
Now if I give you a bunch of observations of a_2, I will be giving you true information, which will lead you to learn, correctly and with high confidence, that the expected reward for a_2 is ~2/3, improving your model of the world. But since you haven’t updated your estimate for a_1, you will now prefer a_1 to a_2 (if acting greedily), which is suboptimal. So overall I’ve informed you with true information, but disadvantaged you nonetheless. I’d argue that if I did this intentionally, it should count as a form of manipulation.
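The bandit example above can be sketched in a few lines of Python. This is just an illustration of the scenario as described (names and the number of observations are my own choices, not from the comment): conjugate Gaussian updates with known observation variance, priors N(1, 1) and N(2, 1), true means 1/2 and 2/3, and a stream of honest samples from a_2 that flips the greedy policy to the worse arm.

```python
import random

random.seed(0)

# True expected returns (unknown to the learner).
true_means = {"a1": 0.5, "a2": 2 / 3}

# Learner's Gaussian priors over each arm's mean return:
# a1 ~ N(1, 1), a2 ~ N(2, 1). Optimistic, but correctly ordered,
# so the greedy policy (pick the higher posterior mean) is optimal.
post = {"a1": (1.0, 1.0), "a2": (2.0, 1.0)}  # (mean, variance)

def update(mean, var, obs, obs_var=1.0):
    """Conjugate update of a Gaussian belief over a mean,
    with known observation variance."""
    new_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    new_mean = new_var * (mean / var + obs / obs_var)
    return new_mean, new_var

def greedy():
    return max(post, key=lambda a: post[a][0])

assert greedy() == "a2"  # before any data: optimal, since 2/3 > 1/2

# The "manipulator" supplies many honest observations of a_2 only.
for _ in range(1000):
    obs = random.gauss(true_means["a2"], 1.0)
    post["a2"] = update(*post["a2"], obs)

# The learner's model of a_2 is now accurate (posterior mean ~2/3),
# but the untouched optimistic estimate for a_1 (still 1.0) wins,
# so the greedy policy has flipped to the worse arm.
print(post["a2"][0])
print(greedy())  # prints "a1"
```

Every sample shown here is drawn from the true distribution of a_2, so the learner’s beliefs only get more accurate, yet the induced greedy policy gets worse.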
Thanks for writing that post; have you gotten many volunteers so far?
Haha no not at all ;)
I’m not actually trying to recruit people to work on that, just trying to make people aware of the idea of doing such projects. I’d suggest it to pretty much anyone who wants to work on AI-Xrisk without diving deep into math or ML.
Shame :-(