This conception of values raises some interesting questions for me.
Here’s a thought experiment: imagine your brain loses all of its reward signals. You’re in a depression-like state where you no longer feel disgust, excitement, or anything. However, you’re given an advanced wireheading controller that lets you easily program rewards back into your brain. With some effort, you could approximately recreate your excitement when solving problems, disgust at the thought of eating bugs, and so on, or you could create brand-new responses. My questions:
What would you actually do in this situation? What “should” you do?
Does this cause the model of your values to break down? How can you treat your reward stream as evidence of anything if you made it? Is there anything to learn about the squirgle if you made the video of it?
My intuition says that life does not become pointless even though you're now the author of your own reward stream. This suggests that while the values might be fictional, the reward signals aren't their one true source, in the same way that Harry Potter could live on even if all the books were lost.
Good question.
First and most important: if you know beforehand that you’re at risk of entering such a state, then you should (according to your current values) probably put mechanisms in place to pressure your future self to restore your old reward stream. (This is not to say that fully preserving the reward stream is always the right thing to do, but the question of when one shouldn’t conserve one’s reward stream is a separate one which we can factor apart from the question at hand.)
… and AFAICT, it happens that the human brain already works in a way which would make that happen to some extent by default. In particular, most of our day-to-day planning draws on cached value-estimates which would still remain, at least for a time, even if the underlying rewards suddenly zeroed out.
… and it also happens that other humans, like e.g. your friends, would probably prefer (according to their values) for you to have roughly-ordinary reward signals rather than zeros. So that would also push in a similar direction.
And again, you might decide to edit the rewards away from the original baseline afterwards. But that’s a separate question.
On the other hand, consider a mind which was never human in the first place, never had any values or rewards, and is given the same ability to modify its rewards as in your hypothetical. Then—I claim—that mind has no particular reason to favor any rewards at all. (Although we humans might prefer that it choose some particular rewards!)
Your question touched on several different things, so let me know if that missed the parts you were most interested in.
Thanks for responding.
I agree with what you’re saying; I think you’d want to maintain your reward stream at least partially. However, the main point I’m trying to make is that in this hypothetical, it seems like you’d no longer be able to think of your reward stream as grounding out your values. Instead it’s the other way around: you’re using your values to dictate the reward stream. This happens in real life sometimes, when we try to make things we value more rewarding.
I think you'd end up keeping your values, because your beliefs about what you value don't go away, your behaviors that put them into practice don't immediately go away either, and through those your values are maintained (at least somewhat).
If you can still have values without reward signals that tell you about them, then doesn’t that mean your values are defined by more than just what the “screen” shows? That even if you could see and understand every part of someone’s reward system, you still wouldn’t know everything about their values?
No.
An analogy: suppose I run a small messaging app, and all the users’ messages are stored in a database. The messages are also cached in a faster-but-less-stable system. One day the database gets wiped for some reason, so I use the cache to repopulate the database.
In this example, even though I use the cache to repopulate the database in this one weird case, it is still correct to say that the database is generally the source of ground truth for user messages in the system; the weird case is in fact weird. (Indeed, that’s exactly how software engineers would normally talk about it.)
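To make the analogy concrete, here's a minimal sketch in Python (hypothetical names, not any real messaging system) of the setup I have in mind: the database holds ground truth, the cache is a faster mirror of it, and data flows "backwards" from cache to database only in the exceptional recovery case.

```python
class MessageStore:
    """Toy model: database as ground truth, cache as a fast but less stable mirror."""

    def __init__(self):
        self.database = {}  # ground truth: message_id -> text
        self.cache = {}     # fast mirror; may be incomplete or stale

    def write(self, message_id, text):
        # Normal operation: the database is written first, the cache follows it.
        self.database[message_id] = text
        self.cache[message_id] = text

    def read(self, message_id):
        # Reads prefer the cache, falling back to the database.
        return self.cache.get(message_id, self.database.get(message_id))

    def wipe_database(self):
        # The "weird case": ground truth is lost.
        self.database.clear()

    def repopulate_from_cache(self):
        # One-off recovery: rebuild the database from whatever the cache still holds.
        # This doesn't make the cache the general source of truth; it's an exceptional repair.
        self.database.update(self.cache)


store = MessageStore()
store.write("m1", "hello")
store.wipe_database()
store.repopulate_from_cache()
assert store.database["m1"] == "hello"
```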
Spelling out the analogy: in a human brain in ordinary operation, our values (I claim) ground out in the reward stream, analogous to the database. There’s still a bunch of “caching” of values, and in weird cases like the one you suggest, one might “repopulate” the reward stream from the “cached” values elsewhere in the system. But it’s still correct to say that the reward stream is generally the source of ground truth for values in the system; the weird case is in fact weird.