Good question.
First and most important: if you know beforehand that you’re at risk of entering such a state, then you should (according to your current values) probably put mechanisms in place to pressure your future self to restore your old reward stream. (This is not to say that fully preserving the reward stream is always the right thing to do; but when one shouldn’t preserve one’s reward stream is a separate question, which we can factor apart from the one at hand.)
… and AFAICT, the human brain already works in a way which pushes in that direction to some extent by default. In particular, most of our day-to-day planning draws on cached value-estimates, which would still remain, at least for a time, even if the underlying rewards suddenly zeroed out. (There’s a toy sketch of that dynamic just after these points.)
… and it also happens that other humans (e.g. your friends) would probably prefer, according to their values, that you have roughly-ordinary reward signals rather than zeros. That would also push in the same direction.
And again, you might decide to edit the rewards away from the original baseline afterwards. But that’s a separate question.
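To make the “cached value-estimates” point concrete, here’s a minimal sketch of the reinforcement-learning version of the analogy: a toy tabular Q-learning agent on a made-up five-state chain (everything in it is invented for illustration; it’s not a claim about how the brain actually implements this). The agent’s behavior comes from its cached Q-table rather than from the live reward, so zeroing out the reward stream doesn’t by itself change what it does; only continued learning under the zeroed reward gradually washes the cache, and the old behavior, away.

```python
# Toy sketch (hypothetical names and numbers throughout): an agent learns Q-values on a
# five-state chain where reaching state 4 pays reward 1. "Planning" is just a greedy
# lookup in the cached Q-table, so the cache, not the live reward, drives behavior.
import random

N_STATES, GOAL = 5, 4

def step(state, action, reward_on=True):
    """Deterministic chain: action 1 moves right, action 0 moves left (floored at 0)."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if (reward_on and next_state == GOAL) else 0.0
    return next_state, reward

def greedy(Q, s):
    """Act on cached value-estimates, breaking exact ties randomly."""
    best = max(Q[(s, 0)], Q[(s, 1)])
    return random.choice([a for a in (0, 1) if Q[(s, a)] == best])

def train(Q, episodes, reward_on, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning; reward_on=False runs the same episodes with the reward zeroed."""
    for _ in range(episodes):
        s = 0
        for _ in range(30):
            a = random.choice((0, 1)) if random.random() < eps else greedy(Q, s)
            s2, r = step(s, a, reward_on)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
            s = s2
            if s == GOAL:
                break

Q = {(s, a): 0.0 for s in range(N_STATES) for a in (0, 1)}

train(Q, episodes=300, reward_on=True)        # learn while the reward stream is intact
policy_cached = [greedy(Q, s) for s in range(GOAL)]  # behavior read off the cache alone

train(Q, episodes=2000, reward_on=False)      # keep learning after the reward is zeroed out
decayed = [round(max(Q[(s, 0)], Q[(s, 1)]), 3) for s in range(GOAL)]

print("greedy policy from cached values:", policy_cached)   # typically [1, 1, 1, 1]: still heads for the old goal
print("cached values after long zero-reward learning:", decayed)  # shrunk toward 0
```

The analogy is loose, of course: the claim is just that something cache-like would keep steering behavior for a while after the underlying rewards changed, not that it decays on any particular schedule.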
On the other hand, consider a mind which was never human in the first place, never had any values or rewards, and is given the same ability to modify its rewards as in your hypothetical. Then—I claim—that mind has no particular reason to favor any rewards at all. (Although we humans might prefer that it choose some particular rewards!)
Your question touched on several different things, so let me know if that missed the parts you were most interested in.
Thanks for responding.
I agree with what you’re saying; I think you’d want to maintain your reward stream at least partially. However, the main point I’m trying to make is that in this hypothetical, it seems like you’d no longer be able to think of your reward stream as grounding out your values. Instead it’s the other way around: you’re using your values to dictate the reward stream. This happens in real life sometimes, when we try to make things we value more rewarding.
You’d end up keeping your values, I think: your beliefs about what you value don’t go away, and neither (immediately) do the behaviors that put them into practice, so through those your values are maintained, at least somewhat.
If you can still have values without reward signals that tell you about them, then doesn’t that mean your values are defined by more than just what the “screen” shows? That even if you could see and understand every part of someone’s reward system, you still wouldn’t know everything about their values?