If you can still have values without reward signals that tell you about them, then doesn’t that mean your values are defined by more than just what the “screen” shows? That even if you could see and understand every part of someone’s reward system, you still wouldn’t know everything about their values?
No.
An analogy: suppose I run a small messaging app, and all the users’ messages are stored in a database. The messages are also cached in a faster but less stable system. One day the database gets wiped for some reason, so I use the cache to repopulate the database.
In this example, even though I use the cache to repopulate the database in this one weird case, it is still correct to say that the database is generally the source of ground truth for user messages in the system; the weird case is in fact weird. (Indeed, that’s exactly how software engineers would normally talk about it.)
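The analogy can be sketched in a few lines of code (a toy illustration, not a real messaging system; the names are invented for the sketch):

```python
# Toy sketch of the messaging-app analogy: the database is the source of
# ground truth for messages; the cache is a faster but less stable copy.

database = {}  # source of ground truth
cache = {}     # faster-but-less-stable copy

def store_message(user, text):
    # Normal operation: write to the ground truth, then cache it.
    database[user] = text
    cache[user] = text

store_message("alice", "hello")
store_message("bob", "hi")

# The weird case: the database gets wiped for some reason.
database.clear()

# Recovery: repopulate the database from the cache. The direction of
# flow reverses, but only along this exceptional recovery path.
database.update(cache)

assert database == {"alice": "hello", "bob": "hi"}
```

Even in the recovery path, the database remains what the system treats as authoritative once repopulated; the cache-to-database flow is the exception, not the rule.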
Spelling out the analogy: in a human brain in ordinary operation, our values (I claim) ground out in the reward stream, analogous to the database. There’s still a bunch of “caching” of values, and in weird cases like the one you suggest, one might “repopulate” the reward stream from the “cached” values elsewhere in the system. But it’s still correct to say that the reward stream is generally the source of ground truth for values in the system; the weird case is in fact weird.