At some point I sit down and think about escamoles. Yeah, ants are kinda gross, but on reflection I don’t think I endorse that reaction to escamoles. I can see why my reward system would generate an “ew, gross” signal, but I model that reward as the result of two decoupled causes: a hardcoded aversion to insects, and my actual values. I know that I am automatically averse to putting insects in my mouth, so it’s less likely that the negative reward is evidence about my values in this case; the signal is explained away, in the usual epistemic sense, by some cause other than my values. So, I partially undo the value-downgrade I had assigned to escamoles in response to the “ew, gross” reaction. I might still feel some disgust, but I consciously override that disgust to some extent.
That last example is particularly interesting, since it highlights a nontrivial prediction of this model. Insofar as reward is treated as evidence about values, and our beliefs about values update in the ordinary epistemic manner, we should expect all the typical phenomena of epistemic updating to carry over to learning about our values. Explaining-away is one such phenomenon. What do other standard epistemic phenomena look like, when carried over to learning about values using reward as evidence?
I feel like this sort of makes sense, but I don’t quite parse why this counts as “explaining away.” How do I know my hardcoded reactions aren’t values?
Suppose you have a randomly activated (not weather-dependent) sprinkler system, and it also rains sometimes. These are two independent causes of the sidewalk being wet, each of which is capable of getting the job done all on its own. Suppose you notice that the sidewalk is wet, so it definitely either rained, sprinkled, or both. If I told you it had rained last night, your probability that the sprinklers went on (given that the sidewalk is wet) should go down, since the rain already explains the wet sidewalk. If I told you instead that the sprinklers went on last night, then your probability that it rained (given that the sidewalk is wet) goes down for the same reason. This is what “explaining away” means in causal inference: the probability of a cause, given its effect, goes down when an alternative cause is known to be present.
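To make the numbers concrete, here is a minimal sketch (with made-up priors; the `joint` and `p_sprinkler_given_wet` helpers are purely illustrative, not from the comment) that computes the relevant posteriors by brute-force enumeration, treating wetness as a deterministic OR of the two causes:

```python
# Toy "explaining away" calculation for the rain/sprinkler example.
# Priors are made up for illustration; wetness is a deterministic OR of the two causes.

P_RAIN = 0.3       # illustrative prior that it rained last night
P_SPRINKLER = 0.3  # illustrative prior that the sprinklers ran

def joint(rain: bool, sprinkler: bool) -> float:
    """Prior probability of one (rain, sprinkler) configuration (independent causes)."""
    p_r = P_RAIN if rain else 1 - P_RAIN
    p_s = P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    return p_r * p_s

def p_sprinkler_given_wet(rain_known=None):
    """P(sprinkler | sidewalk wet), optionally also conditioning on whether it rained."""
    numerator = denominator = 0.0
    for rain in (False, True):
        if rain_known is not None and rain != rain_known:
            continue  # condition on the known rain value
        for sprinkler in (False, True):
            if not (rain or sprinkler):
                continue  # the sidewalk is only wet if at least one cause fired
            denominator += joint(rain, sprinkler)
            if sprinkler:
                numerator += joint(rain, sprinkler)
    return numerator / denominator

print(p_sprinkler_given_wet())                # ≈ 0.59: wetness alone raises P(sprinkler) above its 0.3 prior
print(p_sprinkler_given_wet(rain_known=True)) # = 0.30: learning it rained pushes P(sprinkler) back down
```

Learning that the alternative cause (rain) was present pulls P(sprinkler | wet) all the way back down to its prior here, because in this deterministic setup the rain fully accounts for the wet sidewalk.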
In the post, the supposedly independent causes are “hardcoded ant-in-mouth aversion” and “value of eating escamoles”, and the effect is the negative reward. Realizing that you have a hardcoded ant-in-mouth aversion is like learning that the sprinklers were on last night. Learning that the sprinklers were on (incompletely) “explains away” the rain as a cause of the wet sidewalk; likewise, the hardcoded ant-in-mouth aversion explains away the-amount-you-value-escamoles as a cause of the low reward.
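The same calculation can be run for the escamole case. The sketch below is only illustrative: the priors and cause strengths are made up, and it uses a noisy-OR likelihood (each cause only sometimes produces the negative reward), which is one simple way to capture the “(incompletely)” above: learning about the hardcoded aversion pulls the posterior on “I genuinely disvalue escamoles” back toward, but not all the way to, its prior.

```python
# Toy noisy-OR model: a negative reward can be caused by a hardcoded aversion,
# by genuinely disvaluing escamoles, or by both. All numbers are illustrative.

P_AVERSION = 0.5   # prior that I have the hardcoded ant-in-mouth aversion
P_DISVALUE = 0.3   # prior that I genuinely disvalue eating escamoles
CAUSE_STRENGTH = {"aversion": 0.9, "disvalue": 0.8}  # P(negative reward | that cause alone)

def p_neg_reward(aversion: bool, disvalue: bool) -> float:
    """Noisy-OR likelihood of a negative reward given which causes are active."""
    p_none = 1.0
    if aversion:
        p_none *= 1 - CAUSE_STRENGTH["aversion"]
    if disvalue:
        p_none *= 1 - CAUSE_STRENGTH["disvalue"]
    return 1 - p_none

def p_disvalue_given_neg(aversion_known=None):
    """P(disvalue | negative reward), optionally also conditioning on the aversion."""
    numerator = denominator = 0.0
    for aversion in (False, True):
        if aversion_known is not None and aversion != aversion_known:
            continue  # condition on knowing whether the hardcoded aversion is present
        for disvalue in (False, True):
            prior = ((P_AVERSION if aversion else 1 - P_AVERSION)
                     * (P_DISVALUE if disvalue else 1 - P_DISVALUE))
            weight = prior * p_neg_reward(aversion, disvalue)
            denominator += weight
            if disvalue:
                numerator += weight
    return numerator / denominator

print(P_DISVALUE)                                # 0.30: prior on genuinely disvaluing escamoles
print(p_disvalue_given_neg())                    # ≈ 0.46: the negative reward is evidence of disvaluing
print(p_disvalue_given_neg(aversion_known=True)) # ≈ 0.32: the aversion mostly, but not fully, explains it away
```

With a deterministic OR, as in the sprinkler example, the posterior would snap exactly back to the prior; making each cause unreliable is what leaves the explaining-away partial, which also connects to the “matter of degree” point below.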
I’m not totally sure that answers your question; maybe you were asking, “why model my values as a cause of the negative reward, separate from the hardcoded response itself?” If so, I think I’d rephrase the heart of the question as: “what do the values in this reward model actually correspond to out in the world, if anything? What are the ‘real values’ which reward is treated as evidence of?” (We’ve done some thinking about that and might put out a post on it soon.)
Okay, I think one crystallization here for me is that “explaining away” is a matter of degree. (I found the second half of the comment less helpful, but the combo of the first half + John’s response is helpful both for my own updating and for seeing where you guys are currently at.)
The main observation from the quoted block is “man, this sure sounds like explaining away, if I’m treating my hardcoded reactions as a signal which is sometimes influenced by things besides values”. But exactly when do I treat my hardcoded reactions as though they’re being influenced by non-value stuff? I don’t know yet; I don’t yet understand that part.