wact := fact about the world
mact := fact about the mind
aact := fact about the agent more generally
vwact := value assigned by some agent to a fact about the world
David Lorell
Seems accurate to me. This has been an exercise in the initial step(s) of CCC, which indeed consist of “the phenomenon looks this way to me. It also looks that way to others? Cool. What are we all cottoning on to?”
Wait. I thought that was crossing the is-ought gap. As I think of it, the is-ought gap refers to the apparent type-clash and unclear evidential entanglement between facts-about-the-world and values-an-agent-assigns-to-facts-about-the-world. And also as I think of it, “should be” is always shorthand for “should be according to me,” though it possibly means some kind of aggregated thing that also grounds out in subjective shoulds.
So “how the external world is” does not tell us “how the external world should be” … except insofar as the external world has become causally/logically entangled with a particular agent’s ‘true values’. (Punting on what an agent’s “true values” are as opposed to the much easier “motivating values” or possibly “estimated true values.” But for the purposes of this comment, it’s sufficient to assume that they are dependent on some readable property (or logical consequence of readable properties) of the agent itself.)
We have at least one jury-rigged idea! Conceptually. Kind of.
Yeeeahhh.… But maybe it’s just awkwardly worded rather than being deeply confused. Like: “The learned algorithms which an adaptive system implements may not necessarily accept, output, or even internally use data(structures) which have any relationship at all to some external environment.” “Also what the hell is ‘reference’.”
Seconded. I have extensional ideas about “symbolic representations” and how they differ from.… non-representations.… but I would not trust this understanding with much weight.
Seconded. Comments above.
Indeed, our beliefs-about-values can be integrated into the same system as all our other beliefs, allowing for e.g. ordinary factual evidence to become relevant to beliefs about values in some cases.
Super unclear to the uninitiated what this means. (And therefore threateningly confusing to our future selves.)
Maybe: “Indeed, we can plug ‘value’ variables into our epistemic models (like, for instance, our models of what brings about reward signals) and update them as a result of non-value-laden facts about the world.”
But clearly the reward signal is not itself our values.
Ahhhh
Maybe: “But presumably the reward signal does not plug directly into the action-decision system.”?
Or: “But intuitively we do not value reward for its own sake.”?
It does seem like humans have some kind of physiological “reward”, in a hand-wavy reinforcement-learning-esque sense, which seems to at least partially drive the subjective valuation of things.
Hrm… If this compresses down to, “Humans are clearly compelled at least in part by what ‘feels good’.” then I think it’s fine. If not, then this is an awkward sentence and we should discuss.
an agent could aim to pursue any values regardless of what the world outside it looks like;
Without knowing what values are, it’s unclear that an agent could aim to pursue any of them. The implicit model here is that there is something like a value function in DP which gets passed into the action-decider along with the world model and that drives the agent. But I think we’re saying something more general than that.
but the fact that it makes sense to us to talk about our beliefs
Better terminology for the phenomenon of “making sense” in the above way?
“learn” in the sense that their behavior adapts to their environment.
I want a new word for this. “Learn” vs “Adapt” maybe. Learn means updating of symbolic references (maps) while Adapt means something like responding to stimuli in a systematic way.
Not quite what we were trying to say in the post. Rather than tradeoffs being decided on reflection, we were trying to talk about the causal-inference-style “explaining away” which the reflection gives enough compute for. In Johannes’s example, the idea is that the sadist might model the reward as coming potentially from two independent causes: a hardcoded sadist response, and “actually” valuing the pain caused. Since the probability of one cause, given the effect, goes down when we also know that the other cause definitely obtained, the sadist might lower their probability that they actually value hurting people given that (after reflection) they’re quite sure they are hardcoded to get reward for it. That’s how it’s analogous to the ant thing.
Suppose you have a randomly activated (not dependent on weather) sprinkler system, and also it rains sometimes. These are two independent causes for the sidewalk being wet, each of which is capable of getting the job done all on its own. Suppose you notice that the sidewalk is wet, so it definitely either rained, sprinkled, or both. If I told you it had rained last night, your probability that the sprinklers went on (given that it is wet) should go down, since the rain already explains the wet sidewalk. If I told you instead that the sprinklers went on last night, then your probability of it having rained (given that it is wet) goes down for a similar reason. This is what “explaining away” is in causal inference. The probability of a cause given its effect goes down when an alternative cause is known to be present.
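To make the sprinkler example concrete, here is a minimal sketch that brute-forces the posteriors by enumerating the four possible (rain, sprinkler) worlds. The prior probabilities (0.2 for rain, 0.3 for sprinklers) and the function name `posterior` are made up for illustration, and the sketch assumes the sidewalk is wet exactly when at least one cause occurred.

```python
from itertools import product

def posterior(query, evidence, p_rain=0.2, p_sprinkler=0.3):
    """P(query is True | evidence), by enumerating (rain, sprinkler) worlds.

    Assumptions (illustrative only): rain and sprinkler are independent,
    and the sidewalk is wet if and only if at least one of them happened.
    """
    numerator = 0.0
    denominator = 0.0
    for rain, sprinkler in product([True, False], repeat=2):
        world = {"rain": rain, "sprinkler": sprinkler, "wet": rain or sprinkler}
        # Skip worlds that contradict what we have observed.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        prob = (p_rain if rain else 1 - p_rain) * (
            p_sprinkler if sprinkler else 1 - p_sprinkler
        )
        denominator += prob
        if world[query]:
            numerator += prob
    return numerator / denominator

# Only knowing the sidewalk is wet: sprinklers are a likely culprit.
p_sprinkler_given_wet = posterior("sprinkler", {"wet": True})

# Also learning that it rained: the rain "explains away" the wetness.
p_sprinkler_given_wet_and_rain = posterior("sprinkler", {"wet": True, "rain": True})

print(p_sprinkler_given_wet, p_sprinkler_given_wet_and_rain)
```

With these made-up priors, the probability of the sprinklers having run drops from roughly 0.68 (wet sidewalk only) to 0.30 (wet sidewalk and known rain), which is just the sprinkler prior again, since the rain already fully accounts for the wetness.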
In the post, the supposedly independent causes are “hardcoded ant-in-mouth aversion” and “value of eating escamoles”, and the effect is negative reward. Realizing that you have a hardcoded ant-in-mouth aversion is like learning that the sprinklers were on last night. The sprinklers being on (incompletely) “explains away” the rain as a cause for the sidewalk being wet. The hardcoded ant-in-mouth aversion explains away the-amount-you-value-escamoles as a cause for the low reward.
I’m not totally sure if that answers your question. Maybe you were asking, “why model my values as a cause of the negative reward, separate from the hardcoded response itself”? If so, I think I’d rephrase the heart of the question as, “what do the values in this reward model actually correspond to out in the world, if anything? What are the ‘real values’ which reward is treated as evidence of?” (We’ve done some thinking about that and might put out a post on that soon.)
This is fascinating and I would love to hear about anything else you know of a similar flavor.
Seconded!!
Anecdotal 2¢: This is very accurate in my experience. Basically every time I talk to someone outside of tech/alignment about AI risk, I have to go through the whole “we don’t know what algorithms the AI is running to do what it does. Yes, really.” thing. Every time I skip this accidentally, I realize after a while that this is where a lot of confusion is coming from.
1. “Trust” does seem to me to often be an epistemically broken thing that rides on human-peculiar social dynamics and often shakes out to gut-understandings of honor and respect and loyalty etc.
2. I think there is a version that doesn’t route through that stuff. Trust in the “trust me” sense is a bid for present-but-not-necessarily-permanent suspension of disbelief, where the stakes are social credit. I.e., when I say, “trust me on this,” I’m really saying something like, “All of that anxious analysis you might be about to do to determine if X is true? Don’t do it. I claim that, using my best-effort model of your values, the thing you should assume/do to fulfill them in this case is X. To the extent that you agree that I know you well and want to help you and tend to do well for myself in similar situations, defer to me on this. I predict you’ll thank me for it (because, e.g., confirming it yourself before acting is costly), and if not... well, I’m willing to stake some amount of the social credit I have with you on it.” [Edit: By social credit here I meant something like: The credence you give to it being a good idea to engage with me like this.]
Similarly:
“I decided to trust her” → “I decided to defer to her claims on this thing without looking into it much myself (because it would be costly to do otherwise and I believe—for some reason—that she is sufficiently likely to come to true conclusions on this, is probably trying to help me, knows me fairly well etc.) And if this turns out badly, I’ll (hopefully) stop deciding to do this.”
“Should I trust him?” → “Does the cost/benefit analysis gestured at above come out net positive in expectation if I defer to him on this?”
“They offered me their trust” → “They believe that deferring to me is their current best move and if I screw this up enough, they will (hopefully) stop thinking that.”
So, I feel like I’ve landed fairly close to where you did but there is a difference in emphasis or maybe specificity. There’s more there than asking “what do they believe, and what caused them to believe it?” Like, that probably covers it but more specifically the question I can imagine people asking when wondering whether or not to “trust” someone is instead, “do I believe that deferring these decisions/assumptions to them in this case will turn out better for me than otherwise?” Where the answer can be “yes” because of things like cost-of-information or time constraints etc. If you map “what do they believe” to “what do they believe that I should assume/do” and “what caused them to believe it” to “how much do they want to help me, how well do they know me, how effective are they in this domain, …” then we’re on the same page.
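As a toy illustration of the cost/benefit framing above, here is a minimal sketch comparing "defer" against "check it myself first". Everything in it is an assumption made for illustration: the function names, the numbers, and the simplification that checking always reveals the truth (so you only act on X when X is actually right, but you pay a checking cost either way).

```python
def expected_value_of_deferring(p_right, gain_if_right, loss_if_wrong):
    """Act on their claim without checking; you eat the loss if they were wrong."""
    return p_right * gain_if_right - (1 - p_right) * loss_if_wrong

def expected_value_of_checking(p_right, gain_if_right, cost_of_checking):
    """Pay to verify first (assumed perfectly reliable), then act only if right."""
    return p_right * gain_if_right - cost_of_checking

def should_defer(p_right, gain_if_right, loss_if_wrong, cost_of_checking):
    """Defer when it beats checking in expectation.

    Algebraically this reduces to: (1 - p_right) * loss_if_wrong < cost_of_checking.
    """
    defer = expected_value_of_deferring(p_right, gain_if_right, loss_if_wrong)
    check = expected_value_of_checking(p_right, gain_if_right, cost_of_checking)
    return defer > check

# Someone who knows me well and is usually right: deferring wins.
print(should_defer(p_right=0.9, gain_if_right=10, loss_if_wrong=20, cost_of_checking=5))

# Same stakes, but I think they are right only 60% of the time: better to check.
print(should_defer(p_right=0.6, gain_if_right=10, loss_if_wrong=20, cost_of_checking=5))
```

In this toy version, "how much do they want to help me, how well do they know me, how effective are they in this domain" all get squashed into the single number `p_right`, and cost-of-information shows up as `cost_of_checking`. Deferring is the better move exactly when the expected loss from their being wrong is smaller than what verification would cost.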
Some nits we know about but didn’t include in the problems section:
P[mushroom->anchovy] = 0. The current argument does not handle the case where subagents believe that there is a probability of 0 on one of the possible states. It wouldn’t be possible to complete the preferences exactly as written, then.
Indifference. If anchovy were placed directly above mushroom in the preference graph above (so that John is truly indifferent between them), then that might require some special handling. But also it might just work if the “Value vs Utility” issue is worked out. If the subagents are not myopic / handle instrumental values, then whether anchovy is less, identically, or more desirable than mushroom doesn’t really matter so much on its own as opposed to what opportunities are possible afterward from the anchovy state relative to the mushroom state.
Also, I think I buy the following part but I really wish it were more constructive.
Now, we haven’t established which distribution of preferences the system will end up sampling from. But so long as it ends up at some non-dominated choice, it must end up with non-strongly-incomplete preferences with probability 1 (otherwise it could modify the contract for a strict improvement in cases where it ends up with non-strongly-incomplete preferences). And, so long as the space of possibilities is compact and arbitrary contracts are allowed, all we have left is a bargaining problem. The only way the system would end up with dominated preference-distribution is if there’s some kind of bargaining breakdown.
wiggitywiggitywact := fact about the world which requires a typical human to cross a large inferential gap.