Hey TurnTrout.
I’ve always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with a friend while they’re currently talking to that friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard “hang out with Alice” is weighted more highly in contexts where Alice is nearby.
Let’s say $\pi : (S \times A)^* \times S \to \Delta A$ is a policy with state space $S$ and action space $A$.
A “context” is a small moving window in the state-history, i.e. an element of $S^d$ where $d$ is a small positive integer.
A shard is something like $u_i : S \times A \to \mathbb{R}$, i.e. it evaluates actions given particular states.
The shards $u_1, \dots, u_n$ are “activated” by contexts, i.e. $g_i : S^d \to \mathbb{R}_{\geq 0}$ maps each context to the amount that shard $u_i$ is activated by the context.
The total activation of $u_i$, given a history $h := (s_1, a_1, s_2, a_2, \dots, s_{N-1}, a_{N-1}, s_N)$, is given by the time-discounted sum of the activations across past contexts, i.e. $\lambda_i = g_i(s_{N-d+1}, \dots, s_N) + \beta \cdot g_i(s_{N-d}, \dots, s_{N-1}) + \beta^2 \cdot g_i(s_{N-d-1}, \dots, s_{N-2}) + \cdots$ for some decay factor $\beta \in (0, 1)$.
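In code, here’s a minimal sketch of what I mean by that discounted activation (the window size `d`, the decay factor `beta`, and treating states as strings are all placeholder assumptions on my part):

```python
from typing import Callable, Sequence

def total_activation(
    g: Callable[[Sequence[str]], float],  # g_i: maps a length-d context to a non-negative activation
    states: Sequence[str],                # (s_1, ..., s_N) taken from the history h
    d: int,                               # context window size
    beta: float,                          # decay factor, assumed 0 < beta < 1
) -> float:
    """lambda_i = g(s_{N-d+1..N}) + beta * g(s_{N-d..N-1}) + beta^2 * ..."""
    total = 0.0
    # Slide the length-d window back one step at a time, discounting by beta each step.
    for k, end in enumerate(range(len(states), d - 1, -1)):
        total += (beta ** k) * g(states[end - d:end])
    return total
```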
The overall utility function $u$ is the activation-weighted sum of the shards, i.e. $u = \lambda_1 \cdot u_1 + \dots + \lambda_n \cdot u_n$.
Finally, the policy $\pi$ soft-maximises the utility function, i.e. $\pi(h) = \operatorname{softmax}_{a \in A} u(s_N, a)$, so $\pi(a \mid h) \propto \exp(u(s_N, a))$.
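And a toy end-to-end version of the whole setup, under the same placeholder assumptions, where each shard scores (state, action) pairs and the policy softmaxes their activation-weighted sum:

```python
import math
from typing import Callable, Sequence

def shard_policy(
    shards: Sequence[Callable[[str, str], float]],  # u_i : S x A -> R
    activations: Sequence[float],                   # lambda_i, e.g. from total_activation above
    state: str,                                     # current state s_N
    actions: Sequence[str],                         # available actions A
) -> dict[str, float]:
    """pi(a | h) proportional to exp(u(s_N, a)), where u = sum_i lambda_i * u_i."""
    utilities = [
        sum(lam * u(state, a) for lam, u in zip(activations, shards))
        for a in actions
    ]
    z = max(utilities)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(v - z) for v in utilities]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}

# E.g. a "hang out with Alice" shard that likes "make_plans" when Alice
# appears in the (stringly-typed) state, activated strongly right now:
alice = lambda s, a: 1.0 if ("alice" in s and a == "make_plans") else 0.0
print(shard_policy([alice], [2.0], "talking_to_alice", ["make_plans", "work"]))
```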
Is this what you had in mind?