Hey TurnTrout.
I’ve always thought of your shard theory as something like path-dependence? For example, a human is more excited about making plans with a friend while they’re currently talking to that friend. You mentioned this in a talk as evidence that shard theory applies to humans. Basically, the shard “hang out with Alice” is weighted more highly in contexts where Alice is nearby.
Let’s say $\pi : (S \times A)^* \times S \to \Delta A$ is a policy with state space $S$ and action space $A$.
A “context” is a small moving window in the state-history, i.e. an element of $S^d$ where $d$ is a small positive integer.
A shard is something like $u_i : S \times A \to \mathbb{R}$, i.e. it evaluates actions given particular states.
The shards $u_1, \dots, u_n$ are “activated” by contexts, i.e. $g_i : S^d \to \mathbb{R}_{\geq 0}$ maps each context to the amount that shard $u_i$ is activated by the context.
The total activation of $u_i$, given a history $h := (s_1, a_1, s_2, a_2, \dots, s_{N-1}, a_{N-1}, s_N)$, is given by the time-discounted sum of the activations across past contexts, i.e. $\lambda_i = g_i(s_{N-d+1}, \dots, s_N) + \beta \cdot g_i(s_{N-d}, \dots, s_{N-1}) + \beta^2 \cdot g_i(s_{N-d-1}, \dots, s_{N-2}) + \cdots$ for some decay factor $\beta \in (0, 1)$.
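In code, here’s a minimal sketch of what I mean by that discounted activation (the window size `d`, the decay factor `beta`, and treating states as strings are all placeholder assumptions on my part):

```python
from typing import Callable, Sequence

def total_activation(
    g: Callable[[Sequence[str]], float],  # g_i: maps a length-d context to a non-negative activation
    states: Sequence[str],                # (s_1, ..., s_N) taken from the history h
    d: int,                               # context window size
    beta: float,                          # decay factor, assumed 0 < beta < 1
) -> float:
    """lambda_i = g(s_{N-d+1..N}) + beta * g(s_{N-d..N-1}) + beta^2 * ..."""
    total = 0.0
    # Slide the length-d window back one step at a time, discounting by beta each step.
    for k, end in enumerate(range(len(states), d - 1, -1)):
        total += (beta ** k) * g(states[end - d:end])
    return total
```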
The overall utility function $u$ is the activation-weighted sum of the shards, i.e. $u = \lambda_1 \cdot u_1 + \dots + \lambda_n \cdot u_n$.
Finally, the policy $\pi$ soft-maximises the utility function, i.e. $\pi(h) = \operatorname{softmax}_{a \in A} u(s_N, a)$, so $\pi(a \mid h) \propto \exp(u(s_N, a))$.
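And a toy end-to-end version of the whole setup, under the same placeholder assumptions, where each shard scores (state, action) pairs and the policy softmaxes their activation-weighted sum:

```python
import math
from typing import Callable, Sequence

def shard_policy(
    shards: Sequence[Callable[[str, str], float]],  # u_i : S x A -> R
    activations: Sequence[float],                   # lambda_i, e.g. from total_activation above
    state: str,                                     # current state s_N
    actions: Sequence[str],                         # available actions A
) -> dict[str, float]:
    """pi(a | h) proportional to exp(u(s_N, a)), where u = sum_i lambda_i * u_i."""
    utilities = [
        sum(lam * u(state, a) for lam, u in zip(activations, shards))
        for a in actions
    ]
    z = max(utilities)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(v - z) for v in utilities]
    total = sum(exps)
    return {a: e / total for a, e in zip(actions, exps)}

# E.g. a "hang out with Alice" shard that likes "make_plans" when Alice
# appears in the (stringly-typed) state, activated strongly right now:
alice = lambda s, a: 1.0 if ("alice" in s and a == "make_plans") else 0.0
print(shard_policy([alice], [2.0], "talking_to_alice", ["make_plans", "work"]))
```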
Is this what you had in mind?