seems like an interesting idea. I had never heard of it before, and I'm generally decently aware of weird stuff like this, so they probably need to put more effort into publicity. I don't know if truly broad public appeal will ever happen, but I could imagine this being pretty popular among the kinds of people who would, e.g., use Manifold.
one perverse incentive this scheme creates: if you think other charities are better than political donations, you are incentivized to donate to the party with less in its pool, since you get a 1:1 match for free, at the expense of people who actually wanted to support a candidate.
also, in the grand scheme of things, the amount of money in politics isn't that big. it's still a solid chunk, but the TAM is inherently quite limited.
https://slatestarcodex.com/2019/09/18/too-much-dark-money-in-almonds/
i don’t think this is unique to world models. you can also think of rewards as things you move towards or away from. this is compatible with translation/scaling-invariance because if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around.
i have an alternative hypothesis for why positive and negative motivation feel distinct in humans.
although translating the reward doesn't change the expectation of the policy gradient estimator, it hugely affects its variance.[1] in other words, if you always move towards everything, you will still eventually learn the right thing, but it will take a lot longer.
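To illustrate this concretely, here is a minimal numpy sketch (the bandit, learning rate, and step counts are all made up for illustration) of naive REINFORCE on a 3-armed softmax bandit, comparing training on centered rewards against rewards with a large constant added. Given the same parameters, the expected update is identical in both cases, but with shifted rewards the per-step updates are much noisier, so after the same number of steps the policy has, on average, concentrated much less on the best arm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 3-armed bandit: arm 0 is best, and the rewards already average to zero.
arm_rewards = np.array([1.0, 0.0, -1.0])

def mean_prob_best_arm(shift, steps=300, lr=0.1, runs=200):
    """Run softmax REINFORCE with `shift` added to every reward;
    return the probability the trained policy puts on the best arm, averaged over runs."""
    theta = np.zeros((runs, 3))                          # one set of logits per run
    for _ in range(steps):
        probs = np.exp(theta - theta.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        # sample one arm per run from the current policy
        arms = np.argmax(rng.random((runs, 1)) < probs.cumsum(axis=1), axis=1)
        grad_log_pi = np.eye(3)[arms] - probs            # score function for a softmax policy
        theta += lr * (arm_rewards[arms, None] + shift) * grad_log_pi
    probs = np.exp(theta - theta.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 0].mean()

for k in [0.0, 5.0, 20.0]:
    print(f"reward shift k={k:5.1f}  mean P(best arm): {mean_prob_best_arm(k):.3f}")
```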
my hypothesis is that humans have some hard-coded baseline for variance reduction. in the ancestral environment, the expectation of perceived reward was centered around where zero feels to be. our minds do try to adjust to changes in the distribution (e.g. hedonic adaptation), but the adjustment isn't perfect, and so in the current world our baseline may be suboptimal.
Quick proof sketch, treating $\theta$ as a scalar to keep the notation light (this is a very standard result in RL and is the motivation for advantage estimation, but it's still good practice to check things).
The REINFORCE estimator uses the identity $\nabla_\theta \bar{R} = \mathbb{E}_{\tau\sim\pi(\cdot)}[R(\tau)\,\nabla_\theta\log\pi(\tau)]$, where $\bar{R} = \mathbb{E}_{\tau\sim\pi(\cdot)}[R(\tau)]$ is the expected return.
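To make the notation concrete, here is a minimal numpy sketch of this estimator for the simplest possible case, a 3-armed softmax bandit where a "trajectory" is a single arm pull (the logits and rewards are arbitrary): each sample contributes $R(\tau)\,\nabla_\theta\log\pi(\tau)$, and averaging many samples approximates $\nabla_\theta\bar{R}$.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.1, -0.2, 0.3])        # policy logits (the parameters)
arm_rewards = np.array([1.0, 0.0, -1.0])  # arbitrary reward for each arm

def reinforce_samples(theta, arm_rewards, n=100_000):
    """Per-sample estimates R(tau) * grad_theta log pi(tau), shape (n, 3)."""
    probs = np.exp(theta) / np.exp(theta).sum()
    arms = rng.choice(len(theta), size=n, p=probs)
    # for a softmax policy, d/d theta_j of log pi(a) is 1[j == a] - probs[j]
    grad_log_pi = np.eye(len(theta))[arms] - probs
    return arm_rewards[arms, None] * grad_log_pi

samples = reinforce_samples(theta, arm_rewards)
print("Monte Carlo estimate of the policy gradient:", samples.mean(axis=0))
```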
WLOG, suppose we define a new reward $R'(\tau) = R(\tau) + 1$ (and assume $\mathbb{E}[R(\tau)] = 0$, so $R'$ is shifted away from the mean, i.e. we now move towards everything).
Then we can verify that the expectation of the gradient estimator is unchanged:
$$\mathbb{E}_{\tau\sim\pi(\cdot)}[R'(\tau)\,\nabla_\theta\log\pi(\tau)] - \mathbb{E}_{\tau\sim\pi(\cdot)}[R(\tau)\,\nabla_\theta\log\pi(\tau)] = \mathbb{E}_{\tau\sim\pi(\cdot)}[\nabla_\theta\log\pi(\tau)] = \int \pi(\tau)\,\frac{\nabla_\theta\pi(\tau)}{\pi(\tau)}\,d\tau = \nabla_\theta\!\int\pi(\tau)\,d\tau = 0.$$
But the variance (generically) increases:
$$\mathrm{Var}_{\tau\sim\pi(\cdot)}[R(\tau)\,\nabla_\theta\log\pi(\tau)] = \int R(\tau)^2\,(\nabla_\theta\log\pi(\tau))^2\,\pi(\tau)\,d\tau - (\nabla_\theta\bar{R})^2$$
$$\mathrm{Var}_{\tau\sim\pi(\cdot)}[R'(\tau)\,\nabla_\theta\log\pi(\tau)] = \int (R(\tau)+1)^2\,(\nabla_\theta\log\pi(\tau))^2\,\pi(\tau)\,d\tau - (\nabla_\theta\bar{R})^2$$
(the subtracted term is the same in both lines because the two estimators have the same expectation, as shown above).
So:
$$\mathrm{Var}_{\tau\sim\pi(\cdot)}[R'(\tau)\,\nabla_\theta\log\pi(\tau)] - \mathrm{Var}_{\tau\sim\pi(\cdot)}[R(\tau)\,\nabla_\theta\log\pi(\tau)] = 2\int R(\tau)\,(\nabla_\theta\log\pi(\tau))^2\,\pi(\tau)\,d\tau + \int(\nabla_\theta\log\pi(\tau))^2\,\pi(\tau)\,d\tau$$
The second term is manifestly non-negative; the first (cross) term can have either sign, but it only grows linearly with the size of the offset. More generally, if $\mathbb{E}[R]=k$, the extra variance grows as $O(k^2)$, since the analogue of the second term becomes $k^2\int(\nabla_\theta\log\pi(\tau))^2\,\pi(\tau)\,d\tau$. So having your rewards be badly uncentered hurts a ton.
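As a quick numerical sanity check of the $O(k^2)$ claim, here is a sketch on the same kind of made-up softmax bandit as above: adding a constant $k$ to centered rewards leaves the Monte Carlo mean of the per-sample gradients roughly unchanged, while their summed per-component variance blows up roughly like $k^2\,\mathbb{E}[(\nabla_\theta\log\pi(\tau))^2]$. (At the largest shift the printed mean itself gets visibly noisier, which is exactly the point.)

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.1, -0.2, 0.3])
probs = np.exp(theta) / np.exp(theta).sum()
arm_rewards = np.array([1.0, 0.0, -1.0])
arm_rewards = arm_rewards - probs @ arm_rewards      # center so E[R] = 0 under the policy

def grad_samples(shift, n=500_000):
    """Per-sample REINFORCE estimates with a constant `shift` added to the reward."""
    arms = rng.choice(len(theta), size=n, p=probs)
    grad_log_pi = np.eye(len(theta))[arms] - probs
    return (arm_rewards[arms, None] + shift) * grad_log_pi

for k in [0.0, 1.0, 10.0, 100.0]:
    g = grad_samples(shift=k)
    print(f"k={k:6.1f}  mean grad = {np.round(g.mean(axis=0), 3)}  "
          f"summed variance = {g.var(axis=0).sum():.2f}")
```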