Fabien Roger comments on The shard theory of human values

Fabien Roger 8 Sep 2022 12:07 UTC
LW: 4 AF: 1
Thank you for the post!
I found it interesting to think about how self-supervised learning + RL can lead to human-like value formation. However, I’m not sure how much predictive power you gain from the shards. The model of value formation you present feels close to the AlphaGo setup:
You have an encoder E, an action decoder D, and a value head V. You train D°E with something close to self-supervised learning (not entirely accurate, but I can imagine other RL systems in which D°E is trained with exactly supervised learning), and train V°E with hard-coded sparse rewards. This looks very close to shard theory, except that you replace V with a bunch of shards, right? However, I think this latter part doesn’t make predictions different from “V is a neural network”, because neural networks often learn context-dependent things, and I expect AlphaGo’s V-network to be very context-dependent.
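To make the comparison concrete, here is a minimal sketch of the setup described above, using only the Python standard library. Everything in it is an illustrative assumption rather than anything taken from the post or from AlphaGo: the layer sizes, the random untrained weights, and in particular the `sharded_value` function, which models "a bunch of shards" as a context-gated mixture of per-shard value estimates. It is only meant to show that such a sharded head has the same input and output as a plain V head, which is the sense in which "replace V with shards" may not add predictions on its own.

```python
import math
import random

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
OBS, HID, ACTIONS, SHARDS = 4, 3, 2, 3  # arbitrary sizes, chosen for illustration

W_E = rand_matrix(HID, OBS, rng)         # encoder E
W_D = rand_matrix(ACTIONS, HID, rng)     # action decoder D
W_V = rand_matrix(1, HID, rng)           # monolithic value head V
W_gate = rand_matrix(SHARDS, HID, rng)   # hypothetical: how active each shard is in a context
W_shard = rand_matrix(SHARDS, HID, rng)  # hypothetical: each shard's own value estimate

def E(obs):
    """Shared encoder: observation -> latent features."""
    return [math.tanh(h) for h in linear(W_E, obs)]

def policy(obs):
    """D°E: distribution over actions (the part trained with near-supervised learning)."""
    return softmax(linear(W_D, E(obs)))

def value(obs):
    """V°E: a single scalar value (the part trained from sparse rewards)."""
    return linear(W_V, E(obs))[0]

def sharded_value(obs):
    """Hypothetical sharded head: context-dependent gates weight per-shard values."""
    h = E(obs)
    gates = softmax(linear(W_gate, h))
    shard_values = linear(W_shard, h)
    return sum(g * v for g, v in zip(gates, shard_values))

obs = [0.1, -0.2, 0.3, 0.4]
print(policy(obs), value(obs), sharded_value(obs))
```

Both `value` and `sharded_value` map an observation to one scalar through the same encoder, so from the outside the sharded version is just one particular parameterization of V.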
Is sharding a way to understand what neural networks can do in human-understandable terms? Or is it a claim about what kind of neural network V is (because there are neural networks which aren’t very “shard-like”)?
Or do you think that sharding explains more than “the brain is like AlphaGo”? For example, maybe it’s hard for different parts of the V network to self-reflect. But that feels pretty weak, because humans don’t do that much either. Did I miss important predictions that shard theory makes and the classic RL + supervised learning setup doesn’t?