I like this as a description of value drift under training and regularization. It’s not actually an inevitable process: we’re just heading for something like the minimum circuit complexity of the whole system, and usually that stores some precomputation or otherwise isn’t totally systematized. But though I’m sure the literature on the intersection of NNs and circuit complexity is fascinating, I’ve never read it, so my intuition may be bad.
But I don’t like this as a description of value drift under self-reflection. I see this post more as “this is what you get right after offline training” than “this is the whole story that needs to have an opinion on the end state of the galaxy.”
Why not? It seems like this is a good description of how values change for humans under self-reflection; why not for AIs?
An AI trained with RL that suddenly gets access to self-modifying actions might (briefly) have value dynamics driven by idiosyncratic considerations that do not necessarily contain human-like guardrails. You could call this “systematization,” but it’s not proceeding according to the same story that governed systematization during training by gradient descent.
I think the idea of self-modification and lack of guardrails is really important. I think we’re likely to run into issues with this in humans too once our neuro-modification tech gets good enough / accessible enough.
This is a significant part of why I argue that an AI built with a human-brain-like architecture and constrained to follow human-brain updating rules would be safer. We are familiar with the set of moves that human brains make when doing online learning. Trying to apply the same instincts to ML models can lead us badly astray. Fine-tuning a model on an amount of data proportional to a day in the life of an adult human should be expected to produce far more dramatic changes to the model, or far smaller ones, depending on the learning rate (see the toy sketch below). But importantly, weird changes: changes we can’t accurately predict by thinking about a day in the life of a human who read some new set of articles.
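As a toy illustration of that learning-rate point (my own sketch, with made-up numbers, not anything from the post or a real fine-tuning setup): the same fixed batch of “new” data can barely nudge a model’s weights or rewrite a large fraction of them, depending only on the optimizer settings.

```python
import numpy as np

# Toy sketch (hypothetical setup): the same fixed batch of "new" data moves a
# model's weights by wildly different amounts depending on the learning rate,
# so "a day's worth of new articles" has no fixed effect size on its own.

rng = np.random.default_rng(0)

dim = 20
w0 = rng.normal(size=dim)        # stand-in for pretrained weights
x = rng.normal(size=(64, dim))   # one small batch of new data
y = rng.normal(size=64)          # targets for that batch

def finetune(w_init, lr, steps=10):
    """Run a few plain gradient-descent steps on mean squared error."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2 * x.T @ (x @ w - y) / len(y)
        w = w - lr * grad
    return w

for lr in (1e-4, 1e-2, 1e-1):
    w = finetune(w0, lr)
    rel_change = np.linalg.norm(w - w0) / np.linalg.norm(w0)
    print(f"lr={lr:g}: relative weight change = {rel_change:.3f}")
```

The point is just that “a proportional amount of data” doesn’t pin down the size (or the character) of the update; the optimizer hyperparameters do, which is part of why day-in-the-life-of-a-human intuitions transfer badly to fine-tuning.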