An AI trained with RL that suddenly gets access to self-modifying actions might (briefly) have value dynamics according to idiosyncratic considerations that do not necessarily contain human-like guardrails. You could call this “systematization,” but it’s not proceeding according to the same story that governed systematization during training by gradient descent.
I think the issues of self-modification and lack of guardrails are really important. I think we're likely to run into them in humans too, once our neuro-modification tech gets good enough and accessible enough.
This is a significant part of why I argue that an AI built with a human-brain-like architecture, and constrained to follow human-brain updating rules, would be safer: we are familiar with the set of moves human brains make when doing online learning. Trying to apply the same instincts to ML models can lead us badly astray. Fine-tuning a model on an amount of data proportional to a day in the life of an adult human should be expected to produce far more dramatic changes to the model. Or far less, depending on the learning rate. But, importantly, weird changes. Changes we can't accurately predict by thinking about a day in the life of a human who read some new set of articles.
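The learning-rate point can be made concrete with a toy sketch (my own illustration, not something from the discussion): under plain SGD, the size of a single parameter update is directly proportional to the learning rate, so the "same amount" of fine-tuning data can move a model by wildly different amounts depending on that one hyperparameter.

```python
import numpy as np

# Toy sketch: one SGD step on a linear model with squared-error loss.
# The parameter change ||delta_theta|| = lr * ||grad|| scales linearly
# with the learning rate, so "a day's worth" of gradient steps can mean
# very different total change to the model.

rng = np.random.default_rng(0)
theta = rng.normal(size=10)       # toy model parameters
x = rng.normal(size=10)           # one training example (a "new article")
y = 1.0                           # its target

def sgd_step(theta, lr):
    pred = theta @ x
    grad = 2.0 * (pred - y) * x   # gradient of (pred - y)^2 w.r.t. theta
    return theta - lr * grad

for lr in (1e-4, 1e-1):
    delta = np.linalg.norm(sgd_step(theta, lr) - theta)
    print(f"lr={lr:g}: parameter change = {delta:.6f}")
```

A 1000x larger learning rate produces a 1000x larger update from the identical data, which is part of why intuitions calibrated on "a human read some articles" transfer poorly to fine-tuning runs.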
Why not? It seems like this is a good description of how values change for humans under self-reflection; why not for AIs?