Talking about broad basins in this community reminds me of the claims of corrigible AI: that if you could get an AGI that is at least a little bit willing to cooperate with humans in adjusting its own preferences to better align with ours, then it would fall into a broad basin of corrigibility and become more aligned with human values over time.
I realize that the basins you’re talking about here are more related to perception than to value alignment, but do you think your work could apply there? In other words, if broad basins of network performance/generalization can be found by increasing the number of disentangled features available to the network, would adding more independent measures of value help an AI find broader basins in value space that would be more likely to contain human-aligned regions?
Maybe humans are able to align with each other so well because we have so many dimensions of value, all competing with each other to drive our behavior. A human brain might have an entire ensemble of empathy mechanisms, each making predictions about some independent, low-dimensional projection of other humans’ preferences, wiring up those predictions into competing goal-generating modules.
Any one estimate of value would be prone to Goodharting, but maybe adding enough dimensions could make that easier to avoid?
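To make that last guess a bit more concrete, here is a minimal toy sketch (my own illustration, not anything from the post, and it only models the simplest regressional form of Goodhart): each hypothetical "empathy module" is treated as an independently noisy proxy for a scalar true value, and we compare how much true value is lost when a hard optimizer maximizes a single proxy versus the average of many.

```python
# Toy sketch (assumption-laden): optimizing one noisy proxy of "true value"
# versus the mean of many independent noisy proxies. With more proxies, the
# aggregate tracks true value better, i.e. it is harder to Goodhart this way.
import numpy as np

rng = np.random.default_rng(0)
dim = 20                                        # dimensionality of the behaviour space
candidates = rng.normal(size=(100_000, dim))    # random candidate behaviours

true_direction = rng.normal(size=dim)
true_direction /= np.linalg.norm(true_direction)
true_value = candidates @ true_direction        # the thing we actually care about

def goodhart_gap(n_proxies: int) -> float:
    """Pick the candidate maximizing the average of n noisy proxy scores,
    and report how much true value it loses relative to the best candidate."""
    # Each proxy sees true value only through its own independent noise.
    noise = rng.normal(scale=2.0, size=(n_proxies, len(candidates)))
    proxy_scores = true_value[None, :] + noise
    chosen = np.argmax(proxy_scores.mean(axis=0))
    return true_value.max() - true_value[chosen]

for n in (1, 4, 16, 64):
    print(f"{n:3d} proxies -> true-value shortfall {goodhart_gap(n):.2f}")
```

Of course this only captures averaging-out of independent errors, not the richer picture of competing goal-generating modules, but it is the part of the intuition that is easy to check.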
I don’t suppose you could link me to a post arguing for this.
I was thinking of paulfchristiano’s articles on corrigibility (https://www.lesswrong.com/posts/fkLYhTQteAu5SinAc/corrigibility):
A benign act-based agent will be robustly corrigible if we want it to be.
A sufficiently corrigible agent will tend to become more corrigible and benign over time. Corrigibility marks out a broad basin of attraction towards acceptable outcomes.
I think we’re far off from being able to make any concrete claims about selection dynamics with this, let alone selection dynamics about things as complex and currently ill-operationalised as “goals”.
I’d hope to be able to model complicated things like this once Selection Theory is more advanced, but right now this is just attempting to find angles to build up the bare basics.