I really like this perspective. There may be no way to find a human’s One True Value Function (to say nothing of humanity’s), not only because humans are complicated to model, but also because there is probably no such thing as a human’s One True Value Function in the first place (even less so for humanity as a whole). Similar to what you said, it could very well be heuristics all the way down: heuristics both in what is valued (preferences) and in how different heuristic values compete (meta-preferences). Natural selection has fine-tuned both levels into something that works well enough for survival and reproduction within the domain of validity of humans’ ancestral environment, while each individual can further fine-tune their own preferences and meta-preferences toward whatever yields the greatest perceived harmony among them, within the domain of validity of their lived personal experience.
In AI, the concept of multiple competing value functions could be realized through an ensemble of models, each of which learns a value function independently. If the sub-models receive slightly different inputs or start from different random weight initializations, they will each learn a slightly different value function, and the ensemble variance in predicted value (or its precision = 1/variance) can then serve as a measure of the domain of validity. Regions of state space where the sub-models largely agree on value (low variance / high precision) are “safe” to explore, while regions with large disagreement in predicted value (high variance / low precision) are “unsafe”. Of course, a creativity or curiosity drive could motivate the system to push the frontier of the safe region, but there would always come a point where the potential value of exploring further is outweighed by the risk, which I guess falls under the umbrella of “meta-preference”.
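To make that concrete, here is a minimal sketch of the ensemble idea in Python/NumPy. The model details are made up for illustration (random-ReLU-feature regressors standing in for whatever value-function approximator you would actually train, and the names RandomFeatureValueModel and ensemble_value are mine); the point is just that ensemble disagreement yields a precision score a planner could threshold to label regions of state space as safe or unsafe.

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomFeatureValueModel:
    """One ensemble member: random ReLU features with a least-squares readout.

    Each member draws its own random features, so members agree where the
    training data is dense and extrapolate differently far outside it.
    """

    def __init__(self, state_dim, n_features=64):
        self.A = rng.normal(size=(n_features, state_dim))
        self.b = rng.normal(size=n_features)
        self.w = np.zeros(n_features)

    def _features(self, states):
        return np.maximum(states @ self.A.T + self.b, 0.0)

    def fit(self, states, values):
        phi = self._features(states)
        # Small ridge term keeps the least-squares solve well conditioned.
        self.w = np.linalg.solve(phi.T @ phi + 1e-3 * np.eye(phi.shape[1]),
                                 phi.T @ values)

    def predict(self, states):
        return self._features(states) @ self.w


def ensemble_value(members, state):
    """Mean predicted value and precision (= 1/variance) across the ensemble."""
    preds = np.array([m.predict(state[None, :])[0] for m in members])
    return preds.mean(), 1.0 / (preds.var() + 1e-8)


if __name__ == "__main__":
    # Toy value function, observed only near the origin of a 4-d state space.
    states = rng.normal(size=(1000, 4))
    values = np.sin(states @ np.array([1.0, -0.5, 0.25, 2.0]))

    members = [RandomFeatureValueModel(state_dim=4) for _ in range(8)]
    for m in members:
        m.fit(states, values)

    for name, s in [("in-distribution state", rng.normal(size=4)),
                    ("far out-of-distribution state", rng.normal(size=4) * 25)]:
        value, precision = ensemble_value(members, s)
        print(f"{name:30s} value={value:+.2f}  precision={precision:.3g}")
```

Since each member extrapolates differently outside the training distribution, the precision at the far-out state should typically come out much lower than at the in-distribution one, which is exactly the signal you would use to mark it “unsafe”.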
I have had the idea that the discount factor used in decision theory and RL could be based on the precision of the predictions rather than on a constant gamma raised to the power of the number of time steps. That way, a plan with high expected value but low precision (high ensemble variance) might end up weighted comparably to a plan with lower expected value but higher precision (lower ensemble variance). This would hopefully keep the AI from pursuing dangerous plans that lie far outside the trusted region of state space, while steering it toward plans with stable long-term positive outcomes and away from plans with stable long-term negative outcomes.
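Here is a sketch of that precision-based discounting idea, under the same assumptions as above (the mapping confidence = 1/(1 + variance) and the function name are choices I made up for illustration): each step of a plan is discounted by the running product of per-step confidences, playing the role that gamma**t plays in the standard setup, so value predicted in poorly-modelled states is automatically down-weighted.

```python
def precision_discounted_return(step_rewards, step_variances):
    """Discount each step by the running product of per-step confidences,
    which plays the role of gamma**t but is driven by prediction precision
    (step_variances would come from something like ensemble_value above)."""
    discount, total = 1.0, 0.0
    for reward, var in zip(step_rewards, step_variances):
        discount *= 1.0 / (1.0 + var)   # made-up mapping from variance to confidence
        total += discount * reward
    return total

# A high-value plan through poorly-modelled states vs. a modest plan
# through well-modelled ones (numbers invented for illustration):
risky  = precision_discounted_return([5.0, 5.0, 5.0], [0.5, 4.0, 20.0])
steady = precision_discounted_return([2.0, 2.0, 2.0], [0.01, 0.02, 0.05])
print(f"risky plan:  {risky:.2f}")   # ~4.0
print(f"steady plan: {steady:.2f}")  # ~5.8, wins despite lower nominal rewards
```

With these made-up numbers the steady plan scores higher even though its nominal rewards are smaller, which is the behavior I was gesturing at: high expected value cannot buy its way past persistent model disagreement.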