These are definitely reasonable things to think about.
For my part, I get kinda stuck right at your step #1. Like, say you give the AGI access to YouTube and tell it to build a predictive model (i.e. do self-supervised learning). It runs for a while and winds up with a model of everything in the videos—people doing things, balls bouncing, trucks digging, etc. etc. Then you need to point to a piece of this model and say “This is human behavior” or “This is humans intentionally doing things”. How do we do that? How do we find the right piece of the model? So I see step #1 as a quite hard and rich problem.
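To make that concrete, here’s a toy sketch of the self-supervised step (assumptions on my part: PyTorch, random tensors standing in for video frames, a tiny predictor network—none of this is from your proposal). The point isn’t the architecture; it’s that the training signal is just “predict the next frame”, so nothing in the learned representation comes labeled “human behavior” or “intentional action”:

```python
import torch
import torch.nn as nn

# Toy stand-in for "watch video, predict what comes next."
# Frames are flattened 64x64 grayscale images; the "dataset" is random noise,
# purely so the script runs end to end.
frames = torch.rand(1000, 64 * 64)

class NextFramePredictor(nn.Module):
    def __init__(self, dim=64 * 64, latent=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, latent), nn.ReLU())
        self.decoder = nn.Linear(latent, dim)

    def forward(self, frame):
        z = self.encoder(frame)      # the learned "model of the world" lives in z
        return self.decoder(z), z    # predicted next frame, plus the latent

model = NextFramePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    idx = torch.randint(0, len(frames) - 1, (32,))
    pred, z = model(frames[idx])
    loss = nn.functional.mse_loss(pred, frames[idx + 1])  # self-supervised target
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, z encodes whatever regularities help predict the next frame
# (people, balls, trucks, lighting, camera shake...), but no coordinate of z is
# labeled "human behavior" or "intentional action" -- that's the pointing problem.
```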
Then step #2 is also hard, especially if you don’t have a constraint on what the causal nodes will wind up looking like (e.g. is a person a node? A person at a particular moment? A subagent? A neuron? This could tie into how step #1 works.)
Step #2 also seems (maybe?) to require understanding how brains work (e.g. what kind of data structure is “knowledge”?), and if you have that, then you can use very different approaches (like section 5 here).
What’s the motivation behind #4?
Using a pointer to human values rather than human values themselves suffers from issues of the AI resisting attempts to re-orient the pointer, which is what the self-consistency parts of this method are there for.
I’m not sure I follow. If the AGI thinks “I want human flourishing, by which I mean blah blah blah”, then it will by default resist attempts to make it actually want a slightly different operationalization of human flourishing. Unless “wanting human flourishing” winds up incentivizing corrigibility. Or I guess in general, I don’t understand how you’re defining pointer-like vs non-pointer-like goals, and why one tends to incentivize corrigibility more than the other. Sorry if I’m being dense.
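For concreteness, here’s the only version of that distinction I can picture, as a toy sketch (entirely my own framing, so it may not be what you mean): a “direct” goal bakes in one operationalization at training time, while a “pointer-like” goal re-queries some external source of “what humans value” at decision time. It shows the mechanical difference, but not why the pointer version would be any less inclined to resist having that external source edited, which is the part I’m not following:

```python
# Toy contrast between a "direct" goal and a "pointer-like" goal.
# All names here are hypothetical illustrations, not anyone's actual proposal.

def human_values_v1(outcome: str) -> float:
    return 1.0 if outcome == "flourishing_as_operationalized_today" else 0.0

def human_values_v2(outcome: str) -> float:
    return 1.0 if outcome == "flourishing_as_operationalized_later" else 0.0

class DirectAgent:
    def __init__(self):
        self.utility = human_values_v1          # frozen copy, baked in at training time

    def score(self, outcome: str) -> float:
        return self.utility(outcome)            # later edits to the external source never reach it

class PointerAgent:
    def __init__(self, value_source: dict):
        self.value_source = value_source        # mutable reference, not a frozen copy

    def score(self, outcome: str) -> float:
        return self.value_source["current"](outcome)

values = {"current": human_values_v1}
direct, pointer = DirectAgent(), PointerAgent(values)

values["current"] = human_values_v2             # "re-orienting the pointer"
print(direct.score("flourishing_as_operationalized_later"))   # 0.0 -- goal unchanged
print(pointer.score("flourishing_as_operationalized_later"))  # 1.0 -- follows the update
```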
By the way, I’m not deeply involved in IRL / value learning (as you might have noticed from this post). You might consider posting a top-level post with what you’re thinking about, to get a broader range of feedback, not just my own idiosyncratic not-especially-well-informed thoughts.
Thanks for sharing!