A fairly vague idea for corrigible motivation which I’ve been toying with goes something along the lines of:
1: Have the AI model human behaviour
2: Have this model split the causal nodes governing human behaviour into three boxes: Values, Knowledge, and Other Stuff (with Other Stuff being things like random impulses which cause behaviour, revealed-but-not-endorsed preferences, etc.). This is the difficult bit; I think that using tools like psychology/neurology/evolution we can get around the no-free-lunch theorems.
3: Have the model keep the values, improve on the knowledge, and throw out the other stuff.
4: Enforce some kind of reflective consistency. I don’t know exactly how this would work, but something along the lines of “re-running the algorithm with oversight from the output algorithm shouldn’t lead to a different output algorithm”. This is also possibly difficult: if something ends up in the “Values” box, it’s not clear whether it might get stuck there, so local attractors of values are a problem. (A rough sketch of what this check might look like is below.)
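For concreteness, here is a very rough sketch of how steps 2–4 might fit together. Everything here is hypothetical: the node type and the classify function are stand-ins, and classify is doing all of the hard step-2 work; the point is just that step 4 can be phrased as a fixed-point check.

```python
from typing import Callable

# Hypothetical stand-ins: a Node is whatever unit the behaviour model exposes,
# and a Classifier is the (hard, unspecified) step-2 split into
# "value" / "knowledge" / "other".
Node = str
Classifier = Callable[[Node], str]

def extract_values(model: list[Node], classify: Classifier) -> list[Node]:
    """Steps 2-3: keep the value nodes, drop knowledge and 'other stuff'.
    (Improving on the knowledge would happen elsewhere; omitted here.)"""
    return [n for n in model if classify(n) == "value"]

def reflectively_consistent(model: list[Node],
                            classify: Classifier,
                            rerun_with_oversight: Callable) -> bool:
    """Step 4 as a fixed-point condition: re-running the extraction under
    oversight from its own output should reproduce that same output."""
    values = extract_values(model, classify)
    return rerun_with_oversight(model, classify, values) == values
```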
This is something like inverse reinforcement learning but with an enforced prior on humans not being fully rational or strategic. It also might require an architecture which is good at breaking down models into legible gears, which NNs often fail at unless we spend a lot of time studying the resulting NN.
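To make the “enforced prior on humans not being fully rational” slightly more concrete, here’s a minimal sketch in the style of Boltzmann-rational IRL, where a finite inverse temperature explicitly allows for noisy/irrational action choice. The Q-values, demo pairs, and beta value are toy illustrations, not part of the original idea.

```python
import numpy as np

def boltzmann_policy(q_values: np.ndarray, beta: float) -> np.ndarray:
    """P(action | state) under Boltzmann (noisy) rationality.

    q_values: shape (n_states, n_actions), for one candidate reward function.
    beta: inverse temperature -- a finite beta encodes the prior that humans
    are only boundedly rational, so suboptimal actions keep probability mass.
    """
    logits = beta * q_values
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def demo_log_likelihood(q_values: np.ndarray, demos, beta: float) -> float:
    """Score a candidate reward (via its Q-values) against observed
    (state, action) pairs; IRL would search over rewards to maximise this."""
    probs = boltzmann_policy(q_values, beta)
    return float(sum(np.log(probs[s, a]) for s, a in demos))

# Toy usage: 2 states, 3 actions, a few demonstrated (state, action) pairs.
q = np.array([[1.0, 0.2, 0.0],
              [0.0, 0.5, 1.5]])
demos = [(0, 0), (1, 2), (0, 1)]  # (0, 1) is a "mistake" the soft model tolerates
print(demo_log_likelihood(q, demos, beta=2.0))
```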
Using a pointer to human values, rather than human values themselves, suffers from the issue of the AI resisting attempts to re-orient the pointer, which is what the self-consistency parts of this method are there to address.
This approach was mostly born out of considering the “the AI knows we will fight it, and therefore knows we must have messed up its alignment, but doesn’t care because we messed up its alignment” situation. My hope is also that it can leverage the human-modelling parts of the AI to our advantage. Modelling humans does run into “mind-crime” issues, though, so we ought to be careful there too.
Thanks for sharing! These are definitely reasonable things to think about.
For my part, I get kinda stuck right at your step #1. Like, say you give the AGI access to YouTube and tell it to build a predictive model (i.e. do self-supervised learning). It runs for a while and winds up with a model of everything in the videos—people doing things, balls bouncing, trucks digging, etc. Then you need to point to a piece of this model and say “this is human behavior” or “this is humans intentionally doing things”. How do we do that? How do we find the right piece of the model? So I see step #1 as a quite hard and rich problem.
Then step #2 is also hard, especially if you don’t have a constraint on what the causal nodes will wind up looking like (e.g. is a person a node? A person at a particular moment? A subagent? A neuron? This could tie into how step #1 works.)
#2 also seems (maybe?) to require understanding how brains work (e.g. what kind of data structure is “knowledge”?) and if you have that then you can use very different approaches (like section 5 here).
What’s the motivation behind #4?
Using a pointer to human values, rather than human values themselves, suffers from the issue of the AI resisting attempts to re-orient the pointer, which is what the self-consistency parts of this method are there to address.
I’m not sure I follow. If the AGI thinks “I want human flourishing, by which I mean blah blah blah”, then it will by default resist attempts to make it actually want a slightly different operationalization of human flourishing. Unless “wanting human flourishing” winds up incentivizing corrigibility. Or I guess in general, I don’t understand how you’re defining pointer-like vs non-pointer-like goals, and why one tends to incentivize corrigibility more than the other. Sorry if I’m being dense.
By the way, I’m not deeply involved in IRL / value learning (as you might have noticed from this post). You might consider posting a top-level post with what you’re thinking about, to get a broader range of feedback, not just my own idiosyncratic not-especially-well-informed thoughts.