What judgments about which of our revealed preferences are biases and which are “true” preferences should we try to put into the initial conditions of a value learning process?
How should we elicit higher-order preferences from human models, and how should we translate them into changes in the parameters of the value learning procedure?
I realize that being able to infer preferences from human actions could be a really useful ability. I’m curious how important you think that is compared to direct conversations with the AI about values. (Or do you consider conversations simply a subset of the data an AI could infer preferences from?)
It occurred to me that since language models are getting pretty good, there is now an interesting opportunity to start experimenting with such human-and-AI conversations about values. That thought led me to have the following dialogue with OpenAI Codex just now:
[Me:]
Here are some things I value:
- I like exercise
- I like getting to work / do something productive during the day
- I like going for a walk at the end of the day
- I like having dinner with my wife in the evening
- I like getting enough sleep
What are my values?
[Codex:]
- I like staying active and being productive
- I like spending time with my wife and family
- I like getting a good night’s sleep
[Me:]
Ok that’s pretty good. But I remembered I also like watching a show before bed. So what are my values now?
[Codex:]
- I like staying active, being productive, and spending time with my wife and family
- I like getting a good night’s sleep
[Me:]
But you forgot the part about watching a show before bed...
[Codex:]
- I like staying active, being productive, spending time with my wife and family, and watching a show before bed
[Me:]
Ok that sounds good
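In case anyone wants to try something similar, here is a minimal sketch of how a loop like this could be run programmatically. It assumes the legacy openai Python client (pre-1.0 Completion API); the model name, prompt framing, and stop sequence are illustrative assumptions, not exactly what produced the transcript above.

```python
# Minimal sketch only: a value-clarification loop using the legacy openai
# Python client (pre-1.0 Completion API). Model name, prompt framing, and
# stop sequence are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"

def codex_reply(transcript: str) -> str:
    """Ask the model to restate the user's values, given the conversation so far."""
    response = openai.Completion.create(
        model="code-davinci-002",   # hypothetical choice of Codex-family model
        prompt=transcript + "[Codex:]\n",
        max_tokens=150,
        temperature=0.0,
        stop=["[Me:]"],             # stop before the model writes the next user turn
    )
    return response["choices"][0]["text"].strip()

transcript = (
    "[Me:]\n"
    "Here are some things I value:\n"
    "- I like exercise\n"
    "- I like getting to work / do something productive during the day\n"
    "- I like going for a walk at the end of the day\n"
    "- I like having dinner with my wife in the evening\n"
    "- I like getting enough sleep\n"
    "What are my values?\n"
)
print(codex_reply(transcript))

# Corrections ("But you forgot the part about watching a show before bed...")
# can be appended to the transcript before calling codex_reply again.
```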
Personally, I feel more comfortable with the idea of an AGI that I could stop at any time to have a conversation like this to clarify what my values are. I imagine a lot could be clarified about higher-order preferences this way too, although sometimes the AI might need to read between the lines (e.g. “I know I said before that I want to quit smoking, but now I really want a cigarette!”).
Do you agree that this is a mode of operation we should be aiming for? Perhaps the issue is that this seems reasonable for toy examples like the one above, but we need to worry about a future where the AGI is dealing with issues so sophisticated, or so removed from our everyday experience, that we won’t know how to give it guidance?
Overall, I found your research plan clear and easy to follow. It was a bit long (as it probably should be), so it took me a while to find the time to read it thoroughly. I’ve always been attracted to the value learning / CIRL / etc. general direction of research and find it intuitively compelling, so I’m excited to see what comes of your work here.
If we end up having a separation between initial “seed” meta-preferences and the general data used to learn human preferences, then I think it’s possible that part of the seed could be text that gets interpreted as authoritative about our preferences in a privileged sort of way (at least at first).
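To make that concrete, here is a purely illustrative sketch of one way that separation could be represented; every name below is hypothetical, and nothing in the research plan commits to this structure.

```python
# Purely illustrative sketch of a "seed vs. general data" separation.
# All class and field names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PreferenceDatum:
    source: str   # e.g. "observed_behavior", "conversation", "survey"
    content: str

@dataclass
class Seed:
    # Text treated as authoritative about our (meta-)preferences, at least initially.
    privileged_text: List[str] = field(default_factory=list)
    # How strongly the privileged text is weighted against conflicting inferred preferences.
    initial_trust: float = 1.0

@dataclass
class ValueLearningInputs:
    seed: Seed
    general_data: List[PreferenceDatum] = field(default_factory=list)
```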
I’m not sure a conversation is quite the right picture, though. There might be some active learning on the AI’s part, but that won’t have to flow like a human conversation and it’s going to be augmented by a lot of feedback from interpretability tools.