Thanks for this response. I heard a similar discussion recently, with someone talking about whether an algorithm’s reward function was activated because it got the answer correct or because it knew that was what the programmers wanted it to do. It’s hard to tell, since the decision-making pathways are not always transparent, especially with more complex machine learning.
The inner optimizer thing is really interesting; I hadn’t heard it coined like that before. Is it in AI’s interest (a big assumption that it has interests at all, I know) to become so human-specific that it loses its ability to generalize? Variability would decrease in the population and the probability mechanisms of machine learning would approach certainty, thus rendering the AI basically ineffective.
Is it in AI’s interest (a big assumption that it has interests at all, I know) to become so human-specific that it loses its ability to generalize?
There’s an approach called learning the prior through imitative generalization that seems to me a promising way to address this problem. The most relevant quotes from that article:
We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if it knows the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.
What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.