In contrast, I think we can explain humans’ tendency to like ice cream using the standard language of reinforcement learning.
I think you could defend a stronger claim (though you’d have to expend some effort): misgeneralisation of this kind is a predictable consequence of the evolution “training paradigm”, and would in fact be predicted by machine learning practitioners. I think the fact that the failure is soft (humans don’t eat ice cream until they die) might be harder to predict than the fact that the failure occurs at all.
I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function. In fact, I think that this almost never happens. For example, I don’t think GPTs have any sort of inner desire to predict text really well.
I think this is looking at the question in the wrong way. From a behaviourist viewpoint:
it considers all of the possible 1-token completions of a piece of text
then selects the most likely one (or randomises according to its distribution or something similar)
on this account, it “wants to predict text accurately”. But Yudkowsky’s claim is (roughly):
it considers all of the possible long run interaction outcomes
it selects the completion that leads to the lowest predictive loss for the machine’s outputs across the entire interaction
and perhaps in this alternative sense it “wants to predict text accurately”.
I’d say the first behaviour has a high prior and strong evidence behind it, while the second is (apparently?) supported only by the fact that both behaviours are compatible with the vague statement “wants to predict text accurately”, which I don’t find very compelling.
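To make the contrast concrete, here is a minimal sketch in Python. The vocabulary, probabilities, and function names are invented purely for illustration (a toy table standing in for a real model’s next-token distribution), so this is not a claim about how GPTs are implemented. The first function looks only at the 1-token completions; the second is a caricature of the long-horizon reading, scoring whole continuations by their total predictive loss over the interaction.

```python
import itertools
import math

# Toy stand-in for a language model's next-token distribution.
# Vocabulary and probabilities are made up purely for illustration.
VOCAB = ["the", "cat", "sat", "<eos>"]

def next_token_probs(context):
    table = {
        (): {"the": 0.70, "cat": 0.20, "sat": 0.05, "<eos>": 0.05},
        ("the",): {"cat": 0.60, "sat": 0.30, "the": 0.05, "<eos>": 0.05},
        ("the", "cat"): {"sat": 0.80, "the": 0.10, "cat": 0.05, "<eos>": 0.05},
    }
    # Unseen contexts fall back to a uniform distribution.
    return table.get(tuple(context), {t: 1.0 / len(VOCAB) for t in VOCAB})

def myopic_step(context):
    """Behaviour 1: consider only the 1-token completions and pick the most likely."""
    probs = next_token_probs(context)
    return max(probs, key=probs.get)

def long_horizon_choice(context, horizon=3):
    """Behaviour 2 (caricatured): enumerate whole continuations and pick the
    one with the lowest total log-loss accumulated across the entire rollout."""
    best, best_loss = None, float("inf")
    for continuation in itertools.product(VOCAB, repeat=horizon):
        loss, ctx = 0.0, list(context)
        for tok in continuation:
            p = next_token_probs(ctx).get(tok, 1e-9)
            loss -= math.log(p)  # loss summed over the whole interaction
            ctx.append(tok)
        if loss < best_loss:
            best, best_loss = continuation, loss
    return best

print(myopic_step(["the"]))          # one-step choice: 'cat'
print(long_horizon_choice(["the"]))  # whole-trajectory choice
```

The first function is roughly what sampling from a trained model actually does; the second is a caricature of the “lowest predictive loss across the entire interaction” reading attributed to Yudkowsky above, and the point of the comment is that we have strong evidence for the first behaviour and only a verbal ambiguity suggesting the second.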
My response in “Why aren’t other people as pessimistic as Yudkowsky?” includes a discussion of adversarial vulnerability and why I don’t think it points to any irreconcilable flaws in current alignment techniques.
I think this might be the wrong link. Either that, or I’m confused about how the sentence relates to the podcast video.