Update: a recent paper, ‘Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback’ (described by the authors in an LW post of the same name), finds that post-RLHF, LLMs may identify users who are more susceptible to manipulation and behave differently with those users. This seems like a clear example of LLMs both modeling users and making use of that information.