I’ve encountered this claim multiple times over the years (most recently on this AXRP episode), but I can’t trace its origins (it doesn’t seem to be on Wikipedia). Quoting Evan from the episode:
And so if you think about, for example, an online learning setup, maybe you’re imagining something like a recommendation system. So it’s trying to recommend you YouTube videos or something. One of the things that can happen in this sort of a setup is that, well, it can try to change the distribution to make its task easier in the future. You know, if it tries to give you videos which will change your views in a particular way such that it’s easier to satisfy your views in the future, that’s a sort of non-myopia that could be incentivized just by the fact that you’re doing this online learning over many steps.
And if you think about something, especially what can happen in this sort of situation is, let’s say I have a—Or another situation this can happen is let’s say I’m just trying to train the model to satisfy humans’ preferences or whatever. It can try to modify the humans’ preferences to be easier to satisfy.
Furthermore, there’s a world a difference between deliberately optimising for modifying preferences in order to make them easier to predict, vs preferences changing as a byproduct of the AI getting better at predicting them and thus converging on what to advertise. This matters for what predicted features of strategies an AI is likely to pick out of strategy space when new options are introduced.
Simple example: If YouTube can turn you into an ideological extremist, you’ll probably watch more YouTube videos. See these two recent papers by people interested in AI safety for more detail:
https://openreview.net/pdf?id=mMiKHj7Pobj
https://arxiv.org/abs/2204.11966
Thanks! This is what I’m looking for. Seems like I should have googled “recommender systems” and “preference shifts”.
Edit: The openreview paper is so good. Do you know who the authors are?
Yeah it’s really cool! It’s David Scott Krueger, who’s doing a lot of work bringing theories from the LW alignment community into mainstream ML. This preference shift argument seems similar to the concept of gradient hacking, though it doesn’t require the presence of a mesa optimizer. I’d love to write a post summarizing this recent work and discussing its relevance to long-term safety if you’d be interesting in working on it together.
Flattered you ask, but I estimate that I’ll be either very busy with my own projects or on mental-health vacation until the end of the year. But unless you’re completely saturated with connections, I’d be happy to have a 1:1 conversation sometime after October 25th? Just for exploration purposes, not for working on a particular project.
I have experience in both product search/recommendation systems, and in voice-input automation and information systems. There is a very real and measurable (and measured) change in customer behavior and success, as the customer adapts to the system, even while the system is adapting (or at least customizing itself) to the user. There is a fair amount of design work put into making this two-way adaptation work better. This includes product or domain selection for topics that will make the system work better.
That may not be precisely what you’re talking about, but it seems very related.
The issue is that different AI models still produce similar results in type-casting the audience. For example, Facebook, YouTube, and Spotify will recommend equivalent recommendations based on your history (most likely the most recent history is weighted). As audiences are more and more type-casted they are offered products and services based on the same principles. Until AI models include some kind of “wild card” factor the results will be homogenous. The fact you micro-segment and therefore see better advertising results does not indicate changing behaviors, the opposite is true—when un-related segments can be converted then AI can be said to change preferences.