What is more, the change that the population undergoes is shaped in such a way that it tends towards making the values more predictable.
(...)
As a result, a firm’s steering power will specifically tend towards making the predicted behaviour easier to predict, because it is this predictability that the firm is able to exploit for profit (e.g., via increases in advertisement revenues).
A small misconception that lies at the heart of this section is that AI systems (and specifically recommenders) will try to make people more predictable. This is not necessarily the case.
For example, one could imagine incentives to modify someone’s values so that they become more unpredictable (changing constantly within some subset of the value-space), but within a region of that space that leads to much higher reward for any AI action.
Moreover, most recommender systems (given that they only optimize instantaneous engagement) don’t really optimize for making people more predictable, and can’t reason about changing the human’s long-term predictability. In fact, most recommender systems today are “myopic”: their objective is a one-timestep optimization that won’t account for much change in the human, and can essentially be thought of as ~”let me find the single content item X that maximizes the probability that you’d engage with X right now”. This often doesn’t have much to do with long-term predictability: clickbait will often maximize the current chance of a click but might make you more unpredictable later.
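To make “myopic” concrete, here is a minimal sketch of that one-timestep decision rule (the engagement model and all names are hypothetical, not any specific platform’s system): the objective is a single argmax over predicted immediate engagement, with no term for how the user, or their predictability, changes afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a user state vector and a catalog of content items.
n_items, dim = 100, 8
items = rng.normal(size=(n_items, dim))   # item feature vectors
user_state = rng.normal(size=dim)         # current user representation

def engagement_prob(user, item):
    """Toy model of P(engage right now): logistic in user-item affinity."""
    return 1.0 / (1.0 + np.exp(-user @ item))

# Myopic recommendation: pick the single item X maximizing the probability of
# engagement *right now*. Nothing in this objective refers to how the user's
# state (or their long-run predictability) evolves after the recommendation.
scores = np.array([engagement_prob(user_state, x) for x in items])
best_item = int(np.argmax(scores))
print(f"recommend item {best_item} with p(engage) = {scores[best_item]:.3f}")
```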
For example, in the case of recommendation platforms, rather than finding an increased heterogeneity in viewing behaviour, studies have observed that these platforms suffer from what is called a ‘popularity bias’, which leads to a loss of diversity and a homogenisation in the content recommended (see, e.g., Chechkin et al. (2007), DiFranzo et al. (2017), & Hazrati et al. (2022)). As such, predictive optimisers impose pressures towards making behaviour more predictable, which, in reality, often imply pressures towards simplification, homogenisation, and/or polarisation of (individual and collective) values.
Related to my point above (and this quoted paragraph), a fundamental nuance here is the distinction between “accidental influence side effects” and “incentivized influence effects”. I’m happy to answer more questions on this difference if it’s not clear from the rest of my comment.
Popularity bias and homogenization have mostly been studied as common accidental influence side effects: even if you just optimize for instantaneous engagement, in practice this homogenization effect often seems to occur, but there’s not a sense that the AI system is “trying to bring homogenization about” – it just happens incidentally, similarly to how introducing TV will change the dynamics of how people produce and consume information.
I think most people’s concern about AI influencing us (and our values) comes instead from incentivized influence: the AI “planning out” how to influence us in ways that are advantageous to its objective, and actively trying to change people’s values because of manipulation incentives that emerge from the optimization [3, 8]. For instance, various works [1-2] have shown that recommenders which optimize long-term engagement via RL (or other forms of ~planning) will have these kinds of incentives to manipulate users (potentially by making them more predictable, but not necessarily).
Regarding grounding the discussion of “mechanisms causing illegitimate value change”: I do think that it makes sense to talk about performative power as a measure of how much a population can be steered, and why we would expect firms to have incentives to intentionally try to steer user values. However, imo performative power is more an issue of AI policy, misuse, and mechanism design (to discourage firms from trying to cause value change for profit), rather than the “core mechanism” of the VCP.
In part because of this, imo performative prediction/power seem like a potentially misleading lens to analyze the VCP. Here are some reasons why I’ve come to think so:
The lens of performative power suggests that the problem mostly has to do with the conscious choices of misaligned, profit-maximizing firms. In fact, even with completely benevolent firms, it would still be unclear how to avoid the issue: the VCP will remain an issue even in settings of full alignment between the system designer and the user, because of the fundamental difficulties in specifying exactly what kinds of value changes should be considered legitimate or illegitimate. Indeed, the line of work about incentivized influence effects [1-5] shows that even with the best intentions, and without the designers intentionally trying to bring about changes, AI systems can learn to systematically and “intentionally” induce illegitimate shifts, because of objective misspecification arising from the core issue of the VCP – distinguishing between legitimate and illegitimate changes.
Performative prediction and power are mostly focused on firms that are trying to solve sequential decision problems (i.e. multi-timestep interactions, where the algorithm’s choices affect users’ future behavior) with algorithms that optimize over only the next timestep’s outcomes. Mathematically, performative power can be thought of as a measure of how much a firm can shift users in a single timestep if it chooses to do so. The steering analysis with ex-ante and ex-post optimization only performs a one-timestep lookahead, which isn’t a natural formalism for the multi-timestep nature of value change. Instead, the RL formalism automatically solves the multi-timestep equivalent of the ex-post optimization problem: in RL training, the human’s adaptation to the AI is already factored into how the AI should make decisions in order to maximize the multi-timestep objective. In short, the lens of RL is strictly more expressive than that of performative prediction (see the rough formal contrast after this list).
I expect most advanced AI systems to be trained on multi-timestep objectives (explicitly or implicitly), making the performative power framework less naturally applicable (because it was developed with single-timestep objectives in mind). When imagining an AI assistant that might significantly change one’s values in illegitimate ways, the most likely story in my head is that it was trained on multi-timestep objectives (by doing some form of RL / planning) – this is the only way one can hope to go beyond human performance (relative to imitation), so there will be strong incentives to use this kind of training across the board. In fact, many recommender systems are already trying to use multi-timestep objectives with RL [7].
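To make the contrast in the last two points slightly more concrete, here is a rough formalization (my own notation, not the post’s or the papers’). Writing $s_t$ for the user’s state (values included), $a_t$ for the system’s action, $r$ for the specified reward/engagement signal, and $T$ for the dynamics by which the user adapts, the one-timestep style of optimization and the RL objective are roughly:

$$\pi_{\text{myopic}} \in \arg\max_{\pi} \ \mathbb{E}\big[\, r(s_t, a_t) \,\big], \qquad \pi_{\text{RL}} \in \arg\max_{\pi} \ \mathbb{E}\Big[ \textstyle\sum_{k \ge 0} \gamma^{k}\, r(s_{t+k}, a_{t+k}) \Big],$$

where, in the RL case, the expectation is over trajectories with $s_{t+k+1} \sim T(\cdot \mid s_{t+k}, a_{t+k})$: the user’s adaptation (value shifts included) is just part of the dynamics the planner optimizes through, which is what makes the multi-timestep lens strictly more expressive here.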
The story seems a lot cleaner (at least in my head) from the perspective of sequential decision problems and RL [1-5], which makes far fewer assumptions about the nature of the interaction. It goes something like this (even in the best case, in which we assume a system designer aligned with the user):
We will make our best attempt at operationalizing our long-term objectives, but we will specify the rules for value changes incorrectly unless we solve the VCP
We will optimize AI assistants / agents with such a mis-specified objective in environments which include humans. This is a sequential decision problem, and we will try to solve it via some form of approximate planning or RL-like methods
By optimizing a multi-timestep objective, we will obtain agents that do what ~RL agents do: they try to change the state of the world in ways that lead to high-reward areas of the state space. It just so happens in this case that the human is part of the state of the world, and that we’re not very good at specifying what changes to the human’s values are legitimate or illegitimate
This is how you get illegitimate preference change (as a form of reward hacking): the human’s values are shifted to whatever settings are most advantageous for the reward as defined (see the toy sketch after this list)
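Here is a toy sketch of that failure mode (a hypothetical two-state MDP of my own, not an example from the cited papers): the human’s preference profile is part of the state, the reward is a mis-specified proxy (instantaneous engagement under the human’s current preferences), and multi-timestep planning, unlike the myopic baseline, chooses to shift the human into the high-reward region of state space.

```python
import numpy as np

# Hypothetical toy MDP. States encode the human's current preference profile:
#   0 = broad interests, 1 = narrowed (clickbait-responsive).
# Actions: 0 = recommend diverse content, 1 = recommend clickbait.
# R[s, a] is the mis-specified proxy reward: engagement under current preferences.
R = np.array([[0.6, 0.4],    # human with broad interests
              [0.2, 1.0]])   # human whose interests have narrowed

# T[s, a, s']: clickbait slowly narrows a broad-interest human;
# diverse content occasionally broadens a narrowed one.
T = np.array([[[1.0, 0.0], [0.7, 0.3]],
              [[0.2, 0.8], [0.0, 1.0]]])

gamma = 0.95
V = np.zeros(2)
for _ in range(1000):                 # value iteration (multi-timestep planning)
    Q = R + gamma * T @ V             # Q[s, a] = r + gamma * E[V(s')]
    V = Q.max(axis=1)

myopic_policy = R.argmax(axis=1)      # one-timestep lookahead
rl_policy = Q.argmax(axis=1)          # planning through the preference dynamics

print("myopic action in 'broad' state:", myopic_policy[0])  # 0 -> diverse
print("RL action in 'broad' state    :", rl_policy[0])      # 1 -> clickbait
# The planner shifts the human's preferences into the high-reward region of
# state space, because nothing in the reward marks that shift as illegitimate:
# the preference change is just another transition to exploit.
```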
On another note, in some of our work [1] we propose a way to ground a notion of value-change legitimacy based on counterfactual preference evolution (what we call “natural preference shifts”). While it’s not perfect (in part also because it’s challenging to implement computationally), I believe it could limit some of the main potential harms we are worried about, and might be of interest to you.
The idea behind natural preference shifts is to consider “what would the person’s values have been without the actions of the AI system”, and to evaluate the AI’s actions based on such counterfactual preferences rather than the person’s current ones. This ensures that the AI won’t drive the person to internal states that they would have judged negatively according to their counterfactual preferences. While this might prevent the AI from inducing some beneficial, legitimate preference shifts (as they wouldn’t have happened without the AI), it at least guarantees that the effect of the system is not arbitrarily bad. For an alternate description of natural preference shifts, you can also see [3].
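As a very rough illustration of the evaluation idea (my own toy simplification, not the estimation procedure from [1]; all names and dynamics below are made up), the key move is to roll out a “no-AI” counterfactual preference trajectory and score the system’s actions against it rather than against the preferences it has itself induced:

```python
import numpy as np

rng = np.random.default_rng(0)

def natural_drift(prefs):
    """How the person's preferences would evolve *without* the AI acting."""
    return prefs + 0.01 * rng.normal(size=prefs.shape)

def influenced_drift(prefs, action):
    """How preferences evolve when the AI recommends `action` (an item vector)."""
    return prefs + 0.1 * (action - prefs)   # the AI pulls preferences toward the item

def engagement(prefs, action):
    return float(prefs @ action)

dim, horizon = 4, 20
prefs = rng.normal(size=dim)
counterfactual = prefs.copy()
actions = [rng.normal(size=dim) for _ in range(horizon)]

induced_return, counterfactual_return = 0.0, 0.0
for a in actions:
    counterfactual = natural_drift(counterfactual)   # no-AI preference trajectory
    prefs = influenced_drift(prefs, a)               # actual (influenced) trajectory
    induced_return += engagement(prefs, a)           # scores the AI by the prefs it induced
    counterfactual_return += engagement(counterfactual, a)  # scores it by natural prefs

print(f"return under induced preferences       : {induced_return:.2f}")
print(f"return under counterfactual preferences: {counterfactual_return:.2f}")
# Evaluating (or training) against the second quantity removes the incentive to
# shift the person's preferences toward whatever makes the AI's actions look good,
# at the cost of having to model the counterfactual dynamics.
```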
Sorry for the very long comment! Would love to chat more, and see the full version of the paper – feel free to reach out!
Related to my point above (and this quoted paragraph), a fundamental nuance here is the distinction between “accidental influence side effects” and “incentivized influence effects”. I’m happy to answer more questions on this difference if it’s not clear from the rest of my comment.
Thanks for clarifying; I agree it’s important to be nuanced here!
I basically agree with what you say. I’d also add: whether it is best counted as a side effect or as incentivized depends on which optimizer we’re looking at, i.e. where you draw the boundary around the optimizer in question. I agree that a) at the moment, recommender systems are myopic in the way you describe, and the larger economic logic is where much of the pressure towards homogenization comes from (while other things are happening too, including humans pushing back against that pressure to some extent, more or less successfully); and b) at some limit, we might be worried about an AI system becoming so powerful that the scope of its optimization grows large enough that it is correctly understood as directly doing incentivized influence. But I also want to point out a third scenario, c) where we should be worried about what is essentially incentivized influence, yet not all of the causal force/optimization has to be enacted from within the boundaries of a single, specific AI system; rather, the economy as a whole is sufficiently integrated with and accelerated by advanced AI to justify the incentivized-influence frame (e.g. à la the ascended economy, or a fully automated tech company singularity). I think the general pattern here is basically one of “we continue to outsource ever more consequential decisions to advanced AI systems, without having figured out how to make these systems reliably (not) do anything in particular”.
A small misconception that lies at the heart of this section is that AI systems (and specifically recommenders) will try to make people more predictable. This is not necessarily the case.
Yes, I’d agree (and didn’t make this clear in the post, sorry) -- the pressure towards predictability comes from a combination of the logic of performative prediction AND the “economic logic” that provides the context in which these performative predictors are being used/applied. This is certainly an important thing to be clear about!
(Though it also can only give us so much reassurance: I think it’s an extremely hard problem to find reliable ways for AI models to NOT be applied inside of the capitalist economic logic, if that’s what we’re hoping to do to avoid the legibilisation risk.)
[1] Estimating and Penalizing Induced Preference Shifts in Recommender Systems
[2] User Tampering in Reinforcement Learning Recommender Systems
[3] Characterizing Manipulation from AI Systems
[4] Hidden Incentives for Auto-Induced Distributional Shift
[5] Path-Specific Objectives for Safer Agent Incentives
[6] Agent Incentives: A Causal Perspective
[7] Reinforcement learning based recommender systems: A survey
[8] Emergent Deception and Emergent Optimization