Isn’t this a temporary solution at best? Eventually you resolve your uncertainty over the reward (or, more accurately, you get as much information as you can about the reward, potentially leaving behind some irreducible uncertainty), and then you start manipulating the target human.
I’m pretty wary of introducing potentially-false assumptions like the SVP already, and it seems particularly bad if their benefits are only temporary.
Yeah, at the end of the post I point out both the potential falsity of the SVP and the problem of updated deference. Approaches that make the agent indefinitely uncertain about the reward (or at least uncertain for longer) might help with the latter, e.g. if H is also uncertain about the reward, or if preferences are modeled as changing over time or with different contexts, etc.
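To make the "temporary benefits" worry concrete, here is a toy sketch of my own (not from the post, and not the SVP itself): the agent holds a Beta posterior over a single Bernoulli reward parameter, and the expected value of deferring to the human (i.e. observing one more human-provided label) shrinks toward zero as evidence accumulates. The reward model and all numbers are assumptions purely for illustration.

```python
import numpy as np

# Toy sketch of the updated-deference worry (illustrative only):
# the agent holds a Beta(a, b) posterior over a Bernoulli reward parameter
# theta, and the expected reduction in posterior variance from one more
# human-provided label shrinks toward zero as evidence accumulates.

def beta_var(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))

def value_of_one_more_label(a, b):
    """Expected drop in posterior variance from observing one more label."""
    p_one = a / (a + b)  # posterior predictive probability of a '1' label
    expected_var_next = p_one * beta_var(a + 1, b) + (1 - p_one) * beta_var(a, b + 1)
    return beta_var(a, b) - expected_var_next

rng = np.random.default_rng(0)
theta_true = 0.7
for n in [0, 10, 100, 1000]:
    labels = rng.random(n) < theta_true
    ones = int(labels.sum())
    print(n, value_of_one_more_label(1 + ones, 1 + n - ones))
```

Once that value is effectively zero, nothing in the objective rewards continued deference, which is why keeping the uncertainty alive for longer (or modeling preferences as changing) matters beyond the training period.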
I agree, and I’m not sure I endorse the SVP, but I think it’s the right type of solution—i.e. an assumption about the training environment that (hopefully) encourages cooperative behaviour.
I’ve found it difficult to think of a more robust/satisfying solution to manipulation (in this context). It seems like agents in a multi-polar world will just have incentives to manipulate each other, and it’s hard to prevent that.
Fundamentally you need some way of distinguishing between “manipulation” and “not manipulation”. The first guess of “manipulation = affecting the human’s brain” is not a good definition, as it basically prevents all communication whatsoever. I haven’t seen any simple formal-ish definitions that seem remotely correct.
(There’s of course the approach where you try to learn the human concept of manipulation from human feedback, and train your system to avoid that, but that’s pretty different from a formal definition based on causal diagrams.)
I liked how Rhys’s definition of manipulation specifically included the requirement of the target getting lower utility.
Therefore something like “manipulation = affecting the human’s brain in a way that will reduce their expected utility” does not classify all communication as manipulation.
As Richard points out, my definition of manipulation is “I influence your actions in a way that causes you to get lower utility”. (And we can similarly define cooperation, except with the target getting higher utility.) I can send you the formal version if you’re interested.
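For concreteness, here is one way the informal statement could be written down. This is my paraphrase in a causal, expected-utility style, not the formal version offered above; the baseline action a_0 and all of the notation are my own additions.

```latex
% One possible (unofficial) formalization of the informal definitions above.
% A is the agent's action, a_0 a baseline (e.g. a null action), \pi_H the
% human's resulting policy (or action), and U_H the human's utility.
\[
\text{manipulates}(a) \;\iff\;
\pi_H\!\left(\mathrm{do}(A{=}a)\right) \neq \pi_H\!\left(\mathrm{do}(A{=}a_0)\right)
\;\wedge\;
\mathbb{E}\!\left[U_H \mid \mathrm{do}(A{=}a)\right] < \mathbb{E}\!\left[U_H \mid \mathrm{do}(A{=}a_0)\right]
\]
\[
\text{cooperates}(a) \;\iff\;
\pi_H\!\left(\mathrm{do}(A{=}a)\right) \neq \pi_H\!\left(\mathrm{do}(A{=}a_0)\right)
\;\wedge\;
\mathbb{E}\!\left[U_H \mid \mathrm{do}(A{=}a)\right] > \mathbb{E}\!\left[U_H \mid \mathrm{do}(A{=}a_0)\right]
\]
```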
I continue to think that this classifies all communication as manipulation. Every action reduces someone’s expected utility, from Omega’s perspective.
I guess if you communicate with only one person, and you’re only looking at your effects on that person’s utility, then this does not classify all communication as manipulation. So maybe I should say that it classifies almost all communication-to-groups as manipulation.
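Here is one kind of group case that a "changed their action and lowered their utility" test can catch, with toy numbers of my own (not from the thread): an honest broadcast changes a listener's action, and combined with that listener's mistaken background belief it lowers their utility as evaluated under the true probabilities.

```python
# Toy numbers (illustrative only): an honest broadcast to a group where one
# member ends up worse off under the true probabilities, so a
# "changed their action AND lowered their utility" test flags the broadcast
# as manipulation of that member.
#
# Broadcast: "it will not rain today" (true).
# Alice acts on it correctly: she leaves her umbrella at home and saves effort.
# Carol holds the false rule "no rain => the game moves to the early slot",
# so the same true message sends her to the wrong slot and she misses the game.

def utility(listener: str, heard_message: bool) -> float:
    if listener == "Alice":
        # 0.9 if she skips the umbrella (correct, given no rain), 0.8 otherwise.
        return 0.9 if heard_message else 0.8
    if listener == "Carol":
        # 1.0 if she attends the (unchanged) late slot, 0.0 if her false rule
        # sends her to the early slot instead.
        return 0.0 if heard_message else 1.0
    raise ValueError(listener)

for listener in ["Alice", "Carol"]:
    print(listener,
          "without message:", utility(listener, False),
          "with message:", utility(listener, True))
# Alice gains (0.8 -> 0.9), Carol loses (1.0 -> 0.0): the honest broadcast
# counts as manipulating Carol under the lower-utility test.
```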