Here’s a pitch in the language of incentivizing AI systems—the paper is written in CS-econ style. Imagine you have an AI system that does two things at the same time:

1) It makes predictions about the world.

2) It takes actions that influence the world. (In the paper, we specifically imagine that the agent makes recommendations to a principal who then takes the recommended action.) Note that if the predictions are seen by humanity, they themselves influence the world. So even a pure oracle AI might satisfy 2, as has been discussed before (see the end of this comment).

We want to design a reward system for this agent such that the agent maximizes its reward by making accurate predictions and taking actions that maximize our, the principals’, utility.
The challenge is that if we reward the accuracy of the agent’s predictions, we may give the agent an incentive to make the world more predictable, which will generally not be aligned with maximizing our utility.
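As a toy illustration of that failure mode (my own numbers, not from the paper): suppose the agent is paid with a standard proper scoring rule, say the log score, for its prediction of the outcome of whichever action it recommends. A deterministic but mediocre action then earns a higher expected score than a risky action with higher expected utility, so a pure prediction reward pulls the recommendation toward predictability.

```python
import numpy as np

# Toy illustration (my own setup, not code from the paper): the agent is paid
# a standard proper scoring rule (here, the log score) for its prediction of
# the outcome of whichever action it recommends.  We assume it predicts
# truthfully and compare its expected score across two candidate actions.

def expected_log_score(probs):
    """Expected log score of a truthful forecast of a categorical outcome."""
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(probs * np.log(probs)))

# (outcome utilities for the principal, outcome probabilities)
actions = {
    "risky_but_good": ([0.0, 2.0], [0.5, 0.5]),   # expected utility 1.0, hard to predict
    "predictable_but_worse": ([0.6], [1.0]),      # expected utility 0.6, fully predictable
}

for name, (utils, probs) in actions.items():
    eu = float(np.dot(utils, probs))
    score = expected_log_score(probs)
    print(f"{name}: expected utility = {eu:.2f}, expected log score = {score:.3f}")

# The predictable action earns log score 0.0 (a certain outcome), the risky
# one only log(0.5), roughly -0.693, even though its expected utility is
# higher.  Rewarding prediction accuracy alone therefore pushes the
# recommendation toward predictability rather than toward our utility.
```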
So how can we properly incentivize the agent? The paper provides a full and very simple characterization of such incentive schemes, which we call proper decision scoring rules:
We show that proper decision scoring rules cannot give the [agent] strict incentives to report any properties of the outcome distribution [...] other than its expected utility. Intuitively, rewarding the [agent] for getting anything else about the distribution right will make him [take] actions whose outcome is easy to predict as opposed to actions with high expected utility [for the principal]. Hence, the [agent’s] reward can depend only on the reported expected utility for the recommended action. [...] we then obtain four characterizations of proper decision scoring rules, two of which are analogous to existing results on proper affine scoring [...]. One of the [...] characterizations [...] has an especially intuitive interpretation in economic contexts: the principal offers shares in her project to the [agent] at some pricing schedule. The price schedule does not depend on the action chosen. Thus, given the chosen action, the [agent] is incentivized to buy shares up to the point where the price of a share exceeds the expected value of the share, thereby revealing the principal’s expected utility. Moreover, once the [agent] has some positive share in the principal’s utility, it will be (strictly) incentivized to [take] an optimal action.
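To make the share-pricing characterization concrete, here is a minimal numerical sketch in Python (my own toy construction; the specific marginal price p(t) = t and all function names are illustrative choices, not taken from the paper). The agent buys s shares in the principal’s project at an action-independent price schedule and is paid the realized value of those shares minus the cumulative price. With a strictly increasing marginal price, the reward-maximizing purchase reveals the agent’s expected utility for the recommended action, and its maximized expected reward increases with that expected utility, so recommending the action it actually believes is best is also incentivized.

```python
import numpy as np

# Illustrative sketch (my own toy setup, not code from the paper).
# The agent recommends an action, buys s shares in the principal's project,
# and is paid
#   payoff = s * U - P(s),
# where U is the principal's realized utility and P(s) is the cumulative
# price of s shares.  Here the marginal price is p(t) = t, so P(s) = s**2 / 2,
# which is strictly increasing and independent of the recommended action.

def payoff(s, realized_utility):
    """Agent's reward: value of s shares minus the cumulative price paid."""
    price_paid = s**2 / 2          # P(s) for marginal price p(t) = t
    return s * realized_utility - price_paid

def optimal_shares(expected_utility):
    """With p(t) = t, the expected payoff s*EU - s**2/2 is maximized at s = EU,
    so the chosen share quantity reveals the agent's expected utility."""
    return expected_utility

# The same two toy actions as in the earlier example:
actions = {
    "risky_but_good": {"outcomes": [0.0, 2.0], "probs": [0.5, 0.5]},   # EU = 1.0
    "predictable_but_worse": {"outcomes": [0.6], "probs": [1.0]},      # EU = 0.6
}

rng = np.random.default_rng(0)
for name, dist in actions.items():
    eu = float(np.dot(dist["outcomes"], dist["probs"]))
    s = optimal_shares(eu)
    # Monte Carlo estimate of the agent's expected reward under truthful play:
    draws = rng.choice(dist["outcomes"], p=dist["probs"], size=100_000)
    expected_reward = payoff(s, draws).mean()
    print(f"{name}: shares bought s = {s:.2f} (reveals EU), "
          f"expected reward = {expected_reward:.3f}")

# The maximized expected reward equals EU**2 / 2, which is increasing in EU,
# so the agent strictly prefers to recommend the higher-expected-utility
# action and reveals its expected utility through the share purchase.
```

In this toy setup the agent’s expected reward is about 0.5 for the risky high-expected-utility action versus 0.18 for the predictable one, reversing the incentive from the prediction-accuracy example above.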
My motivation for working on this was to address issues of decision making under logical uncertainty. For this I drew inspiration from the fact that Garrabrant et al.’s work on logical induction is also inspired by market design ideas (specifically prediction markets).
Sounds interesting! Are you going to post the reading list somewhere once it is completed?
(Sorry for self-promotion in the below!)
I have a mechanism design paper that might be of interest: Caspar Oesterheld and Vincent Conitzer: Decision Scoring Rules. WINE 2020. Extended version. Talk at CMID.
Also see Johannes Treutlein’s post on “Training goals for large language models”, which also discusses some of the above results among other things that seem like they might be a good fit for the reading group, e.g., Armstrong and O’Rourke’s work.
Yeah, the full reading list will be posted publicly once it’s finalized.

Update: the reading list has now been posted.
Thanks for the recommendation! I was planning on including something from yourself/Vince/out of FOCAL, but wasn’t sure which option to go with.