Less exploitable value-updating agent

My indifferent value learning agent design is in some ways too good. The agent transfer perfectly from u maximisers to v maximisers—but this makes them exploitable, as Benja has pointed out.

For instance, if u values paperclips and v values staples, and everyone knows that the agent will soon transfer from a u-maximiser to a v-maximiser, then an enterprising trader can sell the agent paperclips in exchange for staples, then wait for the utility change, and sell the agent back staples for paperclips, pocketing a profit each time. More prosaically, they could “borrow” £1,000,000 from the agent, promising to pay back £2,000,000 tomorrow if the agent is still a u-maximiser. And the currently u-maximising agent will accept, even though everyone knows it will change to a v-maximiser before tomorrow.

One could argue that exploitability is inevitable, given the change in utility functions. And I haven’t yet found any principled way of avoiding exploitability which preserves the indifference. But here is a tantalising quasi-example.

As before, u values paperclips and v values staples. Both are defined in terms of extra paperclips/staples over those existing in the world (and negatively in terms of destruction of existing/staples), with their zero being at the current situation. Let’s put some diminishing returns on both utilities: for each paperclips/stables created/destroyed up to the first five, u/v will gain/lose one utilon. For each subsequent paperclip/staple destroyed above five, they will gain/lose one half utilon.

We now construct our world and our agent. The world lasts two days, and has a machine that can create or destroy paperclips and staples for the cost of £1 apiece. Assume there is a tiny ε chance that the machine stops working at any given time. This ε will be ignored in all calculations; it’s there only to make the agent act sooner rather than later when the choices are equivalent (a discount rate could serve the same purpose).

The agent owns £10 and has utility function u+Xv. The value of X is unknown to the agent: it is either +1 or −1, with 50% probability, and this will be revealed at the end of the first day (you can imagine X is the output of some slow computation, or is written on the underside of a rock that will be lifted).

So what will the agent do? It’s easy to see that it can never get more than 10 utilons, as each £1 generates at most 1 utilon (we really need a unit symbol for the utilon!). And it can achieve this: it will spend £5 immediately, creating 5 paperclips, wait until X is revealed, and spend another £5 creating or destroying staples (depending on the value of X).

This looks a lot like a resource-conserving value-learning agent. I doesn’t seem to be “exploitable” in the sense Benja demonstrated. It will still accept some odd deals—one extra paperclip on the first day in exchange for all the staples in the world being destroyed, for instance. But it won’t give away resources for no advantage. And it’s not a perfect value-learning agent. But it still seems to have interesting features of non-exploitable and value-learning that are worth exploring.

Note that this property does not depend on v being symmetric around staple creation and destruction. Assume v hits diminishing returns after creating 5 staples, but after destroying only 4 of them. Then the agent will have the same behaviour as above (in that specific situation; in general, this will cause a slight change, in that the agent will slightly overvalue having money on the first day compared to the original v), and will expect to get 9.75 utilons (50% chance of 10 for X=+1, 50% chance of 9.5 for X=-1). Other changes to u and v will shift how much money is spent on different days, but the symmetry of v is not what is powering this example.