A variation on this: preference is known, but difficult to access in some sense. For example, estimates of it change over time outside the agent's control, like market data for a security tracking any given question of "expected utility", where actual preference is the dividends that haven't been voted on yet. Or there is a time-indexed sequence of utility functions that converges in some sense (probably with strong requirements that make the limit predictable in a useful way), and what matters is expected utility according to the limit of this sequence. Or there is a cost to finding out more, so that good things that happen without having to be known to be good are better, and it's useful to work out which queries to preference are worth paying for. Or there is a logical system for reasoning about preference (preference is given by a program). How do you build an agent that acts on this?
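To make the convergent-sequence case concrete, here's a minimal sketch (everything in it, including the 1/t error bound, is a hypothetical assumption for illustration): the agent sees a time-indexed sequence of utility functions u_t with a known bound on distance to the limit u, and commits to an action only once the bound can no longer change the argmax, so its choice agrees with expected utility under the limit.

```python
import math

LIMIT = {"a": 0.6, "b": 0.5}  # limit utilities, not directly visible to the agent

def u_t(t, action):
    # Converging estimates: a bounded wobble that shrinks like 1/t,
    # so |u_t(a) - u(a)| <= 1/t.
    wobble = math.sin(t * (1 if action == "a" else 2))
    return LIMIT[action] + wobble / t

def act(actions, max_steps=10_000):
    for t in range(1, max_steps + 1):
        estimates = {a: u_t(t, a) for a in actions}
        best = max(actions, key=estimates.get)
        runner_up = max(estimates[a] for a in actions if a != best)
        # If the gap exceeds twice the error bound, the ranking under the
        # limit must agree with the current ranking, so it's safe to act.
        if estimates[best] - runner_up > 2 / t:
            return best
    return None  # the estimates never separated the options within the budget

print(act(["a", "b"]))  # "a", the argmax under the limit
```

The "strong requirements" from the paragraph above show up here as the explicit error bound; without something like it, no finite prefix of the sequence licenses acting on the limit.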
Is there something intended to be an optimizer for this setting that ends up essentially doing soft optimization instead, because of the difficulty of accessing preference? One possible explanation for why this might happen is the optimizer treating its own unknown/intractable preference as adversarially assigned, as moves of the other player in a game that should be won, packaging the intractability of preference into the other player's strategy.
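One way to cash this out (a hypothetical sketch, not an established result, with made-up candidate utilities): if the "other player" gets to pick which candidate utility function is the real one after seeing the action, the agent's best response is maximin over the candidates, and it ends up choosing mild actions that no candidate rates terribly rather than the extreme optimum of any single candidate, which looks like soft optimization from the outside.

```python
# Intractable preference modeled as an adversary choosing among candidate
# utility functions. Maximin over the candidates avoids actions that are
# extreme under any single candidate.

candidates = [
    {"extreme_a": 1.0, "extreme_b": -1.0, "mild": 0.4},
    {"extreme_a": -1.0, "extreme_b": 1.0, "mild": 0.4},
]

def maximin_action(actions, candidates):
    # The adversary picks the utility function after seeing the action,
    # so each action is scored by its worst case across candidates.
    return max(actions, key=lambda a: min(u[a] for u in candidates))

print(maximin_action(["extreme_a", "extreme_b", "mild"], candidates))
# "mild": best worst case (0.4 vs. -1.0 for either extreme action)
```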
In the case of preference-as-computation, there is the usual collection of embedded agency issues: the agent might control its preference, so the question of predicting/computing it is not straightforward; the answer might depend on the agent's own behavior (which is related to demons, ASP, and control via approximate and possibly incorrect predictions); and there might be spurious proofs of preference being a certain way (an inner alignment problem for the preference specification, or for a computation that reasons about it).
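A toy illustration of the circularity (a purely hypothetical setup): the preference program's verdict consults a prediction of the agent's own choice, so naive evaluation doesn't terminate in a well-defined answer, and the agent has to search for a self-consistent choice instead. Note that in this example both actions are self-consistent, a small-scale analogue of the spurious-proof problem above.

```python
def preference(action, predicted_choice):
    # The preference program depends on a prediction of the agent's own
    # choice, so its verdict is partly under the agent's control.
    if predicted_choice == "a":
        return {"a": 1.0, "b": 0.0}[action]
    return {"a": 0.0, "b": 0.5}[action]

def agent():
    # Naive scoring is circular: evaluating an action requires predicting
    # the agent's choice. Search for a fixed point where the predicted
    # choice is the one the resulting preference verdict recommends.
    for guess in ("a", "b"):
        best = max(("a", "b"), key=lambda act: preference(act, guess))
        if best == guess:
            return guess  # prediction consistent with the choice it induces
    return "a"  # no consistent choice found; an arbitrary fallback

print(agent())  # returns "a", but "b" is also a fixed point of this setup
```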
It's often said that if an agent's preference is given by the result of running a program that's not immediately tractable, then the agent is motivated to work on computing it. How do we build a toy model of this actually happening? Probably something involving value of information, but value is still intractable at the point where the value of information needs to be noticed.
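A possible starting point for such a toy model (hypothetical numbers throughout): the preference program outputs one of two candidate utility functions, and running it has a cost; the agent compares expected utility of acting on its prior against expected utility after computing, and runs the computation exactly when the value of information exceeds the cost.

```python
# Toy model of being motivated to compute one's own preference.
candidates = {
    "u1": {"a": 1.0, "b": 0.0},
    "u2": {"a": 0.0, "b": 1.0},
}
prior = {"u1": 0.5, "u2": 0.5}  # uncertainty about the program's output

def eu_without_computing(actions):
    # Best action under the prior mixture of candidate utilities.
    return max(sum(prior[k] * candidates[k][a] for k in candidates)
               for a in actions)

def eu_after_computing(actions):
    # After running the preference program, act optimally for whichever
    # candidate it turned out to output.
    return sum(prior[k] * max(candidates[k][a] for a in actions)
               for k in candidates)

actions = ["a", "b"]
voi = eu_after_computing(actions) - eu_without_computing(actions)
cost = 0.3
print(f"value of information: {voi}")  # 1.0 - 0.5 = 0.5
print("compute preference" if voi > cost else "act on the prior")
```

Note that this sketch quietly assumes a tractable prior over the program's possible outputs, which is exactly the part the caveat above flags: noticing the value of information already requires some handle on the intractable value.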