abramdemski comments on Re-Define Intent Alignment?

abramdemski 4 Aug 2021 17:59 UTC
LW: 5 AF: 4
> (see the unidentifiability in IRL paper)
Ah, I wasn’t aware of this!
> Btw, if you’re aware of any counterpoints to this, in particular anything like a clearly worked-out counterexample showing that one can’t carve up a world, or recover a consistent utility function through this sort of process, please let me know. I’m directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.
I’m not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a “non-naturalistic” assumption, which simply makes me think a framework is more artificial/fragile.
For example, AIXI assumes a hard boundary between agent and environment. One manifestation of this assumption is that AIXI doesn’t predict its own future actions the way it predicts everything else; instead, it must explicitly plan them. This is necessary because AIXI is not computable, so treating the future self as part of the environment (and predicting it with the same predictive machinery as everything else) would violate the assumption of a computable environment. But this is unfortunate for a few reasons. First, it forces AIXI to have an arbitrary finite planning horizon, which is weird for something that is supposed to represent unbounded intelligence. Second, there is no reason to carry this split between planning and prediction over to finite, computable agents; so it weakens the generality of the model by introducing a design detail that depends heavily on the specific infinite setting.
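For reference, the split between planning and prediction shows up directly in the usual statement of AIXI’s action choice. The notation below follows Hutter’s standard presentation and is not taken from this thread: $U$ is a universal Turing machine, $q$ ranges over environment programs, $\ell(q)$ is the length of $q$, $o_i r_i$ is the observation and reward at step $i$, and $m$ is the horizon.

```latex
a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
       \bigl[ r_k + \cdots + r_m \bigr]
       \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

The agent’s own actions $a_i$ are chosen by explicit $\max$ operators, whereas the percepts $o_i r_i$ are summed over under the mixture of programs. The upper index $m$ is the finite planning horizon discussed above.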
Another example would be game-theoretic reasoning. Suppose I am concerned about cooperative behavior in deployed AI systems. I might work on something like the equilibrium selection problem in game theory, looking for rationality concepts which can select cooperative equilibria where they exist. However, this kind of work will typically treat a “game” as something which inherently comes with a pointer to the other agents. This limits the real-world applicability of such results, because to apply them to real AI systems, those systems would need “agent pointers” as well. Creating an AI system which identifies “agents” in its environment is a difficult engineering problem; and even assuming away the engineering challenges, there are serious philosophical difficulties (what really counts as an “agent”?).
We could try to tackle those difficulties, but my default expectation is that doing so would result in fairly brittle abstractions with weird failure modes.
Instead, I would advocate for Pavlov-like strategies which do not depend on actually identifying “agents” in order to have cooperative properties. I expect these to be more robust and present fewer technical challenges.
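As a minimal sketch of what such a strategy can look like, here is win-stay, lose-shift in an iterated prisoner’s dilemma, assuming that is roughly what “Pavlov-like” refers to. The payoff values, the `ASPIRATION` threshold, and the function names are illustrative assumptions, not anything specified in this thread. The point is only that `pavlov` consumes its own last action and its own last payoff, with no representation of an “other agent”.

```python
# Row player's payoff for (own action, other action).
# Conventional illustrative prisoner's dilemma values (an assumption).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Hypothetical aspiration level: a payoff above this counts as a "win".
ASPIRATION = 2


def pavlov(last_action, last_payoff):
    """Win-stay, lose-shift, using only the player's own action and payoff."""
    if last_payoff > ASPIRATION:
        return last_action  # win: repeat what worked
    return "D" if last_action == "C" else "C"  # lose: switch


def play(rounds, first_a="C", first_b="C"):
    """Run two Pavlov players against each other and return the action pairs."""
    a, b = first_a, first_b
    history = []
    for _ in range(rounds):
        history.append((a, b))
        payoff_a, payoff_b = PAYOFF[(a, b)], PAYOFF[(b, a)]
        # Each player updates from its own payoff signal alone.
        a, b = pavlov(a, payoff_a), pavlov(b, payoff_b)
    return history


if __name__ == "__main__":
    # Start from a single defection and watch cooperation get restored.
    print(play(5, first_a="D", first_b="C"))
```

With these payoffs, a one-off defection leads to one round of mutual defection, after which both players shift back to cooperation and stay there. Nothing in the update rule required either player to identify who or what produced the payoff.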
Of course, this general heuristic may not turn out to apply in the specific case we are discussing. If you control the training process, then, for the duration of training, you control the agent and the environment, and these concepts seem unproblematic. However, it does seem unrealistic to really check every environment; so to establish strong guarantees, it seems you’d need to do worst-case reasoning over arbitrary environments rather than checking individual environments in detail. This is how I was mainly interpreting jbkjr; perturbation sets could be a way to make things more feasible (at a cost).
- Edouard Harris 4 Aug 2021 19:00 UTC
  LW: 1 AF: 1
  > I’m not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a “non-naturalistic” assumption, which simply makes me think a framework is more artificial/fragile.
  Oh for sure. I wouldn’t recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a second boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally real object here is always going to be the whole system, absolutely.
  As I understand it, something like AIXI forces you to draw one particular boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I’d agree that sort of thing is more fragile.
  The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than two pieces. Again, it would be neat to investigate a setting like this with different choices of boundaries and see whether some choices have more interesting properties than others.