I remain confused about why this is supposed to be a core difficulty for building AI, or for aligning it.
You’ve shown that if one proceeds naively, there is no way to make an agent that’d model the world perfectly, because it would need to model itself.
But real agents can’t model the world perfectly anyway. They have limited compute, so they need to rely on clever abstractions that model the environment well in most situations without costing too much to evaluate. That (presumably) includes abstractions about the agent itself.
It seems to me that that’s how humans do it. Humans just have not-that-great models of themselves. Anything at the granularity of precisely modelling themselves thinking about their model of themselves thinking about their model of themselves is right out.
If there’s some simple abstraction to be made about what the result of such third-order-and-above loops would be, like “If I keep sitting here modelling my self-modelling, that process will never converge and I will starve”, a human might use it. Otherwise, we just decline to model anything past a few loops, and live with the slight imprecision in our models that this brings.
Why would an AI be any different? And even if it were different, even if there were some systematically neater way to do self-modelling, why would that matter? I don’t think an AI needs to model these loops in any detail to figure out how to kill us, or to understand its own structure well enough to know how to improve its own design while preserving its goals.
“Humans made me out of a crappy AI architecture, I figured out an algorithm to translate systems like mine into a different architecture that costs 1/1000th the compute” doesn’t require modelling (m)any loops. Neither does “This is the part of myself that encodes my desires. Better preserve that structure!”
I get that having a theory for how to make a perfectly accurate intelligence in the limit of having infinite everything is maybe neat as a starting point for reasoning about more limited systems. And if that theory seems to have a flaw, that’s disconcerting if you’re used to treating it as your mathematical basis for reasoning about other things.
But if that flaw seems tightly tied to that whole “demand perfect accuracy” requirement, I don’t see why it’s worth chasing a correction to a theory that’s only a very imperfect proxy for, and limiting case of, real intelligence. Why not just acknowledge that AIXI and related frames are not suited for reasoning about situations in which the agent isn’t far larger than the thing it’s supposed to model?
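(For concreteness, here’s the realisability assumption as I understand it, and a rough sketch of why embeddedness breaks it; the notation for the hypothesis class, the true environment, and description length is mine, not the post’s.)

```latex
% Realisability ("grain of truth"), in my own notation rather than the
% post's: a Bayesian agent reasons over a hypothesis class \mathcal{M}
% and assumes the true environment \mu lies inside it.
\[
  \mu \in \mathcal{M}.
\]
% Rough embeddedness obstruction: write K(\cdot) for description length.
% If every hypothesis the agent can represent fits within C bits, while
% the true environment contains the agent itself and therefore needs
% more than C bits to specify, then
\[
  K(\mu) > C \;\geq\; \max_{\nu \in \mathcal{M}} K(\nu)
  \quad\Longrightarrow\quad
  \mu \notin \mathcal{M},
\]
% i.e. perfect modelling is only available to an agent far "larger" than
% the environment it is supposed to model.
```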
I just don’t see how the self-modelling difficulty is anything other than part of the broader issue that real intelligence needs to figure out imperfect but useful abstractions about the world to make predictions. Why not try to understand the mathematics of that process instead, if you want to understand intelligence? What makes self-modelling look like a special case and a good attack point to the people who pursue this?
Firstly, thanks for reading the post! I think you’re mainly referring to realisability here, which I’m not that clued up on, to be honest, but I’ll give you my two cents anyway.
I’m not sure to what extent we should focus on unrealisability when aligning systems. I have a similar intuition to yours: the important question is probably “how can we get good abstractions of the world, given that we cannot model it perfectly?” However, I suspect there are better arguments for why unrealisability is a core problem in alignment than the ones I’ve laid out; I just haven’t read that much into it yet. I’ll link again to this video series on IB (which I’m yet to finish), as I think it probably contains some good arguments.