It seems almost tautologically true that you can’t accurately predict what an agent will do without actually running the agent. Because, any algorithm that accurately predicts an agent can itself be regarded as an instance of the same agent.
That seems manifestly false. You can figure out whether an algorithm halts or not without being accidentally stuck in an infinite loop. You can look at the recursive Fibonacci algorithm and figure out what it would do without ever running it. So there is a clear distinction between analyzing an algorithm and executing it. If anything, one would know more about the agent by using the techniques from analysis of algorithms than the agent would ever know about themselves.
Of course you can predict some properties of what an agent will do. In particular, I hope that we will eventually have AGI algorithms that satisfy provable safety guarantees. But, you can’t make exact predictions. In fact, there probably is a mathematical law that limits how accurate predictions you can get.
An optimization algorithm is, by definition, something that transforms computational resources into utility. So, if your prediction is so close to the real output that it has similar utility, then it means the way you produced this prediction involved the same product of “optimization power per unit of resources” by “amount of resources invested” (roughly speaking, I don’t claim to already know the correct formalism for this). So you would need to either (i) run a similar algorithm with similar resources or (ii) run a dumber algorithm but with more resources or (iii) use less resources but an even smarter algorithm.
So, if you want to accurately predict the output of a powerful optimization algorithm, your prediction algorithm would usually have to be either a powerful optimization algorithm in itself (cases i and iii) or prohibitively costly to run (case ii). The exception is cases when the optimization problem is easy, so a dumb algorithm can solve it without much resources (or a human can figure out the answer by emself).
That seems manifestly false. You can figure out whether an algorithm halts or not without being accidentally stuck in an infinite loop. You can look at the recursive Fibonacci algorithm and figure out what it would do without ever running it. So there is a clear distinction between analyzing an algorithm and executing it. If anything, one would know more about the agent by using the techniques from analysis of algorithms than the agent would ever know about themselves.
In special cases, not in the general case.
Of course you can predict some properties of what an agent will do. In particular, I hope that we will eventually have AGI algorithms that satisfy provable safety guarantees. But, you can’t make exact predictions. In fact, there probably is a mathematical law that limits how accurate predictions you can get.
An optimization algorithm is, by definition, something that transforms computational resources into utility. So, if your prediction is so close to the real output that it has similar utility, then it means the way you produced this prediction involved the same product of “optimization power per unit of resources” by “amount of resources invested” (roughly speaking, I don’t claim to already know the correct formalism for this). So you would need to either (i) run a similar algorithm with similar resources or (ii) run a dumber algorithm but with more resources or (iii) use less resources but an even smarter algorithm.
So, if you want to accurately predict the output of a powerful optimization algorithm, your prediction algorithm would usually have to be either a powerful optimization algorithm in itself (cases i and iii) or prohibitively costly to run (case ii). The exception is cases when the optimization problem is easy, so a dumb algorithm can solve it without much resources (or a human can figure out the answer by emself).