It seems almost tautologically true that you can’t accurately predict what an agent will do without actually running the agent, because any algorithm that accurately predicts an agent can itself be regarded as an instance of the same agent.
What I expect the abstract theory of intelligence to do is something like produce a categorization of agents in terms of qualitative properties. As for whether that’s closer to “momentum” or “fitness”, I’m not sure the question is even meaningful.
I think the closest analogy is: the abstract theory of intelligence is to AI engineering as complexity theory is to algorithm design. Knowing the complexity class of a problem doesn’t tell you the best practical way to solve it, but it does give you important hints. (For example, if the problem has exponential time complexity then you can only expect to solve it either for small inputs or in some special cases, and average-case complexity tells you whether these cases need to be very special or not. If the problem is in NC then you know that it’s possible to gain a lot from parallelization. If the problem is in NP then at least you can test candidate solutions, et cetera.)
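To make the NP point concrete, here is a minimal sketch (Python, with subset-sum chosen arbitrarily as the NP-complete stand-in): checking a proposed certificate is cheap even when finding one by brute force takes time exponential in the input size.

```python
from itertools import combinations

def verify_subset_sum(numbers, target, candidate):
    """Polynomial-time verifier: is `candidate` a sub-multiset of `numbers`
    that sums to `target`? This is the 'you can at least test solutions' part of NP."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False
        pool.remove(x)  # each element may be used at most once
    return sum(candidate) == target

def solve_subset_sum(numbers, target):
    """Brute-force solver: tries every subset, exponential in len(numbers)."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

nums = [3, 34, 4, 12, 5, 2]
print(verify_subset_sum(nums, 9, (4, 5)))  # cheap check: True
print(solve_subset_sum(nums, 9))           # expensive search in general: (4, 5)
```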
And also, the abstract theory of alignment should be to AI safety as complexity theory is to cryptography. Once again, many practical considerations are not covered by the abstract theory, but it does tell you what kind of guarantees you can expect and when. (For example, in cryptography we can (sort of) know that a certain protocol has theoretical guarantees, but there is still engineering work in finding a practical implementation and in ensuring that the assumptions of the theory hold in the real system.)
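As a toy sketch of “ensuring the assumptions of the theory hold in the real system” (Python, with a one-time-pad-style XOR standing in for “protocol with a theoretical guarantee”): the perfect-secrecy guarantee assumes a uniformly random, never-reused key at least as long as the message, and a general-purpose PRNG quietly violates that assumption while an OS-level CSPRNG does not.

```python
import random
import secrets

def xor_encrypt(key: bytes, message: bytes) -> bytes:
    """One-time-pad-style XOR. The textbook guarantee (perfect secrecy) assumes
    the key is uniformly random, at least as long as the message, and never reused."""
    assert len(key) >= len(message)
    return bytes(k ^ m for k, m in zip(key, message))

msg = b"attack at dawn"

# Violates the theory's assumption: Mersenne Twister output can be reconstructed
# from enough observed values, so this key is not unpredictable to an attacker.
weak_key = bytes(random.getrandbits(8) for _ in range(len(msg)))

# Much closer to the assumption: keys drawn from the OS-level CSPRNG.
strong_key = secrets.token_bytes(len(msg))

print(xor_encrypt(weak_key, msg).hex())
print(xor_encrypt(strong_key, msg).hex())
```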
It seems almost tautologically true that you can’t accurately predict what an agent will do without actually running the agent, because any algorithm that accurately predicts an agent can itself be regarded as an instance of the same agent.
That seems manifestly false. You can figure out whether an algorithm halts or not without being accidentally stuck in an infinite loop. You can look at the recursive Fibonacci algorithm and figure out what it would do without ever running it. So there is a clear distinction between analyzing an algorithm and executing it. If anything, one would know more about the agent by using techniques from the analysis of algorithms than the agent would ever know about itself.
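For instance, a minimal sketch of analysis versus execution (Python, with naive recursion assumed as the Fibonacci algorithm in question): both the return value and the amount of work the recursion would do can be predicted from the recurrence alone, without executing the exponential algorithm.

```python
def naive_fib(n):
    """The exponentially slow recursion we are reasoning about, not running."""
    return n if n < 2 else naive_fib(n - 1) + naive_fib(n - 2)

def predict_value(n):
    """Predict naive_fib(n)'s return value by iterating the recurrence in O(n)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def predict_call_count(n):
    """Predict how many calls naive_fib(n) would make:
    C(0) = C(1) = 1 and C(n) = C(n-1) + C(n-2) + 1."""
    c = {0: 1, 1: 1}
    for k in range(2, n + 1):
        c[k] = c[k - 1] + c[k - 2] + 1
    return c[n]

print(predict_value(80))       # 23416728348467685, found without the recursion
print(predict_call_count(80))  # the tens of quadrillions of calls we avoided
```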
You can figure out whether an algorithm halts or not without being accidentally stuck in an infinite loop.
In special cases, not in the general case.
Of course you can predict some properties of what an agent will do. In particular, I hope that we will eventually have AGI algorithms that satisfy provable safety guarantees. But you can’t make exact predictions. In fact, there is probably a mathematical law that limits how accurate your predictions can be.
An optimization algorithm is, by definition, something that transforms computational resources into utility. So, if your prediction is so close to the real output that it has similar utility, then the way you produced this prediction must have involved a comparable product of “optimization power per unit of resources” and “amount of resources invested” (roughly speaking; I don’t claim to already know the correct formalism for this). So you would need to either (i) run a similar algorithm with similar resources, (ii) run a dumber algorithm but with more resources, or (iii) use fewer resources but an even smarter algorithm.
So, if you want to accurately predict the output of a powerful optimization algorithm, your prediction algorithm would usually have to be either a powerful optimization algorithm itself (cases i and iii) or prohibitively costly to run (case ii). The exception is when the optimization problem is easy, so that a dumb algorithm can solve it without many resources (or a human can figure out the answer by emself).
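Here is a toy sketch of that resource argument (Python, with random search standing in for “optimization algorithm” and a hidden optimum defining utility): a guess whose utility is close to what the agent achieves requires the predictor to spend a comparable search budget of its own.

```python
import random

TARGET = 0.372901  # hidden optimum; utility rewards getting close to it

def utility(x: float) -> float:
    return -abs(x - TARGET)

def optimize(budget: int, rng: random.Random) -> float:
    """Random search: best of `budget` uniform samples. Expected utility is
    roughly -1/(2 * budget), i.e. utility is bought with compute here."""
    return max((rng.random() for _ in range(budget)), key=utility)

agent = optimize(10_000, random.Random(1))    # the optimizer we want to predict
cheap = optimize(100, random.Random(2))       # "predictor" with 1% of the resources
matched = optimize(10_000, random.Random(3))  # "predictor" with matched resources

print(f"agent:             {utility(agent):.6f}")
print(f"cheap predictor:   {utility(cheap):.6f}")   # ~100x worse in expectation
print(f"matched predictor: {utility(matched):.6f}") # comparable to the agent
```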