A basic mathematical structure of intelligence
An important concept here on LW is that of an intelligence singularity, or at least a very rapid growth of intelligence. Although it seems mostly hopeless, it would be nice if we could find a mathematical approach to quantify these things. I think the first point to note is that intelligence is surely not one-dimensional. The concept of general intelligence suggests that it might be approximately one-dimensional, in some sense. But how can we really know what is true? I would like to get a better understanding of the geometry of intelligence. To do this, a good starting point is to look for natural operations on intelligence, as operations can lead to geometry.
I see two obvious natural operations on intelligence: Acceleration and Cooperation. Acceleration means speeding up an intelligence or giving it more computation/physical time. Cooperation means taking two intelligences and letting them cooperate on a task.
You may note that this is somewhat similar to sequential and parallel computing, although I cannot rigorously state the connections yet.
My main message in this post is that the “set of intelligences” with the operations of acceleration and cooperation has the mathematical structure of a semi-module. I believe that further mathematical study of the “set of intelligences” is likely to take note of this fundamental structure.
In the rest of the post I will elaborate on what exactly this means.
Consider an abstract set $\mathcal{M}$ of all “intelligent agents”, where I do not want to give an explicit definition. I think there are many equivalent and non-equivalent ways to define this, but for now an intuitive idea suffices. An intelligent agent is a thing that can be given a task and an amount of time or computation to solve it. The result can then be scored. We therefore think of a task as a function $T: \mathcal{M} \to \mathbb{R}_+$.
Given an agent $M \in \mathcal{M}$ and a set of tasks $\mathcal{T}$, the overall performance of the agent is represented by the induced function $\mathcal{T} \to \mathbb{R}_+$, $T \mapsto T(M)$. It is this function which grows during a rapid growth of intelligence, and we would like to know exactly how it might grow. I cannot answer this, but finding structure is a good start.
We endow the set $\mathcal{M}$ with an operation $+$. For $M, N \in \mathcal{M}$, the agent $M + N$ represents a team consisting of $M$ and $N$, which may now collaborate on any task. Again, I do not wish to rigorously define this, but an example of a rigorous definition could be to model the agents as entities that write into a memory while they still have time/computation left, and that may indicate something they wrote as an answer. The collaboration of two agents is then given by having them both write into a shared memory. They might collaborate and perform better at certain tasks than each on their own, but a priori there also exist agents which simply have no concept of this and do not collaborate, or which misunderstand or even deceive their partner. In general, for some task $T$ the inequality
$T(M + N) \geq \max\{T(M), T(N)\}$
might fail. More on this later.
Acceleration is represented by a semigroup $(\mathbb{R}_+, \cdot)$. Specifically, for an agent $M \in \mathcal{M}$ and a number $r \in \mathbb{R}_+$ we define an agent $r \cdot M$ which represents $M$ but sped up by a factor $r$, or equivalently given more time/computation by a factor $r$.
We now state a number of axioms that the structure $(\mathcal{M}, +, \cdot)$ fulfills, according to intuition.
(A1) We have $M + N = N + M$ for all $M, N \in \mathcal{M}$, i.e. cooperation is a commutative operation.
(A2) We have $(M + N) + K = M + (N + K)$ for all $M, N, K \in \mathcal{M}$, i.e. cooperation is associative (note that we assume that cooperation happens in real-time or in a large number of “turns”, meaning that it does not matter who goes first and who goes last when exchanging information).
(A3) There exists a trivial agent $0 \in \mathcal{M}$ so that $M + 0 = M$ for all $M \in \mathcal{M}$. This trivial agent simply gives no answer and contributes nothing to solving any task.
This makes $(\mathcal{M}, +)$ a commutative semigroup with a neutral element. There are two axioms connecting our structures.
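(S1) $r \cdot (s \cdot M) = (rs) \cdot M$ for all $r, s \in \mathbb{R}_+$ and $M \in \mathcal{M}$. It does not matter if you accelerate a model 2x and then 3x, or just once 6x. This means that $(\mathbb{R}_+, \cdot)$ is a semigroup acting on $\mathcal{M}$.
(S2) $r \cdot (M + N) = r \cdot M + r \cdot N$ for all $r \in \mathbb{R}_+$ and $M, N \in \mathcal{M}$. This is an axiom of distributivity which states that the order of the actions of acceleration and cooperation is interchangeable.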
With all these axioms, we can say that $\mathcal{M}$ is an $\mathbb{R}_+$-semimodule. What does this mean? Well, for a start, you can think that we almost have a vector space, except that there are no inverse elements, i.e. we have semigroups instead of groups, and the distributivity axiom
(S3) $(r + s) \cdot M = r \cdot M + s \cdot M$
does not hold. In fact, it is quite important that in general $T((r + s) \cdot M) \neq T(r \cdot M + s \cdot M)$. The examples are non-parallelizable tasks. One might in fact define that a model $M$ parallelizes over a task $T$ if $T((r + s) \cdot M) = T(r \cdot M + s \cdot M)$ for every $r, s \in \mathbb{R}_+$. This is not supposed to be a good definition; I just want to demonstrate that useful definitions may pop out of this approach.
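To make the axioms concrete, here is a minimal Python sketch of one possible toy model. Everything in it is an assumption made for illustration and not part of the post: an agent is modelled as a multiset of worker speeds, cooperation pools the workers, acceleration scales every speed, and the two task functions `parallel_task` and `sequential_task` are hypothetical scoring rules. The sketch checks (A1) to (A3), (S1) and (S2), and shows that (S3) fails while the parallelization condition still holds for one task and not the other.

```python
from fractions import Fraction
from typing import Tuple

# Hypothetical toy model: an agent is a sorted tuple of worker speeds.
Agent = Tuple[Fraction, ...]

def agent(*speeds) -> Agent:
    """Build an agent from worker speeds (stored sorted, so order is irrelevant)."""
    return tuple(sorted(Fraction(s) for s in speeds))

TRIVIAL: Agent = ()  # the trivial agent: no workers, contributes nothing

def cooperate(m: Agent, n: Agent) -> Agent:
    """M + N: pool the workers of both agents (multiset union)."""
    return tuple(sorted(m + n))

def accelerate(r, m: Agent) -> Agent:
    """r . M: speed every worker up by the positive factor r."""
    return tuple(sorted(Fraction(r) * s for s in m))

def parallel_task(m: Agent) -> Fraction:
    """Assumed score on a fully parallelizable task: total throughput."""
    return sum(m, Fraction(0))

def sequential_task(m: Agent) -> Fraction:
    """Assumed score on a non-parallelizable task: only the fastest worker counts."""
    return max(m, default=Fraction(0))

M, N, K = agent(1, 2), agent(3), agent(1, 5)
r, s = 2, 3

# (A1)-(A3): cooperation is commutative, associative, and has a neutral element.
assert cooperate(M, N) == cooperate(N, M)
assert cooperate(cooperate(M, N), K) == cooperate(M, cooperate(N, K))
assert cooperate(M, TRIVIAL) == M

# (S1) and (S2): acceleration is a semigroup action that distributes over cooperation.
assert accelerate(r, accelerate(s, M)) == accelerate(r * s, M)
assert accelerate(r, cooperate(M, N)) == cooperate(accelerate(r, M), accelerate(r, N))

# (S3) fails: one agent sped up (r + s)-fold is not a team of an r-fold and an s-fold copy.
lhs = accelerate(r + s, M)
rhs = cooperate(accelerate(r, M), accelerate(s, M))
assert lhs != rhs

# The two sides agree on the parallelizable task but not on the sequential one.
assert parallel_task(lhs) == parallel_task(rhs)
assert sequential_task(lhs) > sequential_task(rhs)
```

In this toy model every agent parallelizes over `parallel_task` in the sense defined above, and no non-trivial agent parallelizes over `sequential_task`. Exact fractions are used only so that the equality checks are not disturbed by floating-point rounding.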
I would now like to state a number of inequalities which are certainly not true for all agents and tasks, but which I believe should be true for all reasonable tasks and for all agents which have some form of general intelligence, understanding of cooperation, and actually want to perform well on the task. Whether and when these inequalities hold may be a question of interest and a source of new definitions.
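(H1) $T(r \cdot M) \geq T(s \cdot M)$ if $r \geq s$, and vice versa. More computation time means better performance.
(H2) $T(M + N) \geq \max\{T(M), T(N)\}$. This means that you manage to actually cooperate beneficially.
(H3) $T(n \cdot M) \geq T(M + \dots + M)$, with $n$ copies of $M$ on the right, for any $n \in \mathbb{N}$. This means that cooperating with copies of yourself is never superior to being alone, but accelerated accordingly.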
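As a rough illustration of the last inequality, consider a toy model that is purely an assumption made here: an agent $M$ is a finite multiset of worker speeds $v_1, \dots, v_k > 0$, cooperation pools the workers, and acceleration by $n$ multiplies every speed by $n$. Take two hypothetical tasks, a fully parallelizable one scored by total throughput, $T_{\mathrm{par}}(M) = \sum_i v_i$, and a non-parallelizable one scored by the fastest worker, $T_{\mathrm{seq}}(M) = \max_i v_i$. Then the inequality holds with equality in the first case and strictly (for $n \geq 2$) in the second:

```latex
\begin{align*}
T_{\mathrm{par}}(n \cdot M)
  &= \sum_i n v_i = n \sum_i v_i
   = T_{\mathrm{par}}(\underbrace{M + \dots + M}_{n \text{ copies}}), \\
T_{\mathrm{seq}}(n \cdot M)
  &= n \max_i v_i \;\geq\; \max_i v_i
   = T_{\mathrm{seq}}(\underbrace{M + \dots + M}_{n \text{ copies}}).
\end{align*}
```

Under these assumptions, the gap between the two sides of the inequality is one way to measure how far a task is from being parallelizable.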
Here are a number of questions that are potential directions for new insight:
(Q1) Are there other obvious/natural operations on intelligence? How does this augment the structure of our semi-module? Could there be a multiplication of agents?
(Q2) Now that we have an algebraic structure on the set of agents, what about the set of tasks? Can we define morphisms from agents to tasks? Note that the maps here are generally non-linear.
(Q3) Are there other intuitive inequalities? Can we decompose the tasks into various categories using such definitions (parallelizable, non-parallelizable etc.)?
(Q4) Are there already natural geometries we can define on this structure (metrics, topologies or inner products)?
(Q5) How might a blow-up of intelligence look? Perhaps we should only expect blow-up of the form $T(M + \dots + M)$ but not $T(n \cdot M)$, as an AGI can copy itself and cooperate but not trivially speed up its computing hardware $n$-fold. This is what I mean when I say that the function might blow up in a “certain way”, i.e. only on parallelizable tasks.
Thank you for reading my post!
An AGI may not be able to speed up its hardware early on, but it can find algorithmic improvements for itself, perhaps enough for many orders of magnitude speedup. So that’s something to take into account.
If we assume that agents are allowed to make choices stochastically, there is a natural topology. It’s basically a product topology. I wonder what the subspace of “reasonable” agents looks like?
What do you mean by a product topology here? The product topology used for stochastic processes? That requires a topology on the state space in the first place. Right now I have not specified any topologies.
Regarding the stochastic aspect, I have thought about that, but so far I have not seen a benefit in including it, because any stochastic approach can be seen as just a deterministic approach on the level of distributions. I.e. if a model M is actually a random variable, and a task T is also a random variable, then the important thing, which is the function M×T⟶R+ and which would now be a random object, can be replaced by a function P(M)×P(T)⟶P(R+). I.e. we map distributions of models and tasks to distributions of scores.
Nevertheless on a bit of a different note, consider the following.
I described a task as something which a model can generate an answer to, which is then somehow scored. If instead we consider the score of a model on a task to represent the expected number of correct answers given a large number of tries, then we can say that
T(r⋅M)=r⋅T(M), i.e. we get a new axiom! This states that tasks are not just any functions, but 1-homogeneous functions. But tasks are certainly not linear, as cooperation of a model with itself may bring no improvement on non-parallelizable tasks.
To clarify, suppose that the agents are chatbots. Then, given a sequence of previous messages M, each agent outputs a probability distribution over the next message that it wants to say. For example, if the task is rock-paper-scissors, it would output a probability distribution with three possible outputs, “rock”, “paper”, and “scissors”, each with 1⁄3 probability.
Under this structure, there is a product topology indexed over the set of sequences of messages.
If you only want to use the structure defined in the post, another topology would be the finest topology that makes your two operations continuous.
I feel like H2 shouldn’t be true due to the no free lunch theorem. If X + Y is better than X at some task, it must be worse than X for some other task.
This depends on your ontology of course, but just thought I’d point out a case where it fails.
Although we cannot rigorously say this yet, since we have not chosen a definition of agent, I think this intuitively applies and therefore (H2) can only hold when you are restricted to some set of tasks, perhaps “reasonable tasks”, yea.
I wonder if in the stochastic interpretation of tasks this issue disappears, because “No Free Lunch” tasks that “diagonalize” against a model in a particular fashion have very low probability.