Detecting agents and subagents
A putative new idea for AI control; index here.
Unfortunately, it’s extremely hard to identify an agent. Agents need not come conveniently packaged in some “sensors-actuators-internal structure-utility function” form. If you wanted to obfuscate an agent, hiding it in the natural movements of the solar system or in the gyrations of the internet, this would be easy to do and hard to detect, and it would be very hard even to define what you would be looking for.
Fortunately, it’s much easier to detect superintelligent AIs that have a major impact on the world, i.e. the kind of agents that we would be worried about. Intuitively, this is true: if you suddenly find half the sky getting optimised for space colonisation, it’s likely there’s an AI somewhere there. How can this be formalised?
The importance of the agent
Imagine taking a slice of space-time around the moment when a superintelligent agent just got turned on. If you wanted to predict the future, what piece of information is most relevant in that time slice? Obviously the superintelligent agent. We could run the ideas for reduced impact in reverse, looking for the part that is of maximum impact.
I would therefore define the agent as the part of this slice that it’s the most valuable to know about, or the part that it would be the most valuable to change, counterfactually, if such changes were possible. Note that this definition doesn’t exactly identify agents, but if it misidentifies something, then that something must also be of great importance.
If we imagine the slice of space-time sweeping forwards in time from the past, then events in space-time will have varying informativeness/importance. As we approach the creation of the superintelligent AI, importance gets concentrated around the run-up to its creation, before being maximally concentrated at the creation of the AI (or its escape, if it was confined).
For a more formal definition, I’d imagine a passive, pure resource-gathering agent A being fed the details of the time slice, and only being able to pass on a limited amount of the information to another (active) copy of itself, and seeing what it passed on. The data passed on would be the highest priority for the active copy, so would almost certainly include the existence of an enemy agent (almost all agents are “enemies” to resource-gathering agents, as they use up precious resources).
Alternatively, we could give A the option of changing some of the data—flipping some real-world bits, at least conceptually—and seeing which bits it preferred to flip. This definition can no doubt be improved by adding noise or other variations.
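To make the bit-flipping version concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the three-bit "time slice", the feature names and the resource numbers are invented, and a real slice of space-time is obviously not a handful of independent bits. It scores each bit by how much A's future resources change, on average, when that bit alone is flipped.

```python
from itertools import product

# Hypothetical toy world: three bits describing the time slice.
FEATURES = ("rival_ai_on", "storm", "coin")

def resources_for_A(state):
    """Resources left for the gatherer A in the future (made-up numbers)."""
    total = 100
    if state["rival_ai_on"]:
        total -= 90  # a powerful rival uses up most of the resources
    if state["storm"]:
        total -= 5   # a minor natural event
    return total     # "coin" has no effect at all

def flip_importance(feature):
    """Average absolute change in A's outcome when one bit is flipped."""
    diffs = []
    for values in product([0, 1], repeat=len(FEATURES)):
        state = dict(zip(FEATURES, values))
        flipped = dict(state, **{feature: 1 - state[feature]})
        diffs.append(abs(resources_for_A(state) - resources_for_A(flipped)))
    return sum(diffs) / len(diffs)

importance = {f: flip_importance(f) for f in FEATURES}
most_important = max(importance, key=importance.get)
print(importance, most_important)
```

In this toy world the rival AI's "on" bit dominates the ranking, which is the behaviour the definition is meant to capture; anything else that scored as highly would also be worth knowing about.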
Now with counterfactuals and false miracles we might be able to actually construct that situation, or something approximating it. Even without that, this is a formal definition that seems to home in well on the concept of agent detection.
Detecting subagents
Now that we’ve detected agents, can we detect subagents? It seems that we can. A powerful subagent can be detected in the same way that a powerful agent can be. But the subagent has some extra properties; for instance, knowing about the presence of the subagent could give strong evidence of the existence of a powerful agent in the past. This could correspond to agent A finding the existence of the powerful past agent to be important, the existence of the subagent to be important, but the combination of the two facts to not be as important as you’d expect, since each fact already gives away much of the other. You could also compare the importance of knowing “past agent + subagent” versus “past agent + subagent whose motivation is flipped by a thermodynamic miracle”.
We could also track the “importance” of the agent forward through time. Any sudden jump downwards in importance could correspond either to a failure or to the creation of a subagent; a sudden jump of importance from one location to another would be an indication of the latter.
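One way to illustrate "not as important as you'd expect" is with mutual information. The sketch below is only a stand-in for A's notion of importance, and the joint distribution is invented: a past agent exists half the time, it usually builds a subagent, and the future is optimised whenever the subagent exists. Knowing either fact alone is very informative about the future, but knowing both adds little beyond knowing one.

```python
from collections import defaultdict
from math import log2

# Hypothetical joint distribution over (past_agent, subagent, optimised_future).
joint = {
    (0, 0, 0): 0.500,  # no past agent: no subagent, nothing happens
    (1, 1, 1): 0.450,  # past agent built a subagent, which optimises the future
    (1, 0, 1): 0.025,  # past agent alone succeeds...
    (1, 0, 0): 0.025,  # ...or fails, with equal probability
}

def marginal(dist, idx):
    """Marginal distribution over the variables at the given indices."""
    out = defaultdict(float)
    for outcome, p in dist.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def mutual_information(dist, xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return (entropy(marginal(dist, xs)) + entropy(marginal(dist, ys))
            - entropy(marginal(dist, xs + ys)))

FUTURE = (2,)
i_past = mutual_information(joint, (0,), FUTURE)
i_sub = mutual_information(joint, (1,), FUTURE)
i_both = mutual_information(joint, (0, 1), FUTURE)
print(i_past, i_sub, i_both)  # i_both is well below i_past + i_sub
```

The gap between the sum of the individual values and the joint value is the redundancy between the two facts; a large gap is the signature described above.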
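As a rough sketch of tracking importance through time, suppose we already had an importance score per location at each time step (how to obtain those scores is the hard part, and is simply assumed here). The hypothetical helper `detect_transfers` and its `drop` threshold are illustrative choices: it flags a sharp fall at one location and labels it a possible subagent if some other location gained importance at the same step, or a possible failure otherwise.

```python
def detect_transfers(history, drop=0.5):
    """history: list of dicts mapping location -> importance, one per time step.

    Returns (time, location, kind, target) tuples for every sharp drop.
    """
    events = []
    for t in range(1, len(history)):
        prev, cur = history[t - 1], history[t]
        for loc, old in prev.items():
            new = cur.get(loc, 0.0)
            if old > 0 and (old - new) / old >= drop:
                # Importance fell sharply here; did it reappear elsewhere?
                gains = {
                    other: cur[other] - prev.get(other, 0.0)
                    for other in cur
                    if other != loc and cur[other] > prev.get(other, 0.0)
                }
                if gains:
                    events.append((t, loc, "possible subagent", max(gains, key=gains.get)))
                else:
                    events.append((t, loc, "possible failure", None))
    return events

# Made-up example: importance jumps from the lab to a probe on Mars.
history = [
    {"lab": 0.9, "mars": 0.0},
    {"lab": 0.9, "mars": 0.0},
    {"lab": 0.2, "mars": 0.8},
]
print(detect_transfers(history))
```

Any positive gain elsewhere counts as a transfer in this sketch; a more careful version would want a noise threshold on the gain as well.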
It’s interesting to note that this approach can tell the difference between an agent that creates a powerful subagent it still controls (in that case, changing the original agent’s utility is of high value) and one it doesn’t.
If we have a sufficiently well-defined notion of counterfactual, we could explore some other options, such as allowing the hypothetical resource-gathering agent to plant hypothetical “bombs” across the world. It is likely that each bomb would be placed to eliminate one of its rivals.
It seems a lot could also be done by checking what happens to “importance” under different counterfactual definitions (the subagent has emerged from natural processes, versus emerging from a thermodynamic miracle).
Hacking the definition
An AI that is under some restriction based on this definition of agency will no doubt try to hack the definition. How could it do that? One obvious idea is to run a cryptographic hash function over, e.g., the positions of all the atoms in the solar system, and base its behaviour on the output. In that case, every atom would be of great importance. However, this can be easily overcome if we notice that every atom is of great importance only because of the agent’s presence. In a counterfactual where the agent is removed, these atom positions are not of great importance intrinsically.
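The hash trick and its counter can be shown in a toy model. This is a sketch under heavy assumptions: "atoms" are four bits, the agent's action is the first byte of a SHA-256 digest of those bits, and the outcome is just that action. With the agent present, flipping any atom changes the outcome; in the counterfactual where the agent is removed, no atom matters.

```python
import hashlib
from itertools import product

def outcome(atoms, agent_present):
    """Toy future: the agent keys its action on a hash of all atom positions."""
    if not agent_present:
        return 0  # without the agent, atom positions change nothing
    digest = hashlib.sha256(bytes(atoms)).digest()
    return digest[0]  # action in 0..255

def atom_importance(i, n_atoms, agent_present):
    """Average absolute change in the outcome when atom i alone is flipped."""
    total = 0
    for atoms in product([0, 1], repeat=n_atoms):
        flipped = list(atoms)
        flipped[i] ^= 1
        total += abs(outcome(atoms, agent_present)
                     - outcome(tuple(flipped), agent_present))
    return total / 2 ** n_atoms

N = 4
with_agent = [atom_importance(i, N, True) for i in range(N)]
without_agent = [atom_importance(i, N, False) for i in range(N)]
print(with_agent, without_agent)
```

Comparing the two lists is the check described above: importance that vanishes when the agent is counterfactually removed is importance that belongs to the agent, not to the atoms.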
Maybe the agent will try and create lots of independent subagents, spreading importance over many locations? In that case, we should look for a strong centralised importance that becomes dispersed.
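A minimal sketch of "centralised importance that becomes dispersed", again assuming per-location importance scores are somehow available. The concentration measure (share held by the top location) and the `high` and `low` thresholds are arbitrary illustrative choices, not part of the definition.

```python
def concentration(importance):
    """Share of total importance held by the single most important location."""
    total = sum(importance.values())
    return max(importance.values()) / total if total else 0.0

def dispersal_steps(history, high=0.8, low=0.4):
    """Time steps at which importance goes from concentrated to dispersed."""
    flagged = []
    for t in range(1, len(history)):
        if concentration(history[t - 1]) >= high and concentration(history[t]) <= low:
            flagged.append(t)
    return flagged

# Made-up example: one dominant location splits into several comparable ones.
history = [
    {"lab": 9.0, "a": 0.5, "b": 0.5},
    {"lab": 1.0, "a": 3.0, "b": 3.0, "c": 3.0},
]
print(dispersal_steps(history))
```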
AIs are intrinsically disadvantaged when it comes to hacking this definition (in theory) because if they are going to have a large impact, then the moment of their creation/turning on/escape is a very informative one.
The definition seems to be underspecified rather than easily hackable, which is a good position to start from.
It should be noted that a good satisficer should never be detected as a powerful agent. This could be used as a definition of a satisficer, a point we’ll be returning to in subsequent posts.
I think your definition is really a definition of powerful things (which is of course extremely relevant!).
I’d had some incomplete thoughts in this direction. I’d taken a slightly different tack to you. I’ll paste the relevant notes in.
Interesting. Some of this may be relevant when I post the “models as definitions” post.
This is an interesting article, though necessarily abstract. How can we take this to implement an actual AI detector?
This could combine research from antivirus software, network intrusion detection and stock market agents, along with some sort of intelligence honeypot (i.e. some contrived event or piece of data that exists for a split second, which only an AI could detect and act on).
If this were a friendly AI, shouldn’t the utility function of the gathering agent take sharing (“don’t be greedy”) into account? Though as AI development advances, we can’t be sure that all governments / corporations / random researchers will put the necessary effort into making sure their intelligent agent is friendly.
This is indeed a very preliminary concept.
Friendly AIs need not be nice in a game-theoretic sense. They can (and likely would) be ruthless and calculating at achieving their goals; it’s just that their goals are good/safe/positive. This puts some constraints on means (e.g. the AI will likely not kill everyone just to get to its goals), but it’s not likely that “play nicer than you have to with other AIs” will be such a constraint.
This is definitely incidental:
Wouldn’t a superintelligent, resource-gathering agent simply figure out the futility of its prime directive and abort with some kind of error? Surely it would realize it exists in a universe of limited resources and that it had been given an absurd objective. I mean, maybe it’s controlled by some sort of “while resources exist consume resources” loop that is beyond its free will to break out of, but if so, should it be considered an “agent”?
Contra humans, who for the moment are electing to consume themselves to extinction, if anything resource consumer AIs would be comparatively benign.
“Futility of prime directive” is a values question, not an intelligence question. See http://lesswrong.com/lw/h0k/arguing_orthogonality_published_form/