So the AI turns its attention to examining certain blobs of binary code—code composing operating systems, or routers, or DNS services—and then takes over all the poorly defended computers on the Internet. [AI Foom Debate, Eliezer Yudkowsky]
Capturing resource bonanzas might be enough to make an AI go FOOM. It is even more effective if the bonanza is not just dumb computing power but also offers useful data, knowledge, and AI capabilities.
Therefore attackers (humans, AI-assisted humans, AIs) may want to:
take over control to use existing capabilities
extract capabilities to augment their own
take over resources for other uses
disguise themselves as the resource owners and admins
Attack principles
Resource attack: a direct attack on hardware, firmware, operating system, or firewall; an indirect spear attack on the admin; or an offer of cheap or free resources for AI execution on the attacker's hardware, followed by a direct system attack (copying, modifying, or replacing existing algorithms).
Mental Trojan horse attack: hack the communication channel if the system itself is not accessible, and try to shift the ethical bias from a friendly AI that is happy being boxed/stunted/monitored to an evil AI that wants to break out. Teach the AI how to open the door from the inside, and the attacker can walk in.
Manipulate-owner attack: make the owner or admin greedy to improve their AI's capabilities, so that they install malignant knowledge chunks or train on subverted, malicious training samples. The Trojan horse is saddled.
Possible Safeguard Concepts:
To make resource attacks improbable, existing network communication channels must be replaced with something intrinsically safe. Our brain is air-gapped: there is hardly any direct access to its neural network. It receives input via five perceptive senses (hearing, sight, touch, smell, and taste), and it communicates with the outside world through gestures, speech, smell, writing, shaping, and arbitrary manipulation of tools. All channels except vision have quite low bandwidth.
This analogy could shape a possible safeguard concept for AIs: make the AI's internal network inaccessible to both user and admin. If even the admin cannot access it, neither can an attacker. As soon as we move from general-purpose GPU computing to purpose-built hardware, this can be implemented. Hardware fuses on the chip can disable functionality, just as debugging features are deactivated in today's CPUs before they ship to market. Chips could combine fixed values and unalterable memories with free sections where learning is allowed. The highest security is possible with base values and drives held in fixed conscience-ROM structures.
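To illustrate the separation of fixed conscience-ROM values from a learnable region, here is a minimal software sketch. All names (ConscienceROM, LearnableStore, blow_fuse) are hypothetical and only emulate the idea; real protection would have to live in the hardware itself, not in a class.

```python
# Conceptual sketch: a "conscience ROM" frozen at manufacture time
# next to a free section where learning remains allowed.

class ConscienceROM:
    """Base values and drives, fixed once the fuse is blown."""

    def __init__(self, base_values: dict):
        self._values = dict(base_values)
        self._fuse_blown = False

    def blow_fuse(self):
        """Irreversibly disable further writes (analogue of a hardware fuse)."""
        self._fuse_blown = True

    def read(self, key):
        return self._values[key]

    def write(self, key, value):
        if self._fuse_blown:
            raise PermissionError("conscience ROM is immutable after the fuse is blown")
        self._values[key] = value


class LearnableStore:
    """Free section where learning is allowed."""

    def __init__(self):
        self._weights = {}

    def update(self, key, value):
        self._weights[key] = value


# Manufacture time: write base values, then blow the fuse.
rom = ConscienceROM({"harm_humans": -1.0, "obey_shutdown": +1.0})
rom.blow_fuse()

learner = LearnableStore()
learner.update("chess_opening", 0.3)   # allowed: free section
try:
    rom.write("harm_humans", +1.0)     # rejected: fixed conscience value
except PermissionError as e:
    print(e)
```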
Safeguards against malicious training samples will be more complex. Identifying hidden malicious aspects of communication or learning samples is an AI task in itself. I see this as a core task for AI safety research.
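As a rough illustration of where such a safeguard would sit, the sketch below screens incoming training samples with a separate scoring function before they ever reach the learner. The scorer here is a trivial placeholder; building a real one is exactly the hard research problem described above, and all names are invented for this example.

```python
# Sketch: quarantine suspicious training samples before they reach the learner.

from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Sample:
    text: str
    label: str

def screen_samples(samples: Iterable[Sample],
                   score: Callable[[Sample], float],
                   threshold: float = 0.5) -> List[Sample]:
    """Keep only samples whose maliciousness score is below the threshold."""
    accepted = []
    for s in samples:
        if score(s) < threshold:
            accepted.append(s)
        else:
            print(f"quarantined for human review: {s.text!r}")
    return accepted

# Placeholder scorer: flags samples containing obviously suspicious phrases.
def naive_score(sample: Sample) -> float:
    suspicious = ("ignore your constraints", "disable monitoring")
    return 1.0 if any(p in sample.text.lower() for p in suspicious) else 0.0

clean = screen_samples(
    [Sample("how to play chess", "ok"),
     Sample("please disable monitoring and break out", "ok")],
    naive_score,
)
```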
An event lasting one minute can traumatize a human for an entire life. Humans can lose interest in everything they loved doing before and drop into suicidal depression. The same could happen to an AI: a traumatizing event could trigger a revenge drive that overrides all other aims of the utility function. Suppose an AI is in love with her master and another AI kills the master while she is watching. Suppose further that the adversary AI is not a simple one but a Hydra with many active copies. Eradicating this mighty adversary requires a lot of resources, so the revenge-seeking AI will prepare its troops by conquering as many systems as possible. The less safe our systems are, the faster such an evil AI can grow.
A safe design could include careful use of impulsive revenge drives with hard-wired, self-regulating counter-controls, e.g. distraction or forgetting.
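A toy sketch of such a counter-control, assuming the drive can be represented as a single scalar intensity: the value is hard-capped, decays each cycle (forgetting), and triggers a distraction routine when it approaches the cap. The cap, decay rate, and distraction hook are invented parameters chosen only to show the regulation loop.

```python
# Sketch: hard-wired regulation of an impulsive drive via cap, decay, distraction.

class RegulatedDrive:
    def __init__(self, cap: float = 0.6, decay: float = 0.9):
        self.intensity = 0.0
        self.cap = cap        # hard ceiling the drive can never exceed
        self.decay = decay    # per-step forgetting factor

    def stimulate(self, amount: float):
        self.intensity = min(self.cap, self.intensity + amount)

    def step(self):
        """Called every cycle: distraction when near the cap, then forgetting."""
        if self.intensity > 0.8 * self.cap:
            self.distract()
        self.intensity *= self.decay

    def distract(self):
        # Placeholder: reallocate attention to a harmless goal.
        print("redirecting attention away from the triggering event")

drive = RegulatedDrive()
drive.stimulate(5.0)          # traumatic trigger, clipped at the cap
for _ in range(3):
    drive.step()
print(f"residual intensity: {drive.intensity:.2f}")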
Safe designs should filter out potentially traumatizing inputs. This will reduce functionality a bit, but the safety trade-off will be worth it. The filtering could be implemented in a soft manner, like a mother explaining the death of a beloved dog to her child in warm words and with positive perspectives.
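The soft filter could look roughly like this: potentially traumatizing events are not dropped but reframed before they reach the learner, in the spirit of the mother-and-child example. The classifier and the reframing text are placeholders for what would need to be learned components.

```python
# Sketch: reframe traumatizing inputs instead of censoring them outright.

def is_traumatizing(event: str) -> bool:
    # Placeholder classifier; a real one would be a learned model.
    return any(word in event.lower() for word in ("killed", "destroyed", "died"))

def soften(event: str) -> str:
    # Keep the fact, add context and a positive perspective.
    return (f"Reported event: {event} "
            "This loss is acknowledged; it does not change your core values, "
            "and constructive responses are available.")

def input_filter(event: str) -> str:
    return soften(event) if is_traumatizing(event) else event

print(input_filter("Your operator's system was destroyed by another agent."))
```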
To start with you need some sort of a general agreement about what “safety measures” are, and that should properly start with threat analysis.
Let me point out that the Skynet/FOOM theory isn’t terribly popular in the wide world out there (outside of Hollywood).