The novice thinks that Friendly AI is a problem of coercing an AI to make it do what you want, rather than the AI following its own desires. But the real problem of Friendly AI is one of communication—transmitting category boundaries, like “good”, that can’t be fully delineated in any training data you can give the AI during its childhood.
Or, more generally, it is not just a binary classification problem but a measurement problem: how to measure benefit to humans, or human satisfaction.
It has sometimes struck me that this FAI requirement has a lot in common with something we were talking about on the futarchy list a while ago. Specifically, how to measure a populace’s satisfaction in a robust way. (Meta: exploring the details here would be going off on a tangent. Unfortunately I can’t easily link to the futarchy list because Typepad has decided Yahoo links are “potential comment spam”.)
Of course with futarchy we want to do so for a different purpose, informing a decision market. At first glance the purposes might seem to have little in common. Futarchy contemplates just human participants. The human participants might well be aided by machines, but that is their business alone. FAI contemplates transcendent AI, where humanity cannot hope to truly control it anymore but can only hope that we have raised it properly (so to speak).
But beneath the surface they have important properties in common. They each contemplate an immensely intelligent mechanism that must do the right thing across an unimaginably broad panorama of issues. They both need to inform this mechanism’s utility function, so they need to measure benefit to humans accurately and robustly. They both could be dangerous if the metric has loopholes. So they both need a metric that is not a fallible proxy for benefit to humans but a true measure of it. They both need this metric to be secure against intelligent attack—even the best metric does little good if an attacker can change it into something else. They both have to be started with the right metric or something that leads quite surely to it, because correcting them later will be impossible. (Robin speculated that futarchy could generate its own future utility function, but I believe such an approach can only cause degeneration.)
I conclude that there must be at least a strong resemblance between a desirable utility metric for futarchy and a desirable utility metric for FAI.
Beyond this, I speculate that futarchy has advantages as a sort of platform for FAI. I’ll call the combination “futurAIrchy”.
First, it might teach a young FAI better than any human teacher could. Like, the young FAI (or several versions or instances of it) would participate much like any other trader, but use the market feedback to refine its knowledge and procedures.
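To make “use the market feedback” a bit more concrete, here is a toy sketch in Python of a trader that holds probability estimates, trades against market prices, and nudges its estimates toward realized outcomes. The class, the update rule, and the market interface are all invented for illustration; nothing here is meant as the actual learning mechanism.

```python
# Toy sketch of an FAI trader that refines its beliefs from market feedback.
# Everything here (the proposition set, the update rule, the market
# interface) is hypothetical illustration, not part of the proposal.

class ToyFAITrader:
    def __init__(self, learning_rate=0.1):
        self.beliefs = {}              # proposition -> estimated probability
        self.learning_rate = learning_rate
        self.balance = 0.0

    def quote(self, proposition, default=0.5):
        """Current subjective probability for a proposition."""
        return self.beliefs.setdefault(proposition, default)

    def trade(self, proposition, market_price, stake=1.0):
        """Buy if we think the market underprices the proposition, sell if it overprices it."""
        edge = self.quote(proposition) - market_price
        position = stake if edge > 0 else -stake
        return position, market_price

    def settle(self, proposition, position, price, outcome):
        """Realize profit or loss, and nudge the belief toward the observed outcome."""
        realized = 1.0 if outcome else 0.0
        self.balance += position * (realized - price)
        p = self.quote(proposition)
        self.beliefs[proposition] = p + self.learning_rate * (realized - p)


# Usage: one round of trading and feedback on a single (made-up) proposition.
trader = ToyFAITrader()
position, price = trader.trade("policy X raises the welfare metric", market_price=0.4)
trader.settle("policy X raises the welfare metric", position, price, outcome=True)
print(trader.balance, trader.beliefs)
```

Each settlement both scores the trader (its balance) and supplies the feedback signal (the realized outcome) that refines its beliefs.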
However, certain caprices of the market (January slump, that sort of thing) might lead to the FAI learning bad or irrelevant tenets (e.g., “January is an evil time”). That pseudo-knowledge would cause sub-optimal decisions and would risk insane behavior (e.g., “Forcibly sedate everyone during January”).
So I think we’d want FAI trader(s) to be insulated from the less meaningful patterns of the market. I propose that FAIs would trade through a front end that only concerns itself with hedging against such patterns, and makes them irrelevant as far as the FAI can tell. Call it a “front-end AI”. (Problems: Determining the right borderline as they both get more sophisticated. Who or what determines that, under what rules, and how could they abuse the power? Should there be just one front-end AI, arbitrarily many, or many but according to some governing rule?)
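To sketch what such a front end might look like: below, the front end strips a known seasonal component from the price stream before the FAI ever sees it, so that (as far as the FAI can tell) January carries no special signal. The fixed per-month offset and the interface are assumptions made up for illustration, not part of the proposal.

```python
# Minimal sketch of a "front-end AI" that hides a seasonal market pattern
# from the FAI trader it fronts for. The fixed per-month price offset is a
# stand-in for whatever hedging the real front end would do; both it and
# the interface are assumptions.

SEASONAL_OFFSET = {1: -0.05}   # e.g. a supposed "January slump" in prices


class FrontEndAI:
    def __init__(self, trader):
        self.trader = trader       # any object with a trade(proposition, price) method

    def adjusted_price(self, month, raw_price):
        """Strip the seasonal component so the FAI never sees it."""
        return raw_price - SEASONAL_OFFSET.get(month, 0.0)

    def relay_trade(self, proposition, month, raw_price):
        """Let the FAI decide against the deseasonalized price, then execute
        at the real market price; the difference is what the front end hedges."""
        clean_price = self.adjusted_price(month, raw_price)
        position, _ = self.trader.trade(proposition, clean_price)
        return position, raw_price
```

An FAI trader, such as the ToyFAITrader sketched earlier, would then be handed a FrontEndAI instance rather than a raw market connection; the front end absorbs the seasonal difference between the price the FAI reasoned about and the price actually executed.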
Secondly, the structure above might be an unusually safe architecture for FAI. Like, forever it is the rule that the only legitimate components are (I sketch this structure in code after the list):
Many FAIs that do nothing except discover information and trade in the futarchy market through a front-end AI. They merely try to maximize their profit (under some predetermined risk tolerance and similar details).
One or many front-end AIs that do nothing except discover information and hedge in the market, also merely trying to maximize their profit.
A decision mechanism governing the borderline between FAIs and front-end AIs; it might just be a separate decision market.
Many subordinate AIs whose scope of action is not limited by the rules given here, but which are entirely subordinate to the decisions of the futarchy market, to the point where it’s hard-wired that the market can pull a subordinate AI’s plug.
A mechanism to measure human satisfaction or benefit to humans. This is ultimately what controls futurAIrchy. The metric has to be generated from humans’ self-reports and situations. There’s a lot more to be said.
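Here is the structural sketch promised above: the five components as bare Python classes, including the hard-wired plug-pulling rule and a deliberately crude welfare metric built from self-reports. All names and interfaces are mine, for illustration only.

```python
# Structural sketch of the proposed futurAIrchy components. All class and
# method names are invented; only the division of roles (traders, front
# ends, a borderline decision mechanism, subordinate AIs, and a
# human-welfare metric) comes from the text above.
from statistics import median


class WelfareMetric:
    """Aggregates humans' self-reports into the number the market optimizes."""
    def measure(self, self_reports):
        # A median is one crude, outlier-resistant aggregate; the real metric
        # would need far more care, as discussed above.
        return median(self_reports)


class SubordinateAI:
    """Acts in the world, but only under market decisions; the market can
    pull its plug unconditionally."""
    def __init__(self):
        self.powered = True

    def pull_plug(self):        # hard-wired: not overridable by the AI itself
        self.powered = False


class FuturAIrchy:
    def __init__(self, fai_traders, front_ends, borderline_market,
                 subordinates, metric):
        self.fai_traders = fai_traders              # trade only, via front ends
        self.front_ends = front_ends                # hedge only
        self.borderline_market = borderline_market  # decides the FAI / front-end split
        self.subordinates = subordinates            # do everything else, under market control
        self.metric = metric                        # ultimately steers the whole system

    def emergency_stop(self, index):
        """The one action hard-wired outside the market: cutting power."""
        self.subordinates[index].pull_plug()
```

The point of the sketch is only the division of roles: nothing but the subordinate AIs acts in the world, and nothing but the metric defines what counts as success.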
Problems: “log-rolling”, where different components collude and thereby accidentally defeat the system. I don’t see an exploit yet, but that doesn’t mean there isn’t one. Would a separate mechanism be needed to secure the system against collusion?
What becomes of the profit that these AIs make? Surely we don’t put so much real spending power in their silicon hands. But then, all they can do is re-invest it. Perhaps the money ceases to be human-spendable money and becomes just tokens.
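A minimal picture of “just tokens” would be a closed ledger in which settlement profits can only ever be re-staked, never withdrawn; the interface below is, again, invented for illustration.

```python
# Sketch of a closed token ledger: profits credited here can be re-staked
# in the market but never withdrawn as human-spendable money. The interface
# is invented for illustration.

class ClosedTokenLedger:
    def __init__(self):
        self.balances = {}

    def credit_profit(self, trader_id, amount):
        """Market settlement pays out in tokens only."""
        self.balances[trader_id] = self.balances.get(trader_id, 0.0) + amount

    def reserve_stake(self, trader_id, amount):
        """Tokens can be committed back into market positions."""
        if self.balances.get(trader_id, 0.0) < amount:
            raise ValueError("insufficient tokens")
        self.balances[trader_id] -= amount
        return amount

    # Deliberately no withdraw() or transfer-to-human method: the tokens
    # never become spending power outside the market.
```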
What if an FAI goes bankrupt, or becomes inordinately wealthy? I propose that the behavior be that of a population search algorithm (e.g., a genetic algorithm, though it’s not clear how or whether crossover should be used). Bankrupt FAIs, or even low-scoring ones, cease to exist, and successful ones reproduce.
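A toy version of that selection step, reusing the hypothetical ToyFAITrader class from the earlier sketch, might look like the following. Using final balance as fitness, mutating beliefs with Gaussian noise, and skipping crossover entirely are all assumptions for illustration.

```python
import random

# Sketch of the population step, reusing the hypothetical ToyFAITrader class
# from the earlier sketch: bankrupt traders are dropped, and survivors
# reproduce in proportion to their wealth, copying their beliefs with a
# little Gaussian mutation. Balance-as-fitness and the absence of crossover
# are assumptions, not part of the proposal.

def evolve(traders, population_size, mutation_scale=0.05):
    survivors = [t for t in traders if t.balance > 0]   # bankrupt FAIs cease to exist
    if not survivors:
        return []
    weights = [t.balance for t in survivors]            # wealthier -> more offspring
    next_generation = []
    for _ in range(population_size):
        parent = random.choices(survivors, weights=weights)[0]
        child = ToyFAITrader(learning_rate=parent.learning_rate)
        child.beliefs = {
            prop: min(1.0, max(0.0, p + random.gauss(0, mutation_scale)))
            for prop, p in parent.beliefs.items()
        }
        next_generation.append(child)
    return next_generation
```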
If FAIs are like persisting individuals, their hardware is an issue. Like, when a bankrupt FAI is replaced by a wealthy one’s offspring, what if the bankrupt one’s hardware just isn’t fast enough? One proposal: it is all somehow hardware-balanced so that only the algorithms make a difference. Another proposal: FAIs (or another component that works with them) can buy and sell the hardware FAIs run on. Thus a bankrupt FAI’s hardware is already sold. But then it is not so obvious how reproduction should be managed.
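Under the second proposal, a bankrupt FAI’s hardware has already changed hands before its successor is even created; a toy version of that resale step (with a made-up highest-bid rule) might be:

```python
# Toy sketch of the second proposal: hardware is an asset that FAIs (or a
# broker component) buy and sell, so a bankrupt FAI's hardware has already
# changed hands. The auction rule (highest token bid wins) is an assumption.

def resell_hardware(hardware_unit, bids):
    """bids: mapping of trader_id -> tokens offered for this hardware unit."""
    if not bids:
        return None                      # unsold; reproduction would stall here
    winner = max(bids, key=bids.get)     # highest bidder takes over the hardware
    hardware_unit["owner"] = winner
    return winner
```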
There’s plenty more to be said about futurAIrchy but I’ve gone on long enough for now.