The problem appears to be that no one has a clue how to work on FAI and make sure that it actually is FAI. If someone made what they thought was FAI and it wasn’t actually FAI, how could you tell until it was too late?
If someone made what they thought was FAI and it wasn’t actually FAI, how could you tell until it was too late?
How can you tell that a theorem is correct? By having a convincing proof, or perhaps other arguments to the effect of the theorem being correct, and being trained to accept correct proofs (arguments) and not incorrect ones.
Even though we currently have no idea how to construct an FAI, or what an argument that a particular design is an FAI might look like, we can still refuse to believe that something is an FAI when we are not given sufficient grounds for believing it is one, provided the people who make such decisions are skilled enough to accept correct arguments but not incorrect ones.
The rule to follow is to not actually build or run things that you don’t know to be good. By default, they don’t even work, and if they do, they are not good, because value is complex and won’t be captured by chance. There need to be strong grounds for believing that a particular design will work, and here rationality training is essential, because by default people follow one of the many kinds of crazy reasoning about these things.
I read a science fiction story in which they built a self-sustaining space station, placed on it a colony of the scientists and engineers needed to run it, and then sealed it off with no connection to the outside world. Then they modified all the computer files to make it appear as though the humans had evolved on the station and there was nothing else but the station. Then they woke up the AI and started stress-testing it by attacking it in non-harmful ways.
It was an interesting story, though I’m not sure how useful it would be in real life. The AI actually manages to figure out that there is stuff outside the station, and they are only saved because it creates its own moral code in which killing is wrong. This was a very convenient plot point, so I wouldn’t trust it in real life.
How do you know that a person is really friendly? You use methods that have worked in the past and look for the manipulative techniques that deceptively friendly people use to make you think they are friendly. We know that someone is friendly by the same methodology by which we determine what it means to be friendly: subjective benefits (emotional support etc.) and goal assistance (helping you move when they could simply refuse to do so), without malicious motives that ultimately do you a disservice.
In the case of FAI we want more certainty, and we can presumably get this via simulation and proofs of correctness. I would assume that even after we had a proof of correctness for a meta-ethical system, we would want to run it through as many virtual scenarios as possible, since the human brain is simply not capable of following the chains of reasoning within the meta-ethics that the machine would be. So we would want to introduce it to scenarios that are as complex as possible in order to determine that it fits our intuition of friendliness.
It seems to me that the bulk of the work is in the arena of identifying the most friendly meta-ethical architecture. The Lokhorst paper lukeprog posted a while ago clarified a few things for me, though I have no access to the cutting edge work on FAI (save for what leaks out into the blog posts), and judging by what Will Newsome has said in the past (cannot find the post) they have compiled a relatively large list of possibly relevant sub-problems that I would be very interested to see (even if many of them are likely to be time drains).
See! You’ve found a problem to work on already! :)
[The downvote on your comment isn’t mine btw.]