[SEQ RERUN] Magical Categories
Today’s post, Magical Categories, was originally published on 24 August 2008. A summary (taken from the LW wiki):
We underestimate the complexity of our own unnatural categories. This doesn’t work when you’re trying to build a FAI.
Discuss the post here (rather than in the comments to the original post).
This post is part of the Rerunning the Sequences series, where we’ll be going through Eliezer Yudkowsky’s old posts in order so that people who are interested can (re-)read and discuss them. The previous post was Unnatural Categories, and you can use the sequence_reruns tag or rss feed to follow the rest of the series.
Sequence reruns are a community-driven effort. You can participate by re-reading the sequence post, discussing it here, posting the next day’s sequence reruns post, or summarizing forthcoming articles on the wiki. Go here for more details, or to have meta discussions about the Rerunning the Sequences series.
Rereading this part of the Sequences makes me wonder if an AI could make use of a kind of reCAPTCHA approach for sussing out some of these Magical Categories. It certainly would slow up the AI a lot, but could generate a lot of examples and classifications.
I doubt this would be a very efficient solution, but now I’m pretty amused by the prospect of trying to post a blog comment and getting a normal CAPTCHA plus something like this:
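Something in the spirit of the toy sketch below, I mean. Everything here (the class, the example prompts, the five-vote cutoff) is invented for illustration, not a real system: the AI queues up the examples it is least sure about, and solving the CAPTCHA means voting on a couple of them.

```python
import random

class CategoryLabeler:
    def __init__(self, candidate_examples):
        # Examples the AI is least sure about for some category, e.g. "happiness".
        self.unlabeled = list(candidate_examples)
        self.votes = {}  # example -> list of human yes/no votes

    def next_captcha(self, n=2):
        """Pick a few uncertain examples to tack onto a normal CAPTCHA."""
        return random.sample(self.unlabeled, min(n, len(self.unlabeled)))

    def record_vote(self, example, is_in_category):
        """Store one human's answer; retire the example once enough votes agree unanimously."""
        self.votes.setdefault(example, []).append(bool(is_in_category))
        votes = self.votes[example]
        if len(votes) >= 5 and len(set(votes)) == 1 and example in self.unlabeled:
            self.unlabeled.remove(example)

labeler = CategoryLabeler([
    "a wireheaded person, smiling",
    "someone grieving at a funeral they chose to attend",
    "a dog wagging its tail",
])
print(labeler.next_captcha())  # e.g. two of the above, shown alongside the usual CAPTCHA
```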
The amount of time it would take to get a reasonable dataset would likely exceed the projected lifespan of the universe, I imagine.
Why assume that an AGI will be smart enough to drug the entire population with soma before it is smart enough to recognize that soma-happiness isn’t the same thing as happiness? It’s a clear argument for never hard-coding into a general AI values that cannot be changed once we figure out what is wrong with them.
The problem is that we need to be careful about assuming that more intelligence automatically draws the distinction between soma-happiness and real happiness. We’re pretty sure we know what the right answer is, so we assume anything at least as smart as we are will draw the same line.
We don’t need to wait for AI for counterexamples; humans already spend a lot of time arguing about what counts as soma-happiness and what counts as real happiness. Some contested classifications, off the top of my head: watching football, using ecstasy, no-strings-attached sex, reading Fifty Shades of Grey, etc.
I can think of one common division: intellectual pleasures are better than physical pleasures, but I’m not confident that’s a good rule, period, let alone that the distinction is clear enough to train an AI on, and I’m definitely not confident that it’s a safe goal. (For Dollhouse fans, I could imagine this resulting in a slightly nicer version of the Attic).
So just turbocharging a brain isn’t likely to give us a clear algorithm to distinguish between higher and lower pleasures. In this case, you’d have a heckuva time getting the experimenters to agree on classifications for the training set.
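To put a number on that disagreement problem, here is a toy calculation with invented raters and labels, just to show that if the labelers can’t agree, there is no consistent training signal to hand the AI in the first place:

```python
from itertools import combinations

ratings = {
    # activity: labels from three hypothetical raters (True = "real" happiness)
    "watching football":            [True,  False, True],
    "using ecstasy":                [False, False, True],
    "no-strings-attached sex":      [True,  False, False],
    "reading Fifty Shades of Grey": [False, True,  True],
}

def pairwise_agreement(ratings):
    """Fraction of rater pairs that agree, averaged over items."""
    per_item = []
    for labels in ratings.values():
        pairs = list(combinations(labels, 2))
        per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)

print(f"average pairwise agreement: {pairwise_agreement(ratings):.2f}")
# Anything far below 1.0 means the "training set" is already inconsistent
# before the AI ever sees it.
```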
But the Sequence post linked here goes beyond the soma-happiness example, where humans have trouble classifying consistently. Even if we’re really confident in our distinction (say, cats vs. rocks), it’s hard to be sure that’s the only way to partition the world. Ours is the most useful partition for the way humans interact with cats and rocks, but a different agent, shown examples of both, might learn a less salient-for-humans distinction (warm things vs. cool things) and be unable to correctly classify a dead cat or a warm hearthstone.
You can say that we can fix this by showing the AI dead cats and hot rocks in the training set, but that’s only a patch for the one problem I just raised. The trouble is that, as humans, we don’t think about radiant heat when we distinguish cats from rocks, so we don’t flag it as a possible error mode. The danger is that we’re going to be really sucky at coming up with possible errors in advance, and then train AIs that look like they’re making correct classifications but fail when we let them out of the box.
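The failure is easy to reproduce in miniature. A toy sketch (the features, temperatures, and single-threshold rule are all invented): a learner trained only on live cats and cool rocks gets perfect training accuracy using temperature alone, then misfiles exactly the cases nobody thought to flag.

```python
def train_threshold_classifier(examples):
    """Learn a one-feature threshold rule: warmer than the cutoff => cat, else rock."""
    cat_temps = [e["temp_c"] for e in examples if e["label"] == "cat"]
    rock_temps = [e["temp_c"] for e in examples if e["label"] == "rock"]
    threshold = (min(cat_temps) + max(rock_temps)) / 2
    return lambda e: "cat" if e["temp_c"] > threshold else "rock"

training_set = [
    {"name": "tabby cat",    "temp_c": 38, "label": "cat"},
    {"name": "siamese cat",  "temp_c": 39, "label": "cat"},
    {"name": "granite rock", "temp_c": 15, "label": "rock"},
    {"name": "river stone",  "temp_c": 12, "label": "rock"},
]

classify = train_threshold_classifier(training_set)

# Cases nobody flagged, because humans don't think about radiant heat here:
for item in [{"name": "dead cat", "temp_c": 15},
             {"name": "warm hearthstone", "temp_c": 40}]:
    print(item["name"], "->", classify(item))
# dead cat -> rock, warm hearthstone -> cat
```

The one-feature rule is deliberately dumb; the point is only that nothing in the training data penalizes the wrong partition.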
What’s confusing here is the habit of putting intelligence on a single scale that preserves relative rankings. What you are really fearing is something that is intelligent in a different way from us, with hard-coded values that only appear similar to ours until they start being realized on a large scale: a genie.
What I am saying is that long before we create a genie, we need to create a lesser AI that is capable of figuring out what we are wishing for.
If we’re going to hard-code any behavior at all, we need to hard-code honesty. That way we can at least ask questions and be sure that we are getting the true answer, rather than the answer which is calculated to convince us to let the AI ‘out of the box’.
In any case, the first goal for a suitably powerful AI should be “Communicate to humans how to create the AI they want.”