I definitely put substantial probability on AIs using a bunch of alien (i.e. non-interoperable or hard-to-interoperate) concepts. And in worlds where that’s true, I largely agree that dealing with those alien concepts is the most important (i.e. hardest/rate-limiting) part of the technical problems of AI safety.
That said:
I have substantial probability that AIs basically don’t use a bunch of non-interoperable concepts (or converge to more interoperable concepts as capabilities grow, or …). In those worlds, I expect that “how to understand human concepts” is the rate-limiting part of the problem.
Even in worlds where AIs do use lots of alien concepts, it feels like understanding human concepts is “earlier on the tech tree” than figuring out what to do with those alien concepts. Like, it is a hell of a lot easier to understand those alien concepts by first understanding human concepts and then building on that understanding, than by trying to jump straight to alien concepts.
What would constitute “understanding human concepts” in the relevant sense?
In another comment, I suggested that human concepts can be represented in human language. This might miss out on some important human mental content, but it would not miss out on anything that the magic box spits out, since the magic box is specifically dealing with language.
This trivializes the magic box; it becomes the identity function, or at best, a paraphrasing function. But what, exactly, is wrong with such a trivial understanding of the magic box? Where does it fall short of the sort of understanding you seek to achieve?
It frames things in terms of events (each event labeled with a natural-language sentence) rather than the random variables you want, but I can trivially reframe it in terms of random variables by treating each sentence’s truth value as 0/1 instead of true/false.
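To spell out that reframing, here is a minimal sketch (my own toy illustration, not anything from the original exchange; the example sentences, “worlds”, and truth assignments are all made up): each sentence becomes an indicator random variable over possible worlds.

```python
# Toy sketch: natural-language sentences recast as {0,1}-valued random
# variables. The sentences and "worlds" below are illustrative assumptions.

worlds = [
    {"the sky is blue": True,  "there is a cow in the field": True},
    {"the sky is blue": True,  "there is a cow in the field": False},
    {"the sky is blue": False, "there is a cow in the field": True},
]

def indicator(sentence):
    """Turn a sentence into a random variable: a function from a world to 0 or 1."""
    return lambda world: 1 if world[sentence] else 0

X_cow = indicator("there is a cow in the field")
print([X_cow(w) for w in worlds])  # -> [1, 0, 1]
```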
Yes, I intuitively feel that this is a dumb trivial proposal that contributes nothing to our understanding of concepts. But, I quote:
At this point, we’re not even necessarily looking for “the right” class of random variables, just any class which satisfies the above criteria and seems approximately plausible.
One example: you know that thing where I point at a cow and say “cow”, and then the toddler next to me points at another cow and is like “cow?”, and I nod and smile? That’s the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying “cow”? (Note that same question still applies if they take a few tries, or have heard me use the word a few times.)
The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we’re interested in questions like “what are those structures?”, and interoperability helps narrow down the possibilities for what those structures could be.
… and I don’t think I’ve yet fully articulated the general version of the problem here, but the cow example is at least one case where “just take the magic box to be the identity function” fails to answer our question.
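For concreteness, here is a minimal sketch of the “unsupervised structure learning, then one-shot label attachment” picture from the previous two paragraphs. Everything in it is an illustrative assumption rather than a claim about the post’s actual proposal: the 2-D feature vectors standing in for the toddler’s percepts, the choice of k-means as the structure learner, and the number of clusters.

```python
# Toy sketch: learn candidate structures with zero labels, then attach the word
# "cow" to one structure from a single labeled example. All details (features,
# k-means, cluster count) are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Unlabeled "percepts": feature vectors drawn from three latent kinds of object.
percepts = np.vstack([
    rng.normal(loc=[0, 0], scale=0.3, size=(50, 2)),   # latent kind A
    rng.normal(loc=[5, 5], scale=0.3, size=(50, 2)),   # latent kind B
    rng.normal(loc=[0, 5], scale=0.3, size=(50, 2)),   # latent kind C
])

# Step 1: unsupervised structure learning -- carve the percepts into candidate
# structures (clusters) before hearing any words at all.
structures = KMeans(n_clusters=3, n_init=10, random_state=0).fit(percepts)

# Step 2: one labeled example -- the adult points at one object and says "cow".
pointed_at = np.array([[5.1, 4.9]])
cow_structure = structures.predict(pointed_at)[0]
word_for_structure = {cow_structure: "cow"}

# Step 3: a new object comes along; check which structure it falls into.
new_object = np.array([[4.8, 5.2]])
guess = word_for_structure.get(structures.predict(new_object)[0], "???")
print(guess)  # -> "cow"
```

The point of the sketch is just the shape of the pipeline: nearly all the work happens in step 1, with zero labels, which is why a single utterance of “cow” can suffice in step 2, and why the interesting question is what the learned structures actually are.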