The second central problem of Interoperable Semantics is to account for Alice and Bob’s agreement. In the Bayesian frame, this means that we should be able to establish some kind of (approximate) equivalence between at least some of the variables in the two agents’ world models, and the outputs of the magic semantics box should only involve those variables for which we can establish equivalence.
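To make the "approximate equivalence between variables" idea concrete, here is a minimal hypothetical sketch (not from the original post; all names and the toy detectors are made up): one crude operationalization is to check how often two agents' variables agree across a shared batch of situations.

```python
# Hypothetical sketch: "approximate equivalence" between a variable in
# Alice's world model and a variable in Bob's, operationalized as
# agreement frequency over shared situations. Purely illustrative.

def approx_equivalent(var_a, var_b, situations, tol=0.95):
    """True if the two variables agree on at least `tol` of the situations."""
    agree = sum(var_a(s) == var_b(s) for s in situations)
    return agree / len(situations) >= tol

# Toy variables: Alice and Bob each have their own "is it a cow?" detector,
# defined over simple dict-valued situations.
alice_cow = lambda s: s["legs"] == 4 and s["says_moo"]
bob_cow = lambda s: s["says_moo"]

situations = [
    {"legs": 4, "says_moo": True},
    {"legs": 4, "says_moo": False},
    {"legs": 2, "says_moo": False},
]
print(approx_equivalent(alice_cow, bob_cow, situations, tol=0.9))  # True
```

On this toy operationalization, the magic semantics box would be restricted to variables passing such a check; anything failing it is "hard to translate" in the sense discussed below.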
To me, this seems like a strange way to go about it, if your hope is to address AI safety concerns. If Alice is trying to understand Bob, and Alice sees that Bob uses a weird blob of incomprehensible gibberish as a key step in his reasoning, then Alice should think she has failed, rather than thinking she should ignore that part.
In some sense, agents come equipped with a 1st-person perspective (a set of cognitive tools which is useful for predicting their own sense-data and managing their own actions), and the challenge we face is one of translating that 1st-person perspective to a 3rd-person perspective (an interoperable language which can readily be translated into many different 1st-person perspectives, i.e., understood by many different agents).
That particular paragraph was intended to be about two humans. The application to AI safety is less direct than “take Alice to be a human, and Bob to be an AI” or something like that.
That makes sense. But, effectively, you are deferring the question of how it relates to AI safety. If I have my intuition (roughly, that the most important part of the problem is how to understand alien concepts which AIs might have) and you have your intuition (roughly, that the most important part of the problem is how to understand human concepts), then presumably we can try to articulate some reasons for those intuitions.
I’ve said something about why I think it seems important not to give up on mental content that seems hard to translate. Perhaps you could say a bit more about why you are interested in a thingy that only looks for easily translatable content and ignores hard-to-translate content?
I definitely have substantial probability on the possibility that AIs will use a bunch of alien (i.e. non-interoperable or hard-to-interoperate) concepts. And in worlds where that’s true, I largely agree that those are the most important (i.e. hardest/rate-limiting) part of the technical problems of AI safety.
That said:
I have substantial probability that AIs basically don’t use a bunch of non-interoperable concepts (or converge to more interoperable concepts as capabilities grow, or …). In those worlds, I expect that “how to understand human concepts” is the rate-limiting part of the problem.
Even in worlds where AIs do use lots of alien concepts, it feels like understanding human concepts is “earlier on the tech tree” than figuring out what to do with those alien concepts. Like, it is a hell of a lot easier to understand those alien concepts by first understanding human concepts and then building on that understanding, than by trying to jump straight to alien concepts.
What would constitute “understanding human concepts” in the relevant sense?
In another comment, I suggested that human concepts can be represented in human language. This might miss out on some important human mental content, but it would not miss out on anything that the magic box spits out, since the magic box is specifically dealing with language.
This trivializes the magic box; it becomes the identity function, or at best, a paraphrasing function. But what, exactly, is wrong with such a trivial understanding of the magic box? Where does it fall short of the sort of understanding you seek to achieve?
It frames things in terms of events (each event labeled with a natural-language sentence) rather than the random variables you want, but I can trivially reframe it in terms of random variables by treating each sentence's truth value as 0/1 instead of true/false.
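The reframing in that paragraph is simple enough to spell out in code. A hypothetical sketch (the toy "world" representation is my own, not from the discussion): an event is a predicate on worlds, and its indicator function is the corresponding 0/1 random variable.

```python
# Hypothetical sketch of the events-to-random-variables reframing.
# A "world" is modeled crudely as a dict of facts; each natural-language
# sentence labels an event (the set of worlds where it holds), and the
# indicator of that event is a 0/1 random variable.

def indicator(event):
    """Turn a boolean event (a predicate on worlds) into a 0/1 random variable."""
    return lambda world: 1 if event(world) else 0

world = {"the cow is in the field": True, "it is raining": False}

# The event labeled by a sentence: true in worlds where the sentence holds.
sentence = "the cow is in the field"
event = lambda w, s=sentence: bool(w.get(s, False))

X = indicator(event)  # the corresponding 0/1 random variable
print(X(world))  # 1
```

This is exactly why the proposal is "trivial": the random variables are just relabeled sentences, so nothing new is learned about the underlying concepts.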
Yes, I intuitively feel that this is a dumb trivial proposal that contributes nothing to our understanding of concepts. But, I quote:
At this point, we’re not even necessarily looking for “the right” class of random variables, just any class which satisfies the above criteria and seems approximately plausible.
One example: you know that thing where I point at a cow and say “cow”, and then the toddler next to me points at another cow and is like “cow?”, and I nod and smile? That’s the thing we want to understand. How the heck does the toddler manage to correctly point at a second cow, on their first try, with only one example of me saying “cow”? (Note that the same question still applies if they take a few tries, or have heard me use the word a few times.)
The post basically says that the toddler does a bunch of unsupervised structure learning, and then has a relatively small set of candidate targets, so when they hear the word once they can assign the word to the appropriate structure. And then we’re interested in questions like “what are those structures?”, and interoperability helps narrow down the possibilities for what those structures could be.
… and I don’t think I’ve yet fully articulated the general version of the problem here, but the cow example is at least one case where “just take the magic box to be the identity function” fails to answer our question.
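The structure-learning story above can be caricatured in a few lines of code. This is a hypothetical sketch only: the feature vectors, centroids, and cluster names are invented, and real unsupervised structure learning is far richer than nearest-centroid matching.

```python
# Hypothetical sketch of the "one example of 'cow'" story:
# unsupervised structure learning happens first (here, pretend the
# centroids below came out of it), and then a single labeled example
# attaches the word to an entire pre-existing cluster.

import math

# Clusters the toddler has already formed from unlabeled experience.
centroids = {
    "cluster_A": (1.0, 1.0),  # cow-like things
    "cluster_B": (9.0, 9.0),  # tree-like things
}

def nearest_cluster(x):
    """Assign an observation to its nearest pre-learned cluster."""
    return min(centroids, key=lambda c: math.dist(centroids[c], x))

# One labeled example: the adult points at a cow and says "cow".
labels = {nearest_cluster((1.2, 0.9)): "cow"}

# A *new* cow-like observation gets the word right on the first try,
# because the word was attached to pre-existing structure.
print(labels.get(nearest_cluster((0.8, 1.1))))  # cow
```

On this caricature, the open question in the text is what the real analogues of the clusters are, and interoperability constrains the answer; taking the magic box to be the identity function says nothing about that.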