As always, nice post. The problem does indeed seem central to many applications of abstraction, especially assuming, as you do, that alignment reduces to translation between our ontology and the AI’s ontology.
I especially like this summary/main takeaway:
Things should ultimately be groundable in abstraction from the low level, but it seems like we shouldn’t need a detailed low-level model in order to translate between ontologies.
Also, reading this, it seems like you consider that you have solved abstraction (you write about this being your next project). Is that the case, or are you just switching problems for a while to keep things fresh?
At this point, I think I personally have enough evidence to be reasonably sure that I understand abstraction well enough that it’s not a conceptual bottleneck. There are still many angles to pursue: I still don’t have efficient abstraction-learning algorithms, there are probably good ways to generalize it, and of course there’s empirical work. I also do not think that other people have enough evidence that they should believe me at this point, when I claim to understand it well enough. (In general, if someone makes a claim and backs it up by citing X, then I should assign the claim lower credence than if I had stumbled on X organically, because the claimant may have found X via motivated search. This leads to an asymmetry: sometimes I believe a thing, but I do not think that my claiming it should be sufficient to convince others, because others do not have visibility into my search process. Also, I just haven’t clearly written up every little piece of evidence.)
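To make the filtered-evidence point concrete, here is a minimal numeric sketch (toy numbers of my own choosing, nothing from the post or this thread): compare the posterior you get from stumbling on one supporting study at random versus being handed a supporting study by someone who could search through many.

```python
# Toy illustration of the motivated-search asymmetry (hypothetical numbers).
# Hypothesis H has prior 0.5. Each independent "study" supports H with
# probability 0.7 if H is true and 0.3 if H is false.

P_H = 0.5               # prior on the hypothesis
p_support_true = 0.7    # P(study supports H | H true)
p_support_false = 0.3   # P(study supports H | H false)
N = 10                  # studies a motivated claimant can search through

# Case 1: I find one random study and it happens to support H.
posterior_organic = (p_support_true * P_H) / (
    p_support_true * P_H + p_support_false * (1 - P_H)
)

# Case 2: the claimant searched N studies and reported a supporting one.
# The evidence I actually receive is only "at least one of N studies supports H".
lik_true = 1 - (1 - p_support_true) ** N
lik_false = 1 - (1 - p_support_false) ** N
posterior_motivated = (lik_true * P_H) / (lik_true * P_H + lik_false * (1 - P_H))

print(f"posterior, organic find:   {posterior_organic:.3f}")    # ~0.700
print(f"posterior, motivated find: {posterior_motivated:.3f}")  # ~0.507
```

With these (made-up) numbers, the organically found study moves the posterior to about 0.70, while the same study reported by someone who could search barely moves it past 0.51, which is the asymmetry described above.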
Anyway, when I consider what barriers are left, assuming my current model of abstraction (and of how it plays with the world) is close enough to correct, the problems in the OP are the biggest. One of the main qualitative takeaways from the abstraction project is that clean cross-model correspondences probably do exist surprisingly often (a prediction which neural network interpretability work has confirmed to some degree). But that’s an answer to a question I don’t yet know how to properly set up, and the details of the question itself seem important. What criteria do we want these correspondences to satisfy? What criteria does the abstraction picture predict they satisfy in practice? What criteria do they actually satisfy in practice? I don’t know yet.