As the internal ontology takes on any reflective aspects, parts of the representation that mix with facts about the AI’s internals, I expect to find much larger differences.
It could be worth exploring reflection in transparency-based AIs, whose internals are observable. We can train a learning AI that learns concepts only by grounding them in its own internals (consider a language-based AI learning a representation that links saying words to its output procedure). Even if AI-learned concepts do not coincide with human concepts, because the AI’s internals differ greatly from human experience (e.g. a notion of “easy to understand” would have only a metaphorical meaning for an AI), they remain interpretable to the AI’s programmer thanks to the AI’s transparency (and the programmer could engineer control mechanisms to deal with misalignment). In other words, there will be unnatural abstractions, but they will be discoverable provided we train a different kind of AI, as opposed to current methods, which are not inherently interpretable. This is monumental work, but desperately needed.
I am skeptical about your theory of impact for investigating the question of which concepts would be convergent across minds, specifically your expectation that concepts validated through linguistic conventions may assist in non-ad-hoc interpretability of deep learning networks. Yet, I am interested in investigating semantics for the purpose of alignment. Let me try to explain how my model differs from yours.
First, to study semantics productively, I recommend keeping a distinction between a semantics for vision (as the prototypical sensory input) and one for symbolic reasoning. I have the impression that your project can be described as curriculum learning for a visual reasoner. In the space of minds or programs, we have diffusion models and multi-modal LLMs, but there is room for programs (say, a probabilistic computer vision program) that learn to make visual predictions from visual data more efficiently than deep learning does. It is a legitimate research question how to design a curriculum that enables a visual reasoner to form concepts of increasing complexity, or whether an algorithm exists that lets the visual reasoner navigate autonomously through a mass of visual data to reach the same objective (in this paper, I described the visual reasoner as “visual AGI”).
Early milestones could be learning concepts of edges, three-dimensional space, and rigid-body motion (very much in the direction of your toy example). Other milestones would be natural latents for shadows, textures, shapes, leaves, trees, and the laws of physics. I do not see any fundamental reason to stop there: you can go on to plan for learning concepts for animals, for various actions, for the sense of sight (such as eyes as the organ of vision, and an animal’s focus of attention), and for animal behavior. Humans are a special case of animal, as ethology also shows.
I have two comments:
- you say you want to use language as one source of information about latents. There are two possibilities. The first involves the use of labels and even knowledge bases, but still requires a visual reasoner, only one assumed to operate on an augmented reality. This modality does not exist in nature and is available only to programs, but there certainly exist programs that successfully navigate linguistic conventions (see also Bill Benzon’s comment). It seems to me that you plan to use language in this way. The alternative is to postulate a symbolic reasoner that has somehow mastered “language”, but then a different kind of semantics needs to be accounted for, and I could not find any discussion of it in your post.
- a visual reasoner does not seem to be an existential risk for humanity. Thanks to deep learning, we already have visual reasoners (it was easy to turn visual predictors into visual reasoners, even if adversarial attacks are still possible against AI). The naturality of the laws of physics or of patterns of individual and social human behavior is a matter of scientific inquiry that may not, however, be central to AI safety. The study of the semantics of visual reasoning would support an interpretation of our AI as a naturalist, or even a human ethologist. By contrast, I see risks connected to the possible development of an AI scientist (or computer scientist), but this development depends on first developing symbolic reasoning. I consider LLMs to be symbolic predictors whose internals are difficult to interpret. Can LLMs be turned into (potentially deadly) symbolic reasoners? I am not an expert; such a possibility cannot be safely excluded. My intuitions below about a semantics for symbolic reasoning would not offer much of a hint to those looking to integrate these reasoning capabilities into LLMs, and would not help advance LLM interpretability either. However, I think they can help with AI safety.
In the space of minds or programs, there is room for programs that learn symbolic reasoning in a way that differs from human brains and that is transparent to their programmer. A symbolic reasoner can be a program exposed solely to words and other symbols: for example, it does not need any sense of sight, it can be very data-efficient, and it does not need to have any equivalent in nature. A symbolic reasoner would start from the concept that someone used words to say something and build increasingly complex reflective concepts about words, about the inventors of the words, and about their other inventions such as mathematics and computer programming. I pioneered investigation into symbolic reasoning some years ago. An early milestone is understanding reported saying (such as in “Augrh!” said Father Wolf. “It is time to hunt again.”). Other milestones are learning concepts of being an element of a class, of truth, of negation, of conjunction, of believing, of wanting, of functional application. You can see that the curriculum is completely different from that for a visual reasoner. The resulting semantics is less about the real world and more about reflection on a “mental” internal world.
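To make the first milestone concrete, here is a toy sketch in Python of what a transparent representation of reported saying could look like. It is only an illustration, not the approach I investigated: the `Said` record, the `REPORTED` pattern, and the `extract_sayings` function are hypothetical names, and the pattern is hand-written for a single surface form rather than learned. The point is only that every concept instance is an explicit record the programmer can read.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Said:
    """An inspectable concept instance: `speaker` said `words`."""
    speaker: str
    words: str


# Assumption: one surface form only, a curly-quoted utterance followed by
# "said" and a capitalized name. A real learner would have to acquire this
# pattern from data instead of being given it.
REPORTED = re.compile(r'“([^”]+)”\s+said\s+([A-Z]\w*(?:\s+[A-Z]\w*)*)')


def extract_sayings(text: str) -> list[Said]:
    """Return every reported saying found in `text` as a readable record."""
    return [
        Said(speaker=m.group(2), words=m.group(1))
        for m in REPORTED.finditer(text)
    ]


text = '“Augrh!” said Father Wolf. “It is time to hunt again.”'
facts = extract_sayings(text)
for fact in facts:
    print(fact)
```

Note that the sketch attributes only the first quotation to Father Wolf. Linking the second quotation to the same speaker already requires a further concept (continuation of speech), which hints at how quickly the curriculum has to grow.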
Now: can a semantics for symbolic reasoning help with AI safety? My intuition is that a symbolic reasoner with the property of transparency can be made to reason about the goals of its programmer, such that there is at least a technical possibility of alignment (as opposed to language models, whose internals we do not know how to interpret). A bonus is the possibility of experimenting with alignment at an early stage of the curriculum and of deciding how far to go with the curriculum (when it is safe to pursue distant milestones such as geometry, algebra, physics, biology).