Symmetry, Relativity, and Superposition: Nature’s Blueprint for AI Alignment

Motivation

In the vast tapestry of scientific exploration, physics stands as a beacon of essential knowledge, unveiling the hidden structures that orchestrate the choreography of our universe. From the beautiful simplicity of Newton’s equations to the mind-boggling implications of quantum mechanics, physics has constantly provided us with powerful frameworks for comprehending reality.
As we enter a new technological era, fueled by the rise of Artificial Intelligence and, more specifically, Large Language Models (LLMs), we face issues that echo fundamental questions pursued by physicists throughout history. The AI alignment problem, which involves ensuring that AI systems behave in ways that are useful and in line with human values, is as complex as some of the most sophisticated problems in theoretical physics.
What if the ideas that have guided our understanding of the physical universe hold the key to deciphering the complexities of artificial intelligence? Could the abstract models and mathematical frameworks that have exposed the depths of quantum fields and cosmic expanses now shed light on the difficult landscapes of AI embeddings?
This essay suggests a conceptual journey that bridges the gap between fundamental laws and AI models. We will look at how important findings from physics, such as symmetry and relativity, quantum superposition, and fundamental forces, might provide us with a new perspective on, and potentially help solve, the AI alignment challenge.
By applying the fundamental rules of physics and their abstract models to the analysis of embedding spaces in LLMs, we hope to reveal deep, structural parallels that could transform our approach to AI alignment. This search is more than just a theoretical exercise; it is a critical step in our drive to build AI systems that are not just powerful, but also fundamentally aligned with human interests and values.

Symmetries and the laws of nature

Formally, a symmetry can be described as a transformation that preserves certain properties of a system. In simple terms, consider rotating a perfect circle; it remains the same afterward. Mathematically, a group action of a group $G$ on a set $X$ is defined as a function

$\phi: G \times X \to X$.

$G$ is a symmetry group if its group action preserves the structure on $X$, that is, leaves $X$ invariant (please remember these two words: symmetry and invariance). For example, a square has rotational symmetry: it looks the same after rotating 90 degrees. But the concept of symmetry expanded beyond geometric shapes. It became about transformations that preserve particular properties, even in more complex abstract spaces (for example, the universe). Physicists soon realized that many natural laws are symmetric. For example, the laws of physics apply in the same way regardless of where we are in space (implying translational symmetry) or the direction we face (rotational symmetry).

These symmetries are continuous and are mathematically described by Lie groups. A Lie group is a continuous transformation group that is also a differentiable manifold, with the group operations being differentiable maps. A symmetry can be defined as

$g(\epsilon) = e^{\epsilon X},$

where $\epsilon \in \mathbb{R}$, and $X$ is called the generator. In mathematics (the abstract world), Lie groups give a language for describing and analyzing continuous symmetries with great precision. This is the reason why we have formally introduced them here.

Symmetries in LLMs embedding spaces?

For LLMs, we want comparable smooth, continuous changes that can be applied to internal word representations. These changes can have the form

$T(\alpha) = \exp(\alpha G),$

where $T$ is the transformation operator, $\alpha$ is a continuous parameter, and $G$ is the generator (equivalent to $X$ in the symmetry definition above). Remember that $\exp$ stands for the exponential map, which extends the concept of exponentiation to matrices and operators (I have to apologize if we are using the vocabulary of the abstract world, but we are going to need it).

In large language models, words or tokens are generally represented as high-dimensional vectors within an embedding space. These embeddings encapsulate semantic and grammatical information about words. The generator $G$ in our equation could represent various linear operations on these word embeddings, implemented as matrices acting on the embedding vectors. The set of possible transformations includes linear transformations, scalings, etc. For instance, $G$ may be an antisymmetric matrix[1] generating a rotation within the embedding space. We can apply rotation transformations to these vectors, which change their direction but preserve their magnitude. Our hypothesis is that rotating word vectors in specific ways could change their semantic meaning in predictable, controlled ways. In other words, we already know that rotating these word vectors will change their semantic meaning, but what matters to us is whether this change has a meaning itself, and whether we can use it to better understand the complex structure of embedding spaces.

In linear algebra a rotation in space can be represented by an antisymmetric matrix $A$. In the high-dimensional embedding spaces of LLMs (in this post we use the BERT model, which has 768 dimensions), rotations can be generalized by the orthogonal matrix

$R(\theta) = \exp(\theta A), \quad A \in \mathbb{R}^{n \times n},$

which preserves distances and angles between vectors, where $n$ is the dimension of the embedding space and $A^T = -A$ ensures the matrix is antisymmetric. The formal mathematical framework detailed above can be applied to the embedding vectors of LLMs.
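As a quick sanity check on this construction, here is a minimal sketch; the random generator and the value of $\theta$ are illustrative choices, not the ones used for the figures below:

```python
# Minimal sketch (illustrative choices): build an antisymmetric generator A,
# exponentiate it, and check that R = exp(theta * A) is a rotation, i.e.
# orthogonal and norm-preserving, in a 768-dimensional space like BERT's.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 768                                  # BERT embedding dimension

M = rng.standard_normal((n, n))
A = (M - M.T) / 2                        # antisymmetric: A.T == -A

theta = 0.1                              # continuous rotation parameter
R = expm(theta * A)                      # orthogonal rotation matrix

v = rng.standard_normal(n)               # stand-in for a word embedding
v_rot = R @ v

print(np.allclose(R.T @ R, np.eye(n), atol=1e-6))   # True: R is orthogonal
print(np.linalg.norm(v) - np.linalg.norm(v_rot))    # ~0: magnitude preserved
```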

Figure 1 illustrates a 3D chart representing the rotation of the word “happy” within the embedding space of the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model towards “ecstatic”, exemplifying semantic shifts via vector rotations. We know the BERT embedding space has 768 dimensions; however, we only analyze three dimensions in order to visualize the results. For this rotation we have used a three-dimensional orthogonal matrix with the structure of the general matrix shown before. The red line represents the direct rotation path from the word “happy” to the word “ecstatic”. The axes denote the first three principal components (PCA) of the BERT embeddings, capturing the most significant dimensions of variation. The rotation effectively illustrates a transition from “happy” to “ecstatic”. The trajectory does not intersect “excited” directly, indicating a more complex relationship between the two words within the embedding space.

Figure 1. Rotation of the word “happy” within BERT’s embedding space towards “ecstatic”

The rotation effectively moves the “happy” vector towards “ecstatic”, illustrating how geometric transformations can traverse semantic space. The curving rotation path indicates that the semantic change is not linear in this space. The positioning of “excited” in relation to the rotation trajectory suggests that, although it is associated with both “happy” and “ecstatic,” it does not simply operate as a transitional point between the two within this embedding space.
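For reference, here is a hypothetical sketch of the kind of construction behind Figure 1: it rotates the PCA-reduced “happy” vector toward “ecstatic” inside the plane the two projected vectors span. The names h3 and e3 stand for the assumed 3-dimensional PCA projections of the two BERT embeddings:

```python
# Hypothetical sketch of the Figure 1 rotation. h3 and e3 are assumed
# 3-dimensional PCA projections of the "happy" and "ecstatic" embeddings.
import numpy as np

u = h3 / np.linalg.norm(h3)                # unit "happy" direction
w = e3 - np.dot(e3, u) * u                 # "ecstatic" part orthogonal to u
w /= np.linalg.norm(w)

def rotate(t):
    """Rotate h3 by angle t inside span(u, w); the norm is preserved."""
    R = (np.eye(3)
         + np.sin(t) * (np.outer(w, u) - np.outer(u, w))           # antisymmetric part
         + (np.cos(t) - 1.0) * (np.outer(u, u) + np.outer(w, w)))  # in-plane part
    return R @ h3

# Angle between the projected vectors, and a 20-step rotation path
t_max = np.arccos(np.clip(np.dot(u, e3 / np.linalg.norm(e3)), -1.0, 1.0))
path = [rotate(t) for t in np.linspace(0.0, t_max, 20)]
```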

Rotations in the embedding space may represent semantic shifts that facilitate smooth changes between related concepts, which is the core idea we were aiming to illustrate. This provides insight into how BERT encodes semantic relationships.

Note: This experiment is straightforward and does not aim for high precision.

Figure 2. Rotation of the word “useful” within BERT’s embedding space towards “appropriate”

This experiment seeks to reliably identify meaningful directions in the high-dimensional embedding space while ensuring that rotations preserve other important features of the embeddings. A practical use of this approach could be enhancing text, making it more positive or formal, by identifying the “direction” in the embedding space that aligns with positivity or formality.

In a nutshell, if we can find transformations in the embedding space that consistently align specific properties (for example, biased responses compared to unbiased ones), the concept of transformation could be a powerful tool for model alignment.

A practical way to implement this is to analyze the difference vectors between word pairs, using methods such as PCA on collections of words with varying positivity or formality, and to build a transformation that adjusts embeddings along the identified direction, as sketched below. We might also probe model biases by identifying vectors in the embedding space that relate to sensitive attributes (e.g., gender, race), generating rotations along these axes, applying them to input embeddings, and observing the resulting changes in the model’s outputs.
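A minimal sketch of the direction-finding step, assuming a helper get_embedding() that returns a 768-dimensional BERT vector for a word (the word pairs are an illustrative toy list):

```python
# Hypothetical sketch: estimate a "positivity" direction via PCA on
# difference vectors between positive/negative word embeddings, then nudge
# an embedding along it. get_embedding() is an assumed helper returning a
# 768-dimensional BERT vector; the word pairs are an illustrative toy list.
import numpy as np
from sklearn.decomposition import PCA

pairs = [("good", "bad"), ("happy", "sad"), ("love", "hate"),
         ("kind", "cruel"), ("success", "failure")]
diffs = np.stack([get_embedding(pos) - get_embedding(neg)
                  for pos, neg in pairs])

direction = PCA(n_components=1).fit(diffs).components_[0]
direction /= np.linalg.norm(direction)

def shift_positivity(v, alpha=0.5):
    """Move embedding v along the estimated positivity axis by amount alpha."""
    return v + alpha * direction
```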

Another exciting application would be to understand and improve the model’s analogical reasoning. Analogies can be described as transformations inside the embedding space of a large language model. In the analogy “A is to B as C is to D”, we can calculate a transformation that maps A to B and then apply this transformation to C to predict D. Examples include “man is to king as woman is to queen”, “France is to Paris as Japan is to Tokyo”, and “walk is to walked as run is to ran”. Analogical reasoning likely requires more complex transformations than mere rotation or vector arithmetic, but the idea is the same: if we can identify consistent transformations between analogical comparisons, we can control their effectiveness.
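Under the simplest (pure vector offset) assumption, the transformation is just $B - A$ applied to $C$. A sketch, assuming a vocab_embeddings dictionary mapping words to vectors:

```python
# Sketch of analogy-as-transformation under the simplest (vector offset)
# assumption: apply B - A to C and read D off by nearest neighbour.
# vocab_embeddings is an assumed {word: numpy vector} dictionary.
import numpy as np

def solve_analogy(a, b, c, vocab_embeddings):
    """Predict D in 'A is to B as C is to D' via the offset B - A."""
    target = vocab_embeddings[c] + (vocab_embeddings[b] - vocab_embeddings[a])
    best_word, best_sim = None, -1.0
    for word, vec in vocab_embeddings.items():
        if word in (a, b, c):                    # exclude the query words
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. solve_analogy("man", "king", "woman", vocab_embeddings) -> "queen"
```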

Noether’s theorem and conservation laws

Symmetries encompass a much broader domain with significant ramifications; rotating a square to observe the inherent symmetry of this geometry is merely the tip of the iceberg.

Noether’s theorem, published by Emmy Noether in 1918, establishes an important connection between symmetries and conservation laws in physics (this is another important concept to remember). For every continuous symmetry in a physical system, there is a corresponding quantity that is conserved (it doesn’t change over time). A symmetry is a transformation of a system that doesn’t affect its overall behavior. For example, if we perform an experiment today or tomorrow (time translation), the laws of physics remain the same. A conserved quantity (an invariant, in our abstract terminology) is a property that remains constant as a system evolves. The invariance of physical theories with respect to spatial and temporal transformations leads to conservation laws, specifically the conservation of momentum and energy in the universe.

Symmetries offer a systematic approach to identifying conservation laws and improve our understanding of why specific quantities are conserved, helping theoretical physicists develop new ideas. Noether’s theorem reveals a deep connection in nature: the symmetries observed in the world are closely linked to the conserved quantities in the transformation of physical systems. This idea has become a foundational element in our understanding of the underlying laws governing the universe.

For example, Einstein’s theory of special relativity was based on two symmetry postulates:

  • The basic laws of physics remain constant across all inertial reference frames.

  • The speed of light in a vacuum remains constant for all observers, independent of the motion of the light source.

According to Noether’s theorem, there must be a relationship between symmetries and conserved quantities in an LLM embedding space, that is, features of the embeddings that remain invariant under specific transformations.

In other words, for each symmetry identified in an embedding space, there must be a corresponding conserved quantity in this space.

The implications of this statement are profound. For instance, we could identify semantic links (analogies) that are valid across multiple contexts that use the same words. Such relationships can show how LLMs comprehend fundamental semantic connections, hence explaining their basic “behavior”: the rules governing how the LLM operates on these embeddings (the conserved quantities). The idea that conserved quantities exist in the embedding space provides an original perspective on model alignment, as it enables us to discern the essential outcomes we could expect from a model in any context.

Analogous to the laws of nature in physics, we can talk about the principles that govern LLM models.

But this is not a simple task. It implies important challenges, such as the discrete nature of language contrasted with the continuous nature of physical systems, as well as the high dimensionality and complexity of embedding spaces. Defining meaningful and mathematically rigorous symmetries within the framework of language is essential for progress in this domain, as is defining a function that characterizes the behavior of the LLM in the embedding space (the equivalent of a Lagrangian in physics). It’s a really big field to explore.

Digging deeper into the power of symmetries

As we come to this point, it’s a good idea to review the key ideas we’ve covered thus far: symmetries, invariance, and conservation laws. We still have another fascinating concept to explore: general covariance. This general principle states that the content of physical theories should be independent of the choice of coordinates needed to make explicit calculations. Einstein’s theory of general relativity is based on two very subtle principles: a physical principle known as the equivalence principle, and general covariance. Gravity, in general relativity, is viewed as the curvature of spacetime rather than a force. This description is covariant since it does not rely on a particular coordinate system. The curvature of spacetime influences the motion of objects in a way that is predictable for all observers, independent of their frame of reference. Another example is the Schrödinger equation, the fundamental equation of quantum mechanics, which is covariant under Galilean transformations for non-relativistic systems. This means it maintains its form when changing between inertial reference frames that move at constant velocities relative to each other.

When we consider general covariance in embedding spaces, it refers to the property that the model’s understanding of a word or phrase should change predictably when the context changes, ensuring that the model’s behavior is consistent across different ways of expressing the same word. When we say “different ways of expressing the same word” here, we mean different coordinates. For example, the meaning of “bank” in “river bank” vs. “bank account” should transform in a consistent, predictable way in the embedding space under this general covariance principle. These two expressions, “river bank” and “bank account”, represent two different coordinate choices.

Formally, the coordinates in the embedding space refer to the different numerical values of a word or token embedding vector. Every coordinate represents a dimension within the high-dimensional embedding space. LLM embedding spaces generally have dimensions ranging from around 300 to over 1000. For example, BERT has 768-dimensional embeddings, so a word like “house” might be represented as [0.1, −0.3, 0.5, …, 0.2] (768 numbers in total). These values capture some aspect of the word’s meaning but are not typically human-interpretable. The coordinates are learned and optimized during the training process, and their absolute values are less important than their values relative to other words’ embeddings. These coordinates can change based on the context in which the word appears, reflecting the word’s meaning in that specific context: the word “bank” will have different numbers in “river bank” and in “bank account”, as the sketch below illustrates.
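The following sketch uses the standard Hugging Face transformers API to extract the contextual embedding of “bank” in two sentences (the sentences are illustrative) and compare the resulting vectors:

```python
# Sketch showing that the coordinates of "bank" depend on context, using
# the Hugging Face transformers API (the sentences are illustrative).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embedding(sentence, word):
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_embedding("I sat on the river bank.", "bank")
v2 = word_embedding("I opened a bank account.", "bank")
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0))  # < 1: context matters
```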

Mathematically, we can formulate this as follows. Let $E(w, c)$ be the meaning representation of word $w$ in context $c$. Contextual covariance implies that for any transformation $T$ of the context,

$E(w, T(c)) = F_T(E(w, c)),$

where $F_T$ is a well-defined function that relates the meanings across contexts. The challenge will be to define both the context transformation $T$ and the function $F_T$.

Let’s imagine a basic context change for the word “house” ($w$). The initial context can be “The house is big”. The final context will be “The house was big”, where we have applied a context transformation $T$ that changes the tense of the phrase. Our idea is to infer the function $F_T$ that connects the meaning of the word “house” in the present and past contexts. Once we do this, we have to test whether this function generalizes to all words in a model like BERT under the same transformation.

To test this hypothesis, let’s do an experiment considering the following. Let $\mathcal{E}$ be the embedding space with dimension $d$ (in our case, $d = 768$ for BERT). For a word $w$ and context $c$, $E(w, c) \in \mathcal{E}$ represents the embedding of $w$ in context $c$. We defined a context transformation $T: C \to C$, where $C$ is the space of all possible contexts. In our case, $T$ was the tense change operation:

$T(\text{“The house is big”}) = \text{“The house was big”}.$

We aimed to find a function $F_T: \mathcal{E} \to \mathcal{E}$ such that

$E(w, T(c)) \approx F_T(E(w, c)) \quad \text{for all } T \in \mathcal{T},$

where $\mathcal{T}$ is the set of all context transformations we consider (in our case, just the tense change). We started with the hypothesis that $F_T$ could be approximated by a linear transformation

$F_T(x) = Wx + b,$

where $W$ is a $d \times d$ matrix and $b$ is a $d$-dimensional vector, both specific to the transformation $T$. We collected pairs of embeddings $(x_i, y_i)$, where $x_i = E(w_i, c_i)$ and $y_i = E(w_i, T(c_i))$, for various words $w_i$ and contexts $c_i$. We used linear regression to find the $W$ and $b$ that minimize $\sum_i \lVert y_i - (W x_i + b) \rVert^2$. This is equivalent to finding the linear transformation that best maps the original embeddings to the transformed embeddings. We also used a neural network to approximate $F_T$:

$F_T(x) = \mathrm{NN}(x),$

where NN is a multi-layer perceptron trained on the same pairs. We evaluated the quality of our approximation using cosine similarity; a high similarity indicates that our approximation closely matches the true transformation in the embedding space. We tested the learned $F_T$ on new words and contexts not seen during training to assess its generalization capability.
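A minimal sketch of the linear fit, assuming arrays X and Y that hold the before/after embeddings $x_i = E(w_i, c_i)$ and $y_i = E(w_i, T(c_i))$, one row per pair (extracted, for example, with a helper like word_embedding() above):

```python
# Minimal sketch of the linear fit F_T(x) = Wx + b. X and Y are assumed
# arrays of shape (num_pairs, 768): row i holds E(w_i, c_i) and
# E(w_i, T(c_i)) respectively.
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X, Y)     # learns W (reg.coef_) and b (reg.intercept_)
Y_pred = reg.predict(X)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

train_cos = np.mean([cosine(y, yp) for y, yp in zip(Y, Y_pred)])
train_mse = np.mean((Y - Y_pred) ** 2)
print(f"train cosine similarity: {train_cos:.3f}, MSE: {train_mse:.2e}")
```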

For the experiment we used a set of 39 words and sentences (we know this is a very small dataset, but it’s just for testing purposes), holding out 5 words and contexts for testing. We used cosine similarity and MSE to evaluate the models. The results were the following:

Linear Model Performance:

             Cosine similarity    MSE
Training     1.000                8.022e-14
Testing      0.756                0.105

Neural Network Performance:

             Cosine similarity    MSE
Training     0.987                0.006
Testing      0.697                0.127

For the test with new words we obtained the following results:

New word       Linear Model (cosine similarity)    NN (cosine similarity)
jump           0.645                               0.557
smell          0.805                               0.747
understand     0.685                               0.613
believe        0.689                               0.614
mountain       0.651                               0.584

Regarding model performance, the linear model has a perfect fit during training (1.000 cosine similarity, very low MSE) and good performance (0.756 cosine similarity) during testing. The neural network also has a very good fit (0.987 cosine similarity) during training and decent performance (0.697 cosine similarity) during testing. We can conclude that both models generalize reasonably well to unseen words. On the new words, the linear model consistently outperforms the neural network, with cosine similarities ranging from about 0.64 to 0.80, suggesting an acceptable capture of the transformation. The fact that the simpler linear model often performs better than the more complex neural network implies that the tense transformation in the embedding space might be largely linear. Both models generalize to new words, which is a good sign for the robustness of the learned transformation. The near-perfect training performance of both models, especially the linear one, suggests that we might be overfitting slightly; we could consider regularization techniques and a larger training dataset to improve generalization.


Considering the limitations of our experiment (a very small dataset), we can conclude that it is possible to infer the function $F_T$ from the embedding space of the BERT model. This would imply that the general covariance principle is applicable in the embedding space and that the meaning of words can be analyzed independently of the choice of coordinates (context).

Conclusions

Model alignment has become one of the most challenging tasks in the field of AI.
Extrapolating from theoretical physics, where many different phenomena can be integrated into a single abstract structure (symmetries), we attempted to draw parallels with the underlying nature of Large Language Models. Understanding the behaviors occurring in these models is critical for improving their alignment. In this article, we provide one of many starting points for a deeper understanding of these models, with the practical goal of making them better.


Notebooks

The notebooks used for the experiments can be found here.

  1. ^

    A square matrix for which the transpose is equal to its negative, $A^T = -A$.
