An interesting analogy, closer to ML, is to look at neuroscience. It's an older field than ML, and the physics perspective there has been fairly productive, even though it has not yet provided a grand unified theory of cognition. Some examples:
- Using methods from electrical circuits to explain neurons (Hodgkin-Huxley model, cable theory)
- Dynamical systems to explain phenomena like synchronization in neuronal oscillations (e.g. the Kuramoto model; see the minimal simulation sketch after this list)
- Ising models to model some collective behaviour of neurons
- Information theory, commonly used in neuroscience to analyze neural data and model the brain (e.g. the efficient coding hypothesis)
- Attempts at general theories of cognition, like predictive processing or the free energy principle, which also have a strong physics inspiration (drawing from statistical physics and the least-action principle)
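To make the dynamical-systems example concrete, here is a minimal simulation sketch of the Kuramoto model: $N$ phase oscillators with random natural frequencies, coupled through the sine of their phase differences, synchronize once the coupling strength $K$ exceeds a critical value; the order parameter $r$ measures the degree of synchrony. The parameter values below are arbitrary choices, not from any particular paper.

```python
import numpy as np

# Kuramoto model: d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)
rng = np.random.default_rng(0)
N, K, dt, steps = 200, 2.0, 0.01, 2000

omega = rng.normal(0.0, 1.0, N)        # natural frequencies
theta = rng.uniform(0, 2 * np.pi, N)   # initial phases

def order_parameter(theta):
    # r in [0, 1]: ~1/sqrt(N) for incoherent phases, -> 1 under full synchronization
    return np.abs(np.mean(np.exp(1j * theta)))

print(f"initial r = {order_parameter(theta):.3f}")
for _ in range(steps):
    # element [i, j] of the matrix is sin(theta_j - theta_i); sum over j gives the coupling on i
    coupling = (K / N) * np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
    theta += dt * (omega + coupling)
print(f"final   r = {order_parameter(theta):.3f}  # well above 1/sqrt(N) when K is supercritical")
```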
I can recommend the book *Models of the Mind* by Grace Lindsay, which gives an overview of the many ways physics has contributed to neuroscience.
In principle, one might expect a physics perspective to make faster progress on AI than on neuroscience, for example because it is easier to do experiments in AI: in neuroscience we do not have access to the values of the weights, we do not always have access to all the neurons, and it is often not possible to intervene on the system.
Another perspective would be to look at the activations of an autoregressive deep learning model, e.g. a transformer, during inference as a stochastic process: the collection of activations $(X_t)$ at some layer as random variables indexed by time $t$, where $t$ is the token position.
One could for example look at the mutual information between the history $X^-_t = (X_t, X_{t-1}, \dots)$ and the future of the activations $X_{t+1}$, or look at the (conditional) mutual information between the past and future of subprocesses of $X_t$ (note: transfer entropy can be a useful tool to quantify directed information flow between different stochastic processes). There are many information-theoretic quantities one could look at.
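As a toy illustration of the mechanics involved, here is a sketch that estimates two such quantities with simple plug-in (histogram) estimators: the one-step predictive information $I(X_t; X_{t+1})$ of a process, and the transfer entropy between two processes. The processes are synthetic AR(1) stand-ins for real activation traces, the binning and history length are arbitrary choices, and plug-in estimates are biased for small samples; nothing here is specific to transformers.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000

# Synthetic stand-in for two activation traces: y is driven by x, so directed
# information should flow x -> y but not y -> x.
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + rng.normal()
    y[t] = 0.6 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

def discretize(z, bins=8):
    # equal-frequency binning of a real-valued trace into integer symbols
    edges = np.quantile(z, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(z, edges)

def entropy(*symbols):
    # plug-in joint entropy (in bits) of one or more aligned symbol sequences
    joint = np.stack(symbols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

xs, ys = discretize(x), discretize(y)

# One-step predictive information: I(X_t; X_{t+1}) = H(X_t) + H(X_{t+1}) - H(X_t, X_{t+1})
mi = entropy(xs[:-1]) + entropy(xs[1:]) - entropy(xs[:-1], xs[1:])

def transfer_entropy(src, dst):
    # TE(src -> dst) with history length 1:
    # I(dst_{t+1}; src_t | dst_t) = H(dst_{t+1}, dst_t) - H(dst_t)
    #                             - H(dst_{t+1}, dst_t, src_t) + H(dst_t, src_t)
    return (entropy(dst[1:], dst[:-1]) - entropy(dst[:-1])
            - entropy(dst[1:], dst[:-1], src[:-1]) + entropy(dst[:-1], src[:-1]))

print(f"I(X_t; X_t+1) ~ {mi:.3f} bits")
print(f"TE x -> y     ~ {transfer_entropy(xs, ys):.3f} bits")
print(f"TE y -> x     ~ {transfer_entropy(ys, xs):.3f} bits  # near 0 up to estimator bias")
```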
If you want to formally define a probability distribution over activations, you could maybe push forward the discrete probability distribution over tokens (in particular the predictive distribution) via the embedding map.
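Concretely, that pushforward is just a finite mixture of point masses: each token's predictive probability is assigned to its embedding vector. A minimal sketch, with a random embedding matrix and random logits standing in for a real model's:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64

E = rng.normal(size=(vocab_size, d_model))   # stand-in embedding map: token id -> R^d_model
logits = rng.normal(size=vocab_size)         # stand-in for a model's next-token logits

p = np.exp(logits - logits.max())
p /= p.sum()                                 # predictive distribution over tokens

# Pushforward of p along the embedding map: a discrete distribution on R^d_model
# supported on the rows of E, with atom weights p. Expectations of any statistic
# of the activation become p-weighted sums over the vocabulary, e.g. the mean:
mean_activation = p @ E

# Sampling from the pushforward = sampling a token, then embedding it:
token = rng.choice(vocab_size, p=p)
sample_activation = E[token]
```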
In the context of computational mechanics this seems like a useful perspective, for example for finding belief states in a data-driven way by optimizing the mutual information between some coarse-graining of the past states and the future states (stated like that this is still too vague, and I am working on a draft that goes into more detail on this perspective; the sketch below only gestures at the general shape of such an objective).
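To gesture at what such an objective could look like in code, here is a schematic of one standard way to maximize mutual information: learn a parametrized coarse-graining $f$ of past windows and maximize the InfoNCE lower bound on $I(f(\text{past}); \text{future})$. This is my illustration of a generic MI-maximization setup, not the method from the draft; all modules, dimensions, and names are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic only: learn a coarse-graining f of past windows that is maximally
# informative about the future, via the InfoNCE lower bound on mutual information.
past_dim, future_dim, code_dim = 32, 32, 8

f = nn.Sequential(nn.Linear(past_dim, 64), nn.ReLU(), nn.Linear(64, code_dim))    # coarse-graining of the past
g = nn.Sequential(nn.Linear(future_dim, 64), nn.ReLU(), nn.Linear(64, code_dim))  # projection of the future
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

def infonce_step(past, future):
    # past, future: matched (batch, dim) pairs taken from the same trajectories
    z_p, z_f = f(past), g(future)
    logits = z_p @ z_f.T                    # similarity of every past code with every future code
    labels = torch.arange(len(past))        # the true past/future pairing is the diagonal
    loss = F.cross_entropy(logits, labels)  # minimizing this maximizes an InfoNCE lower bound on I(f(past); future)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Shape check with random stand-in data (real use: windows of the activations X_t):
loss = infonce_step(torch.randn(256, past_dim), torch.randn(256, future_dim))
```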