I will certainly agree that a big problem for the FEP is its presentation. They start with the equations of mathematical physics and show how to get from there to information theory, inference, beliefs, etc., because they are trying to get from matter to mind. But they could have gone the other way, since all the equations of mathematical physics have an information-theoretic derivation that includes a notion of free energy. This means that all the stuff about the Langevin dynamics of sparsely connected systems (the ‘particular’ FEP) could have been relegated to a footnote in a much simpler derivation.
As you note, the other problem with the FEP is that it seems to add very little to the dominant RL framework. I would argue that this is because its proponents are not really interested in designing better agents, but rather in figuring out what it means for mind to arise from matter. So basically it is physics-inspired philosophy of mind, which does sound like something that has no utility whatsoever. But explanatory paradigms can open up new ways of thinking.
For example, and relevant to your interests, it turns out that the FEP definition of an agent has the potential to bypass one of the more troubling AI-safety concerns associated with RL. When using RL there is a substantial concern that straight-up optimizing a reward function can lead to undesirable results, e.g. the imperative to ‘end world hunger’ leads to ‘kill all humans’. In contrast, in the standard formulation of the FEP the reward function is replaced by a stationary distribution over actions and outcomes. This suggests the following paradigm for developing a safer AI agent. Observe human decision-making in some domain to get a stationary distribution over actions and outcomes that are considered acceptable, though perhaps not optimal. Then optimize the free energy of the expected future (FEEF), applied to the observed distribution of actions and outcomes (instead of just outcomes, as is usually done), to train an agent to reproduce human decision-making behavior. Assuming it works, you now have an automated decision-maker that, on average, replicates human behavior, i.e. you have an agent that is weakly equivalent to the average human. Now suppose there are certain outcomes that we would like to make happen more frequently than human decision-makers have been able to achieve, but we don’t want the algorithm to take any drastic actions. No problem: train a second agent to produce this new distribution of outcomes while keeping the stationary distribution over actions the same.
This is not guaranteed to work, since some outcome distributions are inaccessible, but one could conceive of an iterative process in which you explore the space of accessible outcome distributions by slightly perturbing the outcome distribution, retraining, and repeating...
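To make the shape of that loop concrete, here is a minimal tabular sketch. Everything in it (the toy world model `p_o_given_a`, the hand-picked human policy, the closed-form `fit_policy`) is an assumption made purely for illustration; a real FEEF-based agent would involve a learned generative model and a far richer objective.

```python
import numpy as np

# A tabular toy of the scheme above. Everything here -- the tiny world model,
# the hand-picked distributions, fit_policy -- is my own illustrative
# construction, not something taken from an FEP/FEEF paper.

rng = np.random.default_rng(0)
n_actions, n_outcomes = 3, 4

# A fixed "world": outcome probabilities for each action, p(o|a).
p_o_given_a = rng.dirichlet(np.ones(n_outcomes), size=n_actions)   # shape (a, o)

# Step 1: observed human stationary statistics over actions and outcomes.
human_policy = np.array([0.6, 0.3, 0.1])                  # observed p(a)
p_human = human_policy[:, None] * p_o_given_a             # joint p(o, a), stored as [a, o]

def fit_policy(p_target):
    """Return the action distribution pi(a) whose induced joint
    pi(a) * p(o|a) minimizes KL(q || p_target). In this tabular toy the
    minimizer has a closed form: pi(a) proportional to exp(-c(a)), with
    c(a) = sum_o p(o|a) * [log p(o|a) - log p_target(a, o)]."""
    c = np.sum(p_o_given_a * (np.log(p_o_given_a) - np.log(p_target)), axis=1)
    pi = np.exp(-(c - c.min()))            # shift for numerical stability
    return pi / pi.sum()

# Agent 1: weakly equivalent to the average human (here it recovers
# human_policy exactly, since the target was generated by that policy).
pi_mimic = fit_policy(p_human)

# Step 2: keep the action marginal, ask for a shifted outcome marginal.
desired_outcomes = np.array([0.1, 0.1, 0.1, 0.7])
pi_shifted = fit_policy(human_policy[:, None] * desired_outcomes[None, :])

# Step 3: the iterative idea -- nudge the outcome marginal a little at a
# time, refit, and track which outcome distributions are actually reachable.
achieved = pi_mimic @ p_o_given_a                          # outcomes under agent 1
for _ in range(10):
    target_outcomes = 0.9 * achieved + 0.1 * desired_outcomes   # small perturbation
    pi = fit_policy(human_policy[:, None] * target_outcomes[None, :])
    achieved = pi @ p_o_given_a                            # accessible outcomes under new policy
print("shifted policy:", pi_shifted)
print("final policy:", pi, "achieved outcome marginal:", achieved)
```

The point is only the structure: fit the agent to the observed blanket statistics, then pin the action marginal while nudging the outcome marginal, retrain, and watch which outcome distributions turn out to be accessible.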
The short answer is that, in a POMDP setting, FEP agents and RL agents can be mapped onto one another via an appropriate choice of reward function and inference algorithm. One of the goals of the FEP is to come up with a normative definition of the reward function (google the misleadingly titled “optimal control without cost functions” paper or, for a non-FEP version of the same thing, the accurately titled “Revisiting Maximum Entropy Inverse Reinforcement Learning”). Despite the very different approaches, the underlying mathematics is very similar, as both are strongly tied to KL control theory and Jaynes’ maximum entropy principle. But the ultimate difference between the FEP and RL in a POMDP setting is how an agent is defined. RL needs an inference algorithm and a reward function that operates on actions and outcomes, R(o,a). The FEP needs stationary blanket statistics, p(o,a), and nothing else. The inverse reinforcement learning paper shows how to go from p(o,a) to a unique R(o,a), assuming a Bayes-optimal RL agent in an MDP setting. Similarly, if you start with R(o,a) and optimize it, you get a stationary distribution p(o,a), which is also unique under some ‘mild’ conditions. So they are more or less equivalent in terms of expressive power. Indeed, you can generalize all this crap to show that any subsystem of any physical system can be mathematically described as a Bayes-optimal RL agent, and you can even identify the reward function with a little work. I believe this is why we intuitively anthropomorphize physical systems, i.e. why we say things like the system is “seeking” a minimum-energy state.
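As a cartoon of that equivalence (my own toy, not the construction in either paper): under a max-ent / KL-control style assumption, optimizing a reward R yields the stationary distribution p(o,a) proportional to exp(R(o,a)), so the reward can be read back off the blanket statistics as R(o,a) = log p(o,a), up to an additive constant.

```python
import numpy as np

# Toy round trip between a reward function R(o, a) and stationary blanket
# statistics p(o, a), under the simplifying assumption that the "optimal"
# stationary distribution is softmax(R). Illustrative only.

rng = np.random.default_rng(0)
R = rng.normal(size=(4, 3))        # arbitrary reward table, R[o, a]

# Forward: reward -> stationary distribution (the "FEP agent").
p = np.exp(R)
p /= p.sum()

# Backward: stationary distribution -> reward, recovered up to a constant.
R_recovered = np.log(p)

# Same reward up to an additive constant, hence the same behavior.
print(np.allclose(R - R.mean(), R_recovered - R_recovered.mean()))   # True
```

The real IRL result is much stronger, since it handles dynamics and inference properly, but the same flavor of ambiguity shows up there too: behavior pins the reward down only up to transformations that leave the behavior unchanged.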
But regardless, from a pragmatic perspective they are equally expressive mathematical systems, and the advantage of one over the other depends on your prior knowledge and goals. If you know the reward function and know how the world works, use RL. If you know the reward function but are in a POMDP setting without knowledge of how the world works, use an information-seeking version of RL (max-ent RL or Bayesian RL). If you don’t know the reward function but do know how the world works and have observations of behavior, use max-ent inverse RL.
The problem with RL is that it’s unclear how to use it when you don’t know how the world works and you don’t know what the reward function is, but you do have observations of behavior. This is the situation when you are modeling behavior, as in the URL you cited. In this setting, we don’t know what model humans are using to form their inferences and we don’t know what motivates their behavior. If we are lucky we can glean some notion of their policy by observing behavior, but usually that notion is very coarse, i.e. we may only know the average distribution of their actions and observations, p(o,a). The utility of the FEP is that p(o,a) defines the agent all by itself. This means we can start with a policy and infer both beliefs and reward. This is not something RL was designed to do: RL goes from reward and beliefs (or belief-formation rules) to a policy, not the other way around. IRL can go backward, but only if the beliefs are Bayes-optimal.
As for the human brain, I am fully committed to the Helmholtzian notion that the brain is a statistical learning machine, as in the Bayesian brain hypothesis, with the added caveat that it is important to remember that the brain is massively suboptimal.