Tomorrow can be brighter than today
Although the night is cold
The stars may seem so very far away
But courage, hope and reason burn
In every mind, each lesson learned
Shining light to guide our way
Make tomorrow brighter than today
Thanks, it seems like the link got updated. Fixed!
Quick paper review of Measuring Goal-Directedness from the causal incentives group.
tl;dr, goal directedness of a policy $\pi$ wrt a utility function $U$ is measured by its min distance to one of the policies implied by the utility function, as per the intentional stance—that one should model a system as an agent insofar as doing so is useful.
Details
how is “policies implied by the utility function” operationalized? given a value $v$, we define a set $\Pi_v$ containing policies of maximum entropy (of the decision variable, given its parents in the causal bayes net) among those policies that attain the utility $v$.
then union them over all the achievable values $v$ of $U$ to get this “wide set of maxent policies,” and define goal directedness of a policy $\pi$ wrt $U$ as the maximum (negative) cross entropy between $\pi$ and an element of the above set. (actually we get the same result if we quantify the min operation over just the set of maxent policies achieving the same utility as $\pi$.)
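If I'm reading the construction right, in symbols (my paraphrase, notation mine, so worth double-checking against the paper): writing $\Pi_v$ for the set of maxent policies attaining utility $v$,

$$\mathrm{GD}(\pi; U) \;=\; \max_{\pi' \in \bigcup_v \Pi_v} \; -H(\pi, \pi')$$

where $H(\pi, \pi')$ denotes the cross entropy between $\pi$ and $\pi'$.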
Intuition
intuitively, this is measuring: “how close is my policy to being ‘deterministic,’ while ‘optimizing at the competence level $v$’ and not doing anything else ‘deliberately’?”
“close” / “deterministic” ~ large $-H(\pi, \pi')$ means small $H(\pi, \pi')$; and since cross entropy upper-bounds entropy, small $H(\pi, \pi')$ forces $\pi$ itself to be low-entropy, i.e. near-deterministic
“not doing anything else ‘deliberately’” ~ because we’re quantifying over maxent policies: the policy is maximally uninformative/uncertain, i.e. it doesn’t take any ‘deliberate’ (low-entropy) action, etc.
“at the competence level $v$” ~ … under the constraint that it is identically competent to $\pi$
and you get the nice property of the measure being invariant to translation / scaling of $U$.
obviously so, because a policy is maxent among all policies achieving utility $v$ on $U$ iff that same policy is maxent among all policies achieving $av + b$ on $aU + b$ (say with $a > 0$), so these two utilities have the same “wide set of maxent policies.”
Critiques
I find this measure problematic in many places, and am confused about whether it is conceptually correct.
one property claimed is that the measure is maximal for a uniquely optimal / anti-optimal policy.
it’s interesting that this measure of goal-directedness isn’t exactly an ~increasing function of the attained utility $v$, and i think that makes sense. i want my measure of goal-directedness to, when evaluated relative to human values, return a large number for both aligned ASI and signflip ASI.
… except, going through the proof one finds that the latter property heavily relies on the “uniqueness” of the policy.
My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn’t clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Relatedly, the quantification only happens over policies of the same competence level, which feels problematic.
minimum for the uniformly random policy (this would’ve been a good property, but unless I’m mistaken the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
honestly the maxent motivation isn’t super clear to me.
not causal. the reason you need causal interventions is that you want to rule out accidental agency/goal-directedness, like a rock that happens to be the perfect size to seal a water bottle—does your rock adapt when I intervene to change the size of the hole? Discovering Agents is excellent in this regard.
Thank you, that is very clarifying!
I’ve been doing a deep dive on this post, and while the main theorems make sense I find myself quite confused about some basic concepts. I would really appreciate some help here!
So ‘latents’ are defined by their conditional distribution functions, whose shape is implicit in the factorization that the latents need to satisfy, meaning they don’t always have to look like $P[\Lambda \mid X]$, they can look like $P[\Lambda \mid X_1]$, etc, right?
I don’t get the ‘standard form’ business. It seems like a procedure to turn one latent variable into another relative to $X$? I don’t get what the notation means—does it mean that it takes $\Lambda$, defined by some conditional distribution function like $P[\Lambda \mid X]$, and converts it into a new latent $\Lambda'$? That doesn’t seem so; the notation looks more like a likelihood function than a conditional distribution. But then what conditional distribution defines this latent $\Lambda'$?
The Resampling stuff is a bit confusing too:
if we have a natural latent $\Lambda$, then construct a new natural latent $\Lambda'$ by resampling $\Lambda$ conditional on $X$ (i.e. sample from $P[\Lambda \mid X]$), independently of whatever other stuff we’re interested in.
I don’t know what operation is being performed here—what CPDs come in, what CPDs come out.
“construct a new natural latent $\Lambda'$ by resampling $\Lambda$ conditional on $X$ (i.e. sample from $P[\Lambda \mid X]$), independently of whatever other stuff we’re interested in.” isn’t this what we are already doing when stating a diagram like $X_1 \leftarrow \Lambda \rightarrow X_2$, which implies a factorization $P[\Lambda]\,P[X_1 \mid \Lambda]\,P[X_2 \mid \Lambda]$, none of which mention the “other stuff”! What changes when resampling? aaaaahhh I think I’m really confused here.
Also does all this imply that we’re starting out assuming that $\Lambda$ shares a probability space with all the other possible latents, e.g. $\Lambda'$? How does this square with a latent variable being defined by the CPD implicit in the factorization?
And finally:
In standard form, a natural latent is always approximately a deterministic function of $X$. Specifically: $H(\Lambda \mid X) \approx 0$.
...
Suppose there exists an approximate natural latent $\Lambda$ over $X_1, \ldots, X_n$. Construct a new random variable $X'$ sampled from the distribution $P[X' = x' \mid X = x] = \prod_i P[X_i = x'_i \mid X_{\bar{i}} = x_{\bar{i}}]$. (In other words: simultaneously resample each $X_i$ given all the others.) Conjecture: $X'$ is an approximate natural latent (though the approximation may not be the best possible). And if so, a key question is: how good is the approximation?
Where is the top result proved, and how is this statement different from the Universal Natural Latent Conjecture below? Also is this post relevant to either of these statements, and if so, does that mean they only hold under strong redundancy?
Does anyone know if Shannon arrived at entropy from the axiomatic definition first, or the operational definition first?
I’ve been thinking about these two distinct ways in which we seem to arrive at new mathematical concepts. Looking at the countless partial information decomposition measures in the literature, all derived/motivated on an axiomatic basis, and not knowing which intuition to prioritize over which, I’ve been putting less premium on axiomatic conceptual definitions than i used to:
decision theoretic justification of probability > Cox’s theorem
shannon entropy as min description length > three information axioms
fernando’s operational definition of synergistic information > rest of the literature with its countless non-operational PID measures
The basis of comparison is usefulness and ease of generalization to better concepts:
at least in the case of fernando’s synergistic information, it seems far more useful because i at least know exactly what i’m getting out of it, unlike having to compare between the axiomatic definitions based on handwavy judgements.
for ease of generalization, the problem with axiomatic definitions is that there are many logically equivalent ways to state the initial axiom (from which they can then be relaxed), and operational motivations seem to ground these equivalent characterizations better, like logical inductors from the decision theoretic view of probability theory
(obviously these two feed into each other)
Just finished the local causal states paper, it’s pretty cool! A couple of thoughts though:
I don’t think the causal states factorize over the dynamical bayes net, unlike the original random variables (by assumption). Shalizi doesn’t claim this either.
This would require proving that each causal state is conditionally independent of its nondescendant causal states given its parents, which is a stronger theorem than what is proved in Theorem 5 (only conditionally independent of its ancestor causal states, not necessarily all the nondescendants)
Also I don’t follow the Markov Field part—how would proving:
if we condition on present neighbors of the patch, as well as the parents of the patch, then we get independence of the states of all points at time t or earlier. (pg 16)
… show that the causal states form a Markov field (aka satisfy the Markov independencies (local or pairwise or global) induced by an undirected graph; the local version is written out after this list)? I’m not even sure what undirected graph the causal states would be Markov with respect to. Is it the …
… skeleton of the dynamical Bayes Net? that would require proving a different theorem: “if we condition on parents and children of the patch, then we get independence of all the other states” which would prove local markov independency
… skeleton of the dynamical Bayes Net + edges for the original graph for each t? that would also require proving a different theorem: “if we condition on present neighbors, parents, and children of the patch, then we get independence of all the other states” which would prove local markov independency
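For reference, the local Markov property I have in mind here (the standard definition, for an undirected graph $G = (V, E)$ with $N(v)$ the neighbors of $v$):

$$X_v \;\perp\!\!\!\perp\; X_{V \setminus (\{v\} \cup N(v))} \;\big|\; X_{N(v)} \quad \text{for all } v \in V$$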
Also for concreteness I think I need to understand its application in detecting coherent structures in cellular automata to better appreciate this construction, though the automata theory part does go a bit over my head :p
a Markov blanket represents a probabilistic fact about the model without any knowledge you possess about values of specific variables, so it doesn’t matter if you actually do know which way the agent chooses to go.
The usual definition of Markov blankets is in terms of the model without any knowledge of the specific values, as you say, but I think in Critch’s formalism this isn’t the case. Specifically, he defines the ‘Markov Boundary’ in the non-abstracted, physics-ish model as a function of the world’s random variable itself, so it can depend on the values the world actually instantiates.
it would just not make sense to try to represent agent boundaries in a physics-ish model if we were to use the usual definition of Markov blankets—the model would just consist of local rules that are spacetime homogeneous, so there is no reason to expect one can a priori carve out an agent from the model without looking at its specific instantiated values.
the boundary function can really be anything, so it doesn’t necessarily have to correspond to physical regions (subsets) of the world’s variables, but it can if we choose to restrict our search for infiltration/exfiltration-criteria-satisfying boundaries to functions that only return boundaries-in-the-sense-of-carving-the-physical-space.
e.g. its output can represent which subset of physical space the boundary is, like 0, 0, 1, 0, 0, … 1, 1, 0
So I think under this definition of Markov blankets, they can be used to denote agent boundaries, even in physics-ish models (i.e. ones that relate nicely to causal relationships). I’d like to know what you think about this.
I thought if one could solve one NP-complete problem then one can solve all of them. But you say that the treewidth doesn’t help at all with the Clique problem. Is the parametrized complexity filtration by treewidth not preserved by equivalence between different NP-complete problems somehow?
All NP-complete problems should have parameters that make the problem polynomial when bounded, trivially so by the (arbitrary NP-complete problem) ⇒ 3-SAT ⇒ Bayes Net translation, and using the treewidth bound.
This isn’t the case for the clique problem (finding the max clique) because it’s not NP-complete (it’s not a decision problem), so we don’t necessarily expect its parameterized version to be polynomially tractable — in fact, it’s the k-clique problem (yes/no: is there a clique of size at least k) that is NP-complete. (so by the above translation argument, there certainly exists some graphical quantity that, when bounded, makes the k-clique problem tractable, though I’m not aware of it, or whether it’s interesting)
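To illustrate the parameterization concretely, here’s a sketch (mine, not from the thread) of the trivial brute-force check for a k-clique, which is polynomial in the graph size for any fixed k even though the exponent grows with k:

```python
from itertools import combinations

def has_k_clique(adj: list[list[bool]], k: int) -> bool:
    """Brute force over all k-subsets: O(C(n, k) * k^2) time.

    For any fixed k this is polynomial in n (k-clique is in XP),
    though the degree of the polynomial grows with the parameter k.
    """
    n = len(adj)
    return any(
        all(adj[u][v] for u, v in combinations(subset, 2))
        for subset in combinations(range(n), k)
    )

# Tiny usage example: a triangle plus an isolated vertex.
adj = [
    [False, True,  True,  False],
    [True,  False, True,  False],
    [True,  True,  False, False],
    [False, False, False, False],
]
assert has_k_clique(adj, 3) and not has_k_clique(adj, 4)
```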
To me, the interesting question is whether:
(1) translating a complexity-bounding parameter from one domain to another leads to quantities that are semantically intuitive and natural in the respective domain.
(2) and whether easy instances of the problems in the latter domain in-practice actually have low values of the translated parameter.
rambling: If not, then that implies we need to search for a new parameter for this new domain. Perhaps this can lead to a whole new way of doing complexity class characterization by finding natural parameters for all sorts of NP-complete problems (whose natural parameters don’t directly translate to one another), and applying these various “difficulty measures” to characterize your NP-complete problem at hand! (I wouldn’t be surprised if this is already widely studied.)
Looking at the 3-SAT example (the Bayes Net encoding where one layer of nodes holds the propositional variables, the next layer computes the OR for each clause, and a final AND node aggregates the clauses through intermediate ANDs):
re: (1), at first glance the treewidth of 3-SAT (clearly dependent on the structure of the variable–clause interaction) doesn’t seem super insightful or intuitive, though we may count the statement “a 3-SAT problem gets exponentially harder as you increase the formula-treewidth” as progress.
… but even that wouldn’t be an if and only if characterization of 3-SAT difficulty, because re: (2) there exist easy-in-practice 3-SAT problems that don’t necessarily have bounded treewidth (i haven’t read the link).
I would be interested in a similar analysis for more NP-complete problems known to have natural parameterized complexity characterization.
You mention treewidth—are there other quantities of similar importance?
I’m not familiar with any, though ChatGPT does give me some examples! copy-pasted below:
Solution Size (k): The size of the solution or subset that we are trying to find. For example, in the k-Vertex Cover problem, k is the maximum size of the vertex cover. If k is small, the problem can be solved more efficiently.
Treewidth (tw): A measure of how “tree-like” a graph is. Many hard graph problems become tractable when restricted to graphs of bounded treewidth. Algorithms that leverage treewidth often use dynamic programming on tree decompositions of the graph.
Pathwidth (pw): Similar to treewidth but more restrictive, pathwidth measures how close a graph is to a path. Problems can be easier to solve on graphs with small pathwidth.
Vertex Cover Number (vc): The size of the smallest vertex cover of the graph. This parameter is often used in graph problems where knowing a small vertex cover can simplify the problem.
Clique Width (cw): A measure of the structural complexity of a graph. Bounded clique width can be used to design efficient algorithms for certain problems.
Max Degree (Δ): The maximum degree of any vertex in the graph. Problems can sometimes be solved more efficiently when the maximum degree is small.
Solution Depth (d): For tree-like or hierarchical structures, the depth of the solution tree or structure can be a useful parameter. This is often used in problems involving recursive or hierarchical decompositions.
Branchwidth (bw): Similar to treewidth, branchwidth is another measure of how a graph can be decomposed. Many algorithms that work with treewidth also apply to branchwidth.
Feedback Vertex Set (fvs): The size of the smallest set of vertices whose removal makes the graph acyclic. Problems can become easier on graphs with a small feedback vertex set.
Feedback Edge Set (fes): Similar to feedback vertex set, but involves removing edges instead of vertices to make the graph acyclic.
Modular Width: A parameter that measures the complexity of the modular decomposition of the graph. This can be used to simplify certain problems.
Distance to Triviality: This measures how many modifications (like deletions or additions) are needed to convert the input into a simpler or more tractable instance. For example, distance to a clique, distance to a forest, or distance to an interval graph.
Parameter for Specific Constraints: Sometimes, specific problems have unique natural parameters, like the number of constraints in a CSP (Constraint Satisfaction Problem), or the number of clauses in a SAT problem.
I like to think of treewidth in terms of its characterization from tree decomposition, a task where you find a clique tree (or junction tree) of an undirected graph.
A clique tree for an undirected graph is a tree such that:
each node of the tree corresponds to a clique of the graph,
each maximal clique of the graph corresponds to a node of the tree,
given two adjacent tree nodes, the cliques they correspond to inside the graph are separated given their intersection set (sepset).
You can check that these properties hold in the example below. I will also refer to nodes of a clique tree as ‘cliques’. (image from here)
My intuition for the point of tree decompositions is that you want to coarsen the variables of a complicated graph so that they can be represented in a simpler form (tree), while ensuring the preservation of useful properties such as:
how the sepset mediates the two cliques in the original graph (i.e. removing the sepset from the graph separates the variables associated with all of the nodes on one side of the tree from those on the other)
clique trees satisfy the running intersection property: if (the cliques corresponding to) two nodes of the tree contain the same variable X, then X is also contained in (the cliques corresponding to) all of the intermediate nodes of the tree.
Intuitively, information flow about a variable must be connected, i.e. it can’t suddenly disappear and reappear in the tree.
Of course tree decompositions aren’t unique (image):
So we define $\mathrm{treewidth}(G) := \min_T \max_{C \in T} |C| - 1$, where $T$ ranges over tree decompositions of $G$ and $C$ over the cliques (nodes) of $T$.
Intuitively, treewidth represents the degree to which nodes of a graph ‘irreducibly interact with one another’, if you really want to see the graph as a tree.
e.g., the first clique tree of the image above is a suboptimal coarse graining in that it misleadingly characterizes the graph as having a greater degree of ‘irreducible interaction’ among its nodes than it actually has.
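A quick way to play with this (a sketch assuming networkx’s approximation module is available; note the min-degree heuristic returns an upper bound on the true treewidth, not necessarily the exact value):

```python
import networkx as nx
from networkx.algorithms import approximation as approx

# A 4-cycle has treewidth 2: it admits no width-1 (i.e. tree-like) decomposition.
G = nx.cycle_graph(4)

# Heuristic tree decomposition; returns (width, decomposition tree).
width, decomp = approx.treewidth_min_degree(G)
print("width upper bound:", width)   # 2 for the 4-cycle
print("bags:", list(decomp.nodes))   # nodes of the decomposition are frozensets

# Running intersection property: bags containing a given vertex form a connected subtree.
bags_with_0 = [bag for bag in decomp.nodes if 0 in bag]
print(nx.is_connected(decomp.subgraph(bags_with_0)))  # True
```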
Bayes Net inference algorithms maintain their efficiency by using dynamic programming over multiple layers.
Level 0: Naive Marginalization
No dynamic programming whatsoever. Just multiply all the conditional probability distribution (CPD) tables, and sum over the variables of non-interest.
Level 1: Variable Elimination
Cache the repeated computations within a query.
For example, given a chain-structured Bayes Net $X_1 \to X_2 \to \cdots \to X_n$, instead of computing $P(X_n) = \sum_{x_1, \ldots, x_{n-1}} P(x_1) P(x_2 \mid x_1) \cdots P(X_n \mid x_{n-1})$ in one big sum, we can compute $P(X_n) = \sum_{x_{n-1}} P(X_n \mid x_{n-1}) \cdots \sum_{x_2} P(x_3 \mid x_2) \sum_{x_1} P(x_2 \mid x_1) P(x_1)$, pushing each sum as far inward as it can go. Check my post for more.
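A minimal numpy sketch of the chain example (my own illustration, with arbitrary random CPDs):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3  # chain X1 -> ... -> X6, each variable taking k values

# p_x1[a] = P(X1 = a); cpds[i][a, b] = P(X_{i+2} = b | X_{i+1} = a)
p_x1 = rng.dirichlet(np.ones(k))
cpds = [rng.dirichlet(np.ones(k), size=k) for _ in range(n - 1)]

# Level 0 (naive): materialize the full joint, O(k^n) time and space.
joint = p_x1
for cpd in cpds:
    joint = joint[..., None] * cpd   # append one axis per new variable
p_xn_naive = joint.reshape(-1, k).sum(axis=0)

# Level 1 (variable elimination): push each sum inward, O(n k^2).
msg = p_x1
for cpd in cpds:
    msg = msg @ cpd   # msg[b] = sum_a msg[a] * P(next = b | prev = a)

assert np.allclose(msg, p_xn_naive)
```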
Level 2: Clique-tree based algorithms — e.g., Sum-product (SP) / Belief-update (BU) calibration algorithms
Cache the repeated computations across queries.
Suppose you have a fixed Bayes Net, and you want to compute the marginalization not only $P(X_n)$, but also $P(X_{n-1})$. Clearly running two instances of Variable Elimination as above is going to contain some overlapping computation.
A clique tree is a data structure where, given the initial factors (in this case the CPD tables), you “calibrate” a tree whose nodes correspond to subsets of the variables. The cost can be amortized by running many queries over the same Bayes Net.
Calibration can be done by just two passes across the tree, after which you have the joint marginals for all the nodes of the clique tree.
Incorporating evidence is equally simple. Just zero-out the entries of variables that you are conditioning on for some node, then “propagate” that information downwards via a single pass across the tree.
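Here is a minimal sketch of what calibration amounts to on the smallest interesting example: the chain $X_1 \to X_2 \to X_3$ with clique tree [X1,X2] -- [X2,X3] and sepset {X2} (my own illustration of the sum-product scheme; all the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
p1 = rng.dirichlet(np.ones(k))            # P(X1)
p2_1 = rng.dirichlet(np.ones(k), size=k)  # p2_1[a, b] = P(X2 = b | X1 = a)
p3_2 = rng.dirichlet(np.ones(k), size=k)  # p3_2[b, c] = P(X3 = c | X2 = b)

# Initial clique potentials: C1 = {X1, X2}, C2 = {X2, X3}, sepset {X2}.
psi1 = p1[:, None] * p2_1   # psi1[a, b] = P(X1 = a, X2 = b)
psi2 = p3_2                 # psi2[b, c] = P(X3 = c | X2 = b)

# Forward pass: message C1 -> C2, summing out everything but the sepset.
m12 = psi1.sum(axis=0)          # P(X2)
beta2 = psi2 * m12[:, None]     # calibrated belief: P(X2, X3)

# Backward pass: message C2 -> C1.
m21 = psi2.sum(axis=1)          # all ones here, since psi2 is a CPD
beta1 = psi1 * m21[None, :]     # calibrated belief: P(X1, X2)

# After the two passes, adjacent cliques agree on their sepset marginal:
assert np.allclose(beta1.sum(axis=0), beta2.sum(axis=1))  # both equal P(X2)
```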
Level 3: Specialized query-set answering algorithms over a calibrated clique tree.
Cache the repeated computations across a certain query-class
e.g., computing the pairwise joint marginal $P(X_i, X_j)$ for every pair of variables can be done by using yet another layer of dynamic programming, maintaining a table for each pair of clique-tree nodes ordered according to their distance in-between.
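Continuing the chain sketch from Level 1 above (reusing p_x1, cpds, n, k, and p_xn_naive from that snippet; on a chain the clique tree is itself a chain, so this is a simplified stand-in for the clique-tree-node version), here is roughly what that extra layer of dynamic programming looks like:

```python
# Single-variable marginals first: one k x k product per chain edge.
marg = [p_x1]
for cpd in cpds:
    marg.append(marg[-1] @ cpd)

# Pairwise marginals P(X_i, X_j) for i < j: each table extends the previous
# one by a single k x k multiply, instead of rerunning elimination from scratch.
pair = {}
for i in range(n):
    table = np.diag(marg[i])           # degenerate "P(X_i, X_i)" table
    for j in range(i + 1, n):
        table = table @ cpds[j - 1]    # extend one edge down the chain
        pair[(i, j)] = table           # now holds P(X_i, X_j)

# Sanity check against the Level 0 result:
assert np.allclose(pair[(0, n - 1)].sum(axis=0), p_xn_naive)
```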
When Are Results from Computational Complexity Not Too Coarse?
Perhaps I should one day in the far far future write a sequence on bayes nets. Some low-effort TOC (this is basically mostly Koller & Friedman):
why bayesnets and markovnets? factorized cognition, how to do efficient bayesian updates in practice, it’s how our brain is probably organized, etc. why would anyone want to study this subject if they’re doing alignment research? explain philosophy behind them.
simple examples of bayes nets. basic factorization theorems (the I-map stuff and separation criterion)
tangent on why bayes nets aren’t causal nets, though Zack M Davis had a good post on this exact topic, comment threads there are high insight
how inference is basically marginalization (basic theorems of: a reduced markov net represents conditioning, thus inference upon conditioning is the same as marginalization on a reduced net)
why is marginalization hard? i.e. NP-completeness of exact and approximate inference worst-case
what is a workaround? solve by hand simple cases in which inference can be greatly simplified by just shuffling the order of sums and products, and realize that the exponential blowup of complexity is dependent on a graphical property of your bayesnet called the treewidth
exact inference algorithms (bounded by treewidth) that can exploit the graph structure and do inference efficiently: sum-product / belief-propagation
approximate inference algorithms (works in even high treewidth! no guarantee of convergence) - loopy belief propagation, variational methods, etc
connections to neuroscience: “the human brain is just doing belief propagation over a bayes net whose variables are the cortical column” or smth, i just know that there is some connection
Just read through Robust agents learn causal world models and man it is really cool! It proves a couple of bona fide selection theorems, talking about the internal structure of agents selected under a certain criterion.
Tl;dr, agents selected to perform robustly under various local interventional distributions must internally represent something isomorphic to a causal model of the variables upstream of utility, in the sense that the agent is capable of answering all causal queries for those variables.
Thm 1: agents achieving optimal policy (util max) across various local interventions must be able to answer causal queries for all variables upstream of the utility node
Thm 2: relaxation of above to nonoptimal policies, relating regret bounds to the accuracy of the reconstructed causal model
the proof is constructive—an algorithm that, when given access to regret-bounded-policy-oracle wrt an environment with some local intervention, queries them appropriately to construct a causal model
one implication is an algorithm for causal inference that converts black box agents to explicit causal models (because, y’know, agents like you and i are literally that aforementioned ‘regret-bounded-policy-oracle’)
These selection theorems could be considered the converse of the well-known statement that given access to a causal model, one can find an optimal policy. (this and its relaxation to approximate causal models is stated in Thm 3)
Thm 1 / 2 is like a ‘causal good regulator‘ theorem.
the gooder regulator theorem is not structural—as in, it gives conditions under which a model of the regulator must be isomorphic to the posterior of the system, which is a black-box statement about input-output behavior.
the theorem is limited: it only applies to cases where the decision node is not upstream of the environment nodes (e.g. classification; a negative example would be an MDP). but the authors claim this is mostly for simpler proofs and they think this can be relaxed.
Thoughtdump on why I’m interested in computational mechanics:
one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. ‘discover’ fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
… but i was initially interested in reading compmech stuff not with a particular alignment relevant thread in mind but rather because it seemed broadly similar in directions to natural abstractions.
re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction on real-world noisy data. CSSR is an example of a reconstruction algorithm. apparently people have done compmech stuff on real-world data; i don’t know how good it is, but effort-wise far less seems invested compared to theory work.
would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc.
tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i’m thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm model reconstructing it? of course it’s gonna be unwieldily large. but, to shift the thread in the direction of bright-eyed theorizing …
the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines and for simple examples where you can analytically do this, you get wild things like coming up with more and more compact representations of stochastic processes (eg data stream → tree → markov model → stack automata → … ?)
this … sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
haha but alas, (almost) no development afaik since the original paper. seems cool
and also more tangentially, compmech seemed to have a lot to talk about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was to learn about them.
re: second diagram in the “Bayesian Belief States For A Hidden Markov Model” section, shouldn’t the transition probabilities for the top left model be 85/7.5/7.5 instead of 90/5/5?
What is the shape predicted by compmech under a generation setting, and do you expect it instead of the fractal shape to show up under, say, a GAN loss? If so, and if their shapes are sufficiently distinct from the controls that are run to make sure the fractals aren’t just a visualization artifact, that would be further evidence in favor of the applicability of compmech in this setup.
If after all that it still sounds completely wack, check the date. Anything from before like 2003 or so is me as a kid, where “kid” is defined as “didn’t find out about heuristics and biases yet”, and sure at that age I was young enough to proclaim AI timelines or whatevs.
https://twitter.com/ESYudkowsky/status/1650180666951352320
btw there’s no input box for the “How much would you pay for each of these?” question.
Discovering Agents provides a genuine causal, interventionist account of agency and an algorithm to detect agents, motivated by the intentional stance. I find this paper very enlightening from a conceptual perspective!
I’ve tried to think of problems that need to be solved before we can actually implement this on real systems—both conceptual and practical—in approximate order of importance.
There are no ‘dynamics,’ no learning. As soon as a mechanism node is edited, it is assumed that agents immediately change their ‘object decision variable’ (a conditional probability distribution given its object parent nodes) to play the subgame equilibria.
Assumption of factorization of variables into ‘object’ / ‘mechanisms,’ and the resulting subjectivity. The paper models the process by which an agent adapts its policy given changes in the mechanism of the environment via a ‘mechanism decision variable’ (that depends on its mechanism parent nodes), which modulates the conditional probability distribution of its child ‘object decision variable’, the actual policy.
For example, the paper says a learned RL policy isn’t an agent, because interventions in the environment won’t make it change its already-learned policy—but that a human or a RL policy together with its training process is an agent, because it can adapt. Is this reasonable?
Say I have a gridworld RL policy that’s learned to get cheese (3 cell world, cheese always on left) by always going to the left. Clearly it can’t change its policy when I change the cheese distribution to favor right, so it seems right to call this not an agent.
Now, say the policy has sensory access to the grid state, and has correctly generalized (despite only being trained on left-cheese) to move in the direction where it sees the cheese, so when I change the cheese distribution, it adapts accordingly. I think it is right to call this an agent?
Now, say the policy is an LLM agent (static weight) on an open world simulation which reasons in-context. I just changed the mechanism of the simulation by lowering the gravity constant, and the agent observes this, reasons in-context, and adapts its sensorimotor policy accordingly. This is clearly an agent?
I think this is because the paper considers, in the case of the RL policy alone, the ‘object policy’ to be the policy of the trained neural network (whose induced policy distribution is definitionally fixed), and the ‘mechanism policy’ to be a trivial delta function assigning the already-trained object policy. And in the case of the RL policy together with its training process, the ‘mechanism policy’ is now defined as the training process that assigns the fully-trained conditional probability distribution to the object policy.
But what if the ‘mechanism policy’ was the in-context learning process by which it induces an ‘object policy’? Then changes in the environment’s mechanism can be related to the ‘mechanism policy’ and thus the ‘object policy’ via in-context learning as in the second and third example, making them count as agents.
Ultimately, the setup in the paper forces us to factorize the means-by-which-policies-adapt into mechanism vs object variables, and the results (like whether a system is to be considered an agent) depends on this factorization. It’s not always clear what the right factorization is, how to discover them from data, or if this is the right frame to think about the problem at all.
Implicit choice of variables that are convenient for agent discovery. The paper does mention that the algorithm is dependent on the choice of variables, as in: if the node corresponding to the ‘actual agent decision’ is missing but its children are there, then the algorithm will label its children as the decision nodes. But this is already a very convenient representation!
Prototypical example: Minecraft world with RL agents interacting represented as a coarse-grained lattice (dynamical Bayes Net?) with each node corresponding to a physical location and its property, like color. Clearly no single node here is an agent, because agents move! My naive guess is that in principle, everything will be labeled an agent.
So the variables of choice must be abstract variables of the underlying substrate, like functions over them. But then, how do you discover the right representation automatically, in a way that interventions in the abstract variable level can faithfully translate to actually performable interventions in the underlying substrate?
Given the causal graph, even the slightest satisfaction of the agency-criterion labels the nodes as decision / utility. There is no “degree-of-agency”—maybe one could get it by summing over the extent to which the independencies fail to be satisfied?
Then different agents are defined as causally separated chunks (~connected component) of [set-of-decision-nodes / set-of-utility-nodes]. How do we accommodate hierarchical agency (like subagents), systems with different degrees of agency, etc?
The interventional distributions on the object/mechanism variables are converted into a causal graph using the obvious [perform-do()-while-fixing-everything-else] algorithm. My impression is that causal discovery doesn’t really work in practice, especially in noisy reality with a large number of variables, via a gazillion conditional independence tests.
The correctness proof requires lots of unrealistic assumptions, e.g., agents always play subgame equilibria, though I think some of this can be relaxed.