Tomorrow can be brighter than today
Although the night is cold
The stars may seem so very far away
But courage, hope and reason burn
In every mind, each lesson learned
Shining light to guide our way
Make tomorrow brighter than today
Dalcy
The Metaphysical Structure of Pearl’s Theory of Time
Epistemic status: metaphysics
I was reading Factored Space Models (previously, Finite Factored Sets) and was trying to understand in what sense it was a Theory of Time.
Scott Garrabrant says “[The Pearlian Theory of Time] … is the best thing to happen to our understanding of time since Einstein”. I read Pearl’s book on Causality[1], and while there’s math, this metaphysical connection that Scott seems to make isn’t really explicated. Timeless Causality and Timeless Physics is the only place I saw this view explained explicitly, but not at the level of math / language used in Pearl’s book.
Here is my attempt at explicitly writing down what all of these views are pointing at (in a more rigorous language)—the core of the Pearlian Theory of Time, and in what sense FSM shares the same structure.
Causality leaves a shadow of conditional independence relationships over the observational distribution. Here’s an explanation providing the core intuition:
Suppose you represent the ground truth structure of [causality / determination] of the world via a Structural Causal Model over some variables, a very reasonable choice. Then, as you go down the Pearlian Rung: SCM →[2] Causal Bayes Net →[3] Bayes Net, theorems guarantee that the Bayes Net is still Markovian wrt the observational distribution.
(Read Timeless Causality for an intuitive example.)
Causal Discovery then (at least in this example) reduces to inferring the equation assignment directions of the SCM, given only the observational distribution.
The earlier result guarantees that all you have to do is find a Bayes Net that is Markovian wrt the observational distribution. Alongside the faithfulness assumption, this thus reduces to finding a Bayes Net structure G whose set of independencies (implied by d-separation) is identical to that of P (or, finding the Perfect Map of a distribution[4]).
Then, at least some of the edges of the Perfect Map will have their directions nailed down by the conditional independence relations.
The metaphysical claim is that this direction is the definition of time[5], morally so, based on the intuition provided by the example above.
So, the Pearlian Theory of Time is the claim that Time is the partial order over the variables of a Bayes Net corresponding to the perfect map of a distribution.
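The orientation-from-independence story can be made concrete with a toy joint distribution. This is my own stdlib-only sketch (the variable names and noise levels are made up): the collider A → C ← B leaves a conditional-independence signature (parents marginally independent, dependent given the child) that no chain can reproduce, which is exactly how some edge directions get nailed down.

```python
from itertools import product

# Toy demo: CI patterns distinguish a chain A -> B -> C from a
# collider A -> C <- B, pinning down edge directions ("statistical time").

def joint_chain():
    # A -> B -> C with a 10% bit-flip on each edge
    p = {}
    for a, na, nb in product([0, 1], repeat=3):
        b, pr = a ^ na, 0.5 * (0.9 if na == 0 else 0.1)
        c = b ^ nb
        pr *= 0.9 if nb == 0 else 0.1
        p[(a, b, c)] = p.get((a, b, c), 0.0) + pr
    return p

def joint_collider():
    # A -> C <- B with C = A xor B, A and B fair independent coins
    return {(a, b, a ^ b): 0.25 for a, b in product([0, 1], repeat=2)}

def marg(p, idxs):
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in idxs)
        out[key] = out.get(key, 0.0) + v
    return out

def indep(p, i, j, cond=()):
    # exact check of X_i _||_ X_j | X_cond on a finite joint distribution:
    # P(i,j,c) * P(c) == P(i,c) * P(j,c) for all values
    pij = marg(p, (i, j) + cond)
    pi, pj, pc = marg(p, (i,) + cond), marg(p, (j,) + cond), marg(p, cond)
    for (x, y, *z), v in pij.items():
        z = tuple(z)
        if abs(v * pc.get(z, 0.0) - pi[(x,) + z] * pj[(y,) + z]) > 1e-12:
            return False
    return True

chain, collider = joint_chain(), joint_collider()
print(indep(chain, 0, 2, cond=(1,)))     # chain: A _||_ C | B   -> True
print(indep(collider, 0, 1))             # collider: A _||_ B    -> True
print(indep(collider, 0, 1, cond=(2,)))  # ... but not given C   -> False
```

The third line is the key asymmetry: conditioning on a common effect *creates* dependence between its parents, and no re-orientation of the arrows reproduces that signature.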
Abstracting away, the structure of any Theory of Time is then to:
find a mathematical structure [in the Pearlian Theory of Time, a Bayes Net]
… that has gadgets [d-separation]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [parenthood relationship of nodes in a Bayes Net]
… while the order induced from the gadget coincides with that of d-separation [trivially so here, because we’re talking about Bayes Nets and d-separation], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
This is exactly what Factored Space Model does:
find a mathematical structure [Factored Space Model]
… that has gadgets [structural independence]
… that are, in some sense, “equivalent” [soundness & completeness] to the conditional independence relations of the distribution the structure is modeling
… while containing a notion of order [preorder relation induced by the subset relationship of the History]
… while the order induced from the gadget coincides with that of d-separation [by a theorem of FSM], such that it captures the earlier example which provided the core intuition behind our Theory of Time.
while, additionally, generalizing the scope of our Theory of Time from [variables that appear in the Bayes Net] to [any variables defined over the factored space].
… thus justifying calling FSM a Theory of Time in the same spirit that Pearlian Causal Discovery is a Theory of Time.
- ^
Chapter 2, specifically, which is about Causal Discovery. All the other chapters are mostly irrelevant for this purpose.
- ^
By (1) making a graph with edge directions corresponding to the equation assignment directions, (2) pushing forward uncertainties to the endogenous variables, and (3) letting interventional distributions be defined by the truncated factorization formula.
- ^
By (1) forgetting the causal semantics, i.e. no longer associating the graph with all the interventional distributions, only with the no-intervention observational distribution.
- ^
This shortform answers this question I had.
- ^
Pearl comes very close. In his Temporal Bias Conjecture (2.8.2):
“In most natural phenomena, the physical time coincides with at least one statistical time.”
(where statistical time refers to the aforementioned direction.)
But he doesn’t go as far as saying that this ought to be the definition of Time.
The grinding inevitability is not a pressure on you from the outside, but a pressure from you, towards the world. This type of determination is the feeling of being an agent with desires and preferences. You are the unstoppable force, moving towards the things you care about, not because you have to but simply because that’s what it means to care.
I think this is probably one of my favorite quotes of all time. I translated it to Korean (with somewhat major stylistic changes) with the help of ChatGPT:
의지(意志)라 함은,
하나의 인간으로서,
멈출 수 없는 힘으로 자신이 중요히 여기는 것들을 향해 나아가는 것.

이를 따르는 갈아붙이는 듯한 필연성은,
외부에서 자신을 압박하는 힘이 아닌,
스스로가 세상을 향해 내보내는 압력임을.

해야 해서가 아니라,
단지 그것이 무언가를 소중히 여긴다는 뜻이기 때문에.

(Rough back-translation: “Will is, as a human being, moving with an unstoppable force toward the things one holds dear. The grinding inevitability that follows is not a force pressing in from the outside, but a pressure one sends out toward the world. Not because one has to, but simply because that is what it means to cherish something.”)
The K-complexity of a function is the length of its shortest code. But having many many codes is another way to be simple! Example: gauge symmetries in physics. Correcting for length-weighted code frequency, we get an empirically better simplicity measure: cross-entropy.
[this] is a well-known notion in algorithmic information theory, and differs from K-complexity by at most a constant
Epistemic status: literal shower thoughts, perhaps obvious in retrospect, but was a small insight to me.
I’ve been thinking about: “what proof strategies could prove structural selection theorems, and not just behavioral selection theorems?”
Typical examples of selection theorems in my mind are: coherence theorems, good regulator theorem, causal good regulator theorem.
Coherence theorem: Given an agent satisfying some axioms, we can observe their behavior in various conditions and construct a utility function $U$, and then the agent’s behavior is equivalent to a system that is maximizing $U$.
Says nothing about whether the agent internally constructs $U$ and uses it.
(Little Less Silly version of the) Good regulator theorem: A regulator $R$ that minimizes the entropy of a system variable $S$ (where there is an environment variable $X$ upstream of both $R$ and $S$) without unnecessary noise (hence deterministic) is behaviorally equivalent to a deterministic function of $S$ (despite being a function of $X$).
Says nothing about whether $R$ actually internally reconstructs $S$ and uses it to produce its output.
Causal good regulator theorem (summary): Given an agent achieving low regret across various environment perturbations, we can observe its behavior in specific perturbed environments and construct a model $M$ that is very similar to the true environment $E$. Then argue: “hence the agent must have something internally isomorphic to $E$”. Which is true, but …
says nothing about whether the agent actually uses those internal isomorphic-to-$E$ structures in the causal history of computing its output.
And I got stuck here wondering, man, how do I ever prove anything structural.
Then I considered some theorems that, if you squint really really hard, could also be framed in the selection theorem language in a very broad sense:
SLT: Systems selected to get low loss are likely to be in a degenerate part of the loss landscape.[1]
Says something about structure: by assuming the system to be a parameterized statistical model, it says the parameters satisfy certain conditions like degeneracy (which further implies e.g., modularity).
This made me realize that to prove selection theorems on structural properties of agents, you should obviously give more mathematical structure to the “agent” in the first place:
SLT represents a system as a parameterized function—very rich!
In the coherence theorems, the agent is just a single node that outputs a decision given lotteries. In the good regulator theorem and the causal good regulator theorem, the agent is literally just a single node in a Bayes Net—very impoverished!
And recall, we actually have an agent foundations style selection theorem that does prove something structural about agent internals by giving more mathematical structure to the agent:
Gooder regulator theorem: A regulator is now two nodes instead of one, but the latter-in-time node gets additional information about the choice of “game” it is being played against (thus the former node acts as a sort of information bottleneck). Then, given that the regulator makes the outcome variable take minimum entropy, the first node must be isomorphic to the likelihood function.
This does say something about structure, namely that an agent (satisfying certain conditions) with an internal information bottleneck (structural assumption) must have that bottleneck be behaviorally equivalent to a likelihood function, whose output is then connected to the second node. Thus it is valid to claim that (under our structural assumption) the agent internally reconstructs the likelihood values and uses it in its computation of the output.
So in short, we need more initial structure or even assumptions on our “agent,” at least more so than literally a single node in a Bayes Net, to expect to be able to prove something structural.
Here is my 5-minute attempt to put more such “structure” to the [agent/decision node] in the Causal good regulator theorem with the hopes that this would make the theorem more structural, and perhaps end up as a formalization of the Agent-like Structure Problem (for World Models, at least), or very similarly the Approximate Causal Mirror hypothesis:
Similar setup to the Causal good regulator theorem, but instead of a single node representing an agent’s decision node, assume that the agent as a whole is represented by an unknown causal graph $G_A$, with a number of nodes designated as input and output, connected to the rest-of-the-world causal graph $G_W$. Then claim: Agents with low regret must have a $G_A$ that admits an abstracting causal model map (summary) from $G_W$, and (maybe more structural properties such as) the approximation error should roughly be lowest around the input/output & utility nodes, and increase as you move further away from them in the low-level graph. This would be a very structural claim!
- ^
I’m being very very [imprecise/almost misleading] here—because I’m just trying to make a high-level point and the details don’t matter too much—one of the caveats (among many) being that this statement makes the theoretically yet unjustified connection between SGD and Bayes.
“I always remember, [Hamming] would come into my office and try to solve a problem [...] I had a very big blackboard, and he’d start on one side, write down some integral, say, ‘I ain’t afraid of nothin’, and start working on it. So, now, when I start a big problem, I say, ‘I ain’t afraid of nothin’, and dive into it.”
The question is whether this expression is easy to compute or not, and fortunately the answer is that it’s quite easy! We can evaluate the first term by the simple Monte Carlo method of drawing many independent samples from $q$ and evaluating the empirical average, as we know the distribution $q$ explicitly and it was presumably chosen to be easy to draw samples from.
My question when reading this was: why can’t we say the same thing about $p(z)$? i.e. draw many independent samples and evaluate the empirical average? Usually $p(z)$ is also assumed known and simple to sample from (e.g., gaussian).
So far, my answer is:
$p(z \mid x) \propto p(x \mid z)\,p(z)$, so assuming $x$ is my data, usually $p(x \mid z)$ will be high when $p(z \mid x)$ is high, so the samples during MCMC will be big enough to contribute to the sum, unlike blindly sampling from $p(z)$ where most samples will contribute nearly 0 to the sum.
Also, another reason being how the expectation can be reduced to the sum of expectations over each of the dimensions of $z$ if $p$ and $q$ factorize nicely.
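The point about blind sampling from $p(z)$ can be made concrete with a 1-D toy. This is my own made-up setup (normal prior, narrow normal likelihood, observation far in the tail): the naive prior-sample estimate of the evidence has enormous variance because almost every sample lands where the likelihood is ~0, while a proposal placed near the posterior works fine.

```python
import math, random

random.seed(0)

# Toy: estimate the evidence p(x) = E_{p(z)}[p(x|z)]
# with prior z ~ N(0,1), likelihood x|z ~ N(z, 0.1^2), observed x = 4.

def normal_pdf(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

x, n = 4.0, 10_000
# closed form: p(x) = N(x; 0, sqrt(1 + 0.1^2)) for this conjugate setup
true_evidence = normal_pdf(x, 0.0, math.sqrt(1 + 0.1 ** 2))

# naive MC: sample z from the prior; almost all terms are ~0
naive = sum(normal_pdf(x, random.gauss(0, 1), 0.1) for _ in range(n)) / n

# importance sampling with a proposal near the posterior: q(z) = N(x, 0.5^2)
def imp_term():
    z = random.gauss(x, 0.5)
    w = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, x, 0.5)  # weight p(z)/q(z)
    return w * normal_pdf(x, z, 0.1)

importance = sum(imp_term() for _ in range(n)) / n

print(f"true       {true_evidence:.3e}")
print(f"naive MC   {naive:.3e}")       # unstable: dominated by rare tail samples
print(f"importance {importance:.3e}")  # close to the true value
```

Both estimators are unbiased; the difference is entirely variance, which is the practical content of “most samples contribute nearly 0 to the sum.”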
Is there a way to convert a LessWrong sequence into a single pdf? Should ideally preserve comments, latex, footnotes, etc.
Formalizing selection theorems for abstractability
Tl;dr: systems are abstractable to the extent they admit an abstracting causal model map with low approximation error. This should yield a Pareto frontier of high-level causal models consisting of different tradeoffs between complexity and approximation error. Then, try to prove a selection theorem for abstractability / modularity by relating the form of this curve and a proposed selection criterion.
Recall, an abstracting causal model (ACM)—exact transformations, $\tau$-abstractions, and approximations—is a map between two structural causal models satisfying certain requirements that lets us reasonably say one is an abstraction, or a high-level causal model, of another.
Broadly speaking, the condition is a sort of causal consistency requirement. It’s a commuting diagram that requires the “high-level” interventions to be consistent with various “low-level” ways of implementing that intervention. Approximation errors talk about how well the diagram commutes (given that the support of the variables in the high-level causal model is equipped with some metric)
Now consider a curve: the x-axis is the node count, and the y-axis is the minimum approximation error of ACMs of the original system with that node count (subject to some conditions[1]). It would hopefully be a decreasing one[2].
This curve would represent the abstractability of a system: the lower the curve, the more abstractable it is.
Aside: we may conjecture that natural systems will have discrete jumps, corresponding to natural modules. The intuition being that, eg if we have a physics model of two groups of humans interacting, in some sense 2 nodes (each node representing the human-group) and 4 nodes (each node representing the individual-human) are the most natural, and 3 nodes aren’t (perhaps the 2 node system with a degenerate node doing ~nothing, so it would have very similar approximation scores with the 2 node case).
Then, try hard to prove a selection theorem of the following form: given a low-level causal model satisfying certain criteria (e.g., low regret over varying objectives, connection costs), the abstractability curve gets pushed further downwards. Or conversely, find conditions that make this true.
I don’t know how to prove this[3], but at least this gets closer to a well-defined mathematical problem.
- ^
I’ve been thinking about this for an hour now and finding the right definition here seems a bit non-trivial. Obviously there’s going to be an ACM of zero approximation error for any node count, just have a single node that is the joint of all the low-level nodes. Then the support would be massive, so a constraint on it may be appropriate.
Or instead we could fold it in to the x-axis—if there is perhaps a non ad-hoc, natural complexity measure for Bayes Nets that capture [high node counts ⇒ high complexity because each nodes represent stable causal mechanisms of the system, aka modules] and [high support size ⇒ high complexity because we don’t want modules that are “contrived” in some sense] as special cases, then we could use this as the x-axis instead of just node count.
Immediate answer: Restrict this whole setup into a prediction setting so that we can do model selection. Require on top of causal consistency that both the low-level and high-level causal model have a single node whose predictive distribution are similar. Now we can talk about eg the RLCT of a Bayes Net. I don’t know if this makes sense. Need to think more.
- ^
Or rather, find the appropriate setup to make this a decreasing curve.
- ^
I suspect closely studying the robust agents learn causal world models paper would be fruitful, since they also prove a selection theorem over causal models. Their strategy is to (1) develop an algorithm that queries an agent with low regret to construct a causal model, (2) prove that this yields an approximately correct causal model of the data generating model, and (3) argue that this implies the agent must internally represent something isomorphic to a causal world model.
I don’t know if this is just me, but it took me an embarrassingly long time in my mathematical education to realize that the following three terminologies, which introductory textbooks used interchangeably without being explicit, mean the same thing. (Maybe this is just because English is my second language?)
X ⇒ Y means X is sufficient for Y means X only if Y
X ⇐ Y means X is necessary for Y means X if Y
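A quick truth-table check (my own sanity test, not from the original note) that the phrasings really do line up: “X ⇒ Y”, “X is sufficient for Y”, and “X only if Y” all rule out the single case (X and not Y), while the reversed forms rule out (Y and not X).

```python
# Verify the terminology equivalences over all four truth assignments.

def implies(p, q):
    return (not p) or q

for x in (False, True):
    for y in (False, True):
        # "X => Y" == "X only if Y": X cannot hold without Y
        assert implies(x, y) == (not (x and not y))
        # "X <= Y" == "X if Y": whenever Y holds, X holds
        assert implies(y, x) == (not (y and not x))

print("all equivalences check out")
```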
I’d also love to have access!
Any thoughts on how to customize LessWrong to make it LessAddictive? I just really, really like the editor for various reasons, so I usually write a bunch (drafts, research notes, study notes, etc) using it but it’s quite easy to get distracted.
(the causal incentives paper convinced me to read it, thank you! good book so far)
if you read Sutton & Barto, it might be clearer to you how narrow are the circumstances under which ‘reward is not the optimization target’, and why they are not applicable to most AI things right now or in the foreseeable future
Can you explain this part a bit more?
My understanding of situations in which ‘reward is not the optimization target’ is when the assumptions of the policy improvement theorem don’t hold. In particular, the theorem (that iterating the policy improvement step must yield strictly better policies, converging at the optimal, reward-maximizing policy) assumes that at each step we’re updating the policy by greedy one-step lookahead (by argmaxing the action via $\pi'(s) = \arg\max_a q_\pi(s, a)$).
And this basically doesn’t hold irl because realistic RL agents aren’t forced to explore all states (the classic example: “I can explore the state of doing cocaine, and I’m sure my policy will drastically change in a way that my reward circuit considers an improvement, but I don’t have to do that”). So my opinion that the circumstances under which ‘reward is the optimization target’ are very narrow remains unchanged, and I’m interested in why you believe otherwise.
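For contrast, here is a minimal sketch (my own toy MDP, hypothetical numbers) of a setting where the policy improvement theorem’s assumptions *do* hold: every state is swept, and each step is a greedy one-step lookahead, so iterating provably converges to the reward-maximizing policy.

```python
# Tiny deterministic MDP: states 0,1,2 on a line; entering state 2 pays
# reward 1 and is terminal. Full state sweeps + greedy lookahead converge
# to the optimal (reward-maximizing) policy.

GAMMA = 0.9
STATES = [0, 1, 2]            # state 2 is terminal
ACTIONS = ["left", "right"]

def step(s, a):
    s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 2 and s != 2 else 0.0)

def evaluate(policy, iters=200):
    # iterative policy evaluation; terminal value stays 0
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):
        for s in STATES[:-1]:
            s2, r = step(s, policy[s])
            v[s] = r + GAMMA * v[s2]
    return v

def improve(policy):
    v = evaluate(policy)
    # greedy one-step lookahead: argmax_a [ r + gamma * V(s') ]
    return {s: max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
            for s in STATES[:-1]}

policy = {0: "left", 1: "left"}   # start from a bad policy
for _ in range(5):
    policy = improve(policy)
print(policy)  # {0: 'right', 1: 'right'}
```

The comment’s point is that real agents never get this setup: they aren’t forced to evaluate every state (e.g. the cocaine state), so the convergence-to-reward-maximization guarantee simply doesn’t apply.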
I think something in the style of abstracting causal models would make this work—defining a high-level causal model such that there is a map from the states of the low-level causal model to it, in a way that’s consistent with mapping low-level interventions to high-level interventions. Then you can retain the notion of causality to non-low-level-physical variables with that variable being a (potentially complicated) function of potentially all of the low-level variables.
[Question] Money Pump Arguments assume Memoryless Agents. Isn’t this Unrealistic?
Unidimensional Continuity of Preference Assumption of “Resources”?
tl;dr, the unidimensional continuity of preference assumption in the money pumping argument used to justify the VNM axioms corresponds to the assumption that there exists some unidimensional “resource” that the agent cares about, and this language is provided by the notion of “souring / sweetening” a lottery.
Various coherence theorems—or more specifically, various money pumping arguments—generally have the following form:
If you violate this principle, then [you are rationally required] / [it is rationally permissible for you] to follow this trade that results in you throwing away resources. Thus, for you to avoid behaving pareto-suboptimally by throwing away resources, it is justifiable to call this principle a ‘principle of rationality,’ which you must follow.
… where “resources” (the usual example is money) are something that, apparently, these theorems assume exist. They do, but this fact is often stated in a very implicit way. Let me explain.
In the process of justifying the VNM axioms using money pumping arguments, the three main mathematical primitives are: (1) lotteries (probability distributions over outcomes), (2) a preference relation (a general binary relation), and (3) a notion of souring/sweetening of a lottery. Let me explain what (3) means.
A souring of $A$ is denoted $A^-$, and a sweetening of $A$ is denoted $A^+$.
$A^-$ is to be interpreted as “basically identical with $A$ but strictly inferior in a single dimension that the agent cares about.” Based on this interpretation, we assume $A \succ A^-$. Sweetening is the opposite, defined in the obvious way.
Formally, souring could be thought of as introducing a new preference relation $\succ_{\text{sour}}$, where $A \succ_{\text{sour}} B$ is to be interpreted as “lottery $B$ is basically identical to lottery $A$, but strictly inferior in a single dimension that the agent cares about”.
On the syntactic level, such $B$ is denoted as $A^-$.
On the semantic level, based on the above interpretation, $\succ_{\text{sour}}$ is related to $\succ$ via the following: $A \succ_{\text{sour}} B \implies A \succ B$.
This is where the language to talk about resources comes from. “Something you can independently vary alongside a lottery $A$ such that more of it makes you prefer that option compared to $A$ alone” sounds like what we’d intuitively call a resource[1].
Now that we have the language, notice that so far we haven’t assumed sourings or sweetenings exist. The following assumption does it:
Unidimensional Continuity of Preference: If $X \succ Y$, then there exists a prospect $X^-$ such that 1) $X^-$ is a souring of $X$ and 2) $X^- \succ Y$.
Which gives a more operational characterization of souring as something that lets us interpolate between the preference margins of two lotteries—intuitively satisfied by e.g., money due to its infinite divisibility.
So the above assumption is where the assumption of resources comes into play. I’m not aware of any money pump arguments for this assumption or, more generally, for the existence of a “resource.” Plausibly instrumental convergence.
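The money reading of this can be sketched in a few lines. This is my own toy construction (utility linear in money, which is an assumption, not part of the argument above): preferences over (lottery, money) pairs, a souring that removes $\varepsilon$ of money, and continuity realized by interpolating the preference margin.

```python
# Toy model: prospects are (lottery, money); preference = expected utility,
# with money entering linearly. A "souring" X^- subtracts eps money;
# unidimensional continuity: if X > Y, some souring of X still beats Y.

def utility(lottery, money):
    # lottery: list of (probability, outcome_utility) pairs
    return sum(p * u for p, u in lottery) + money

def prefers(a, b):
    return utility(*a) > utility(*b)

X = ([(1.0, 10.0)], 0.0)              # sure outcome worth 10 utils
Y = ([(0.5, 4.0), (0.5, 8.0)], 0.0)   # expected 6 utils
assert prefers(X, Y)

# interpolate the preference margin to find a valid souring X^-
eps = 0.5 * (utility(*X) - utility(*Y))
X_sour = (X[0], -eps)                 # same lottery, eps fewer dollars

print(prefers(X, X_sour), prefers(X_sour, Y))  # True True
```

Infinite divisibility of money is what guarantees a suitable `eps` always exists, which is exactly the “unidimensional resource” the assumption smuggles in.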
- ^
I don’t actually think this + the assumption below fully capture what we intuitively mean by “resources”, enough to justify this terminology. I stuck with “resources” anyways because others around here used that term to (I think?) refer to what I’m describing here.
Yeah I’d like to know if there’s a unified way of thinking about information theoretic quantities and causal quantities, though a quick literature search doesn’t show up anything interesting. My guess is that we’d want separate boundary metrics for informational separation and causal separation.
I no longer think the setup above is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions.
(Note: I am thinking as I’m writing, so this might be a bit rambly.)
The world-trajectory distribution is ambiguous.
Intuition: Why does a robust glider in Lenia intuitively feel like a system possessing a boundary? Well, I imagine various situations that happen in the world (like bullets) and this pattern mostly stays stable in the face of them.
Now, notice that the measure of infiltration/exfiltration depends on $P$, a distribution over world histories.
So, for the above measure to capture my intuition, the approximate Markov condition (operationalized by low infil & exfil) must consider world states that contain the Lenia pattern avoiding bullets.
Remember, $W$ is the raw world state, no coarse graining. So $P$ is the distribution over the raw world trajectory. It already captures all the “potentially occurring trajectories under which the system may take boundary-preserving-action.” Since everything is observed, our distribution already encodes all of “Nature’s Intervention.” So in some sense Critch’s definition is already causal (in a very trivial sense), by virtue of requiring a distribution over the raw world trajectory, despite mentioning no Pearlian Causality.
Issue: Choice of $P$
Maybe there is some canonical true $P$ for our physical world that minds can intersubjectively arrive at, so there’s no ambiguity.
But when I imagine trying to implement this scheme on Lenia, there’s immediately an ambiguity as to which distribution $P$ (representing my epistemic state on which raw world trajectories will “actually happen”) we should choose:
Perhaps a very simple distribution: assigning uniform probability over world trajectories where the world contains nothing but the glider moving in a random direction with some initial point offset.
I suspect many stances other than the one factorizing the world into gliders would have low infil/exfil, because the world is so simple. This is the case of “accidental boundary-ness.”
Perhaps something more complicated: various trajectories where e.g., the Lenia pattern encounters bullets, evolves alongside various other patterns, etc.
This I think rules out “accidental boundary-ness.”
I think the latter works. But now there’s a subjective choice of the distribution: which set of possible/realistic “Nature’s Interventions”—all the situations that can ever be encountered by the system under which it has boundary-like behaviors—we want to implicitly encode into our observational distribution. I don’t think it’s natural to assign much probability to a trajectory whose initial conditions are set in a very precise way such that everything decays into noise. But this feels quite subjective.
Hints toward a solution: Causality
I think the discussion above hints at a very crucial insight:
$P$ must arise as a consequence of the stable mechanisms in the world.
Suppose the world of Lenia contains various stable mechanisms like a gun that shoots bullets at random directions, scarce food sources, etc.
We want to describe distributions that the boundary system will “actually” experience in some sense. I want the “Lenia pattern dodges bullet” world trajectory to be considered, because there is a plausible mechanism in the world that can cause such trajectories to exist. For similar reasons, I think the empty world distributions are impoverished, and a distribution containing trajectories where the entire world decays into noise is bad because no mechanism can implement it.
Thus, unless you have a canonical choice of $P$, a better starting point would be to consider the abstract causal model that encodes the stable mechanisms in the world, and use Discovering Agents-style interventional algorithms that operationalize the notion “boundaries causally separate environment and viscera.”
Well, because of everything mentioned above on how the causal model informs us on which trajectories are realistic, especially in the absence of a canonical $P$. It’s also far more efficient, because knowledge of the mechanisms informs the algorithm of the precise interventions to query the world with, instead of having to implicitly bake them into $P$.
There are still a lot more questions, but I think this is a pretty clarifying answer as to how Critch’s boundaries are limiting and why DA-style causal methods will be important.
I think it’s plausible that the general concept of boundaries can possibly be characterized somewhat independently of preferences, but at the same time have boundary-preservation be a quality that agents mostly satisfy (discussion here; very unsure about this). I see Critch’s definition as a first iteration of an operationalization for boundaries in the general, somewhat-preference-independent sense.
But I do agree that ultimately all of this should tie back to game theory. I find Discovering Agents most promising in this regard, though there are still a lot of problems—some of which I suspect might be easier to solve if we treat systems-with-high-boundaryness as a sort of primitive for the kind-of-thing that we can associate agency and preferences with in the first place.
EDIT: I no longer think this setup is viable, for reasons that connect to why I think Critch’s operationalization is incomplete and why boundaries should ultimately be grounded in Pearlian Causality and interventions. Check update.
I believe there’s nothing much in the way of actually implementing an approximation of Critch’s boundaries[1] using deep learning.
Recall, Critch’s boundaries are:
Given a world (a Markovian stochastic process) $W$, map its values (a vector) bijectively using $f$ into ‘features’ that can be split into four vectors, each representing a boundary-possessing system’s Viscera, Active Boundary, Passive Boundary, and Environment.
Then, we characterize boundary-ness (i.e. minimal information flow across features unmediated by the boundary) using two mutual information criteria, representing infiltration and exfiltration of information respectively.
And a policy $\pi$ of the boundary-possessing system (under the ‘stance’ of viewing the world implied by $f$) can be viewed as a stochastic map (that has no infiltration/exfiltration by definition) that best approximates the true dynamics.
The interpretation here (under low exfiltration and infiltration) is that $\pi$ can be viewed as a policy taken by the system in order to perpetuate its boundary-ness into the future and continue being well-described as a boundary-possessing system.
All of this seems easily implementable using very basic techniques from deep learning!
The bijective feature map $f$ is implemented using two NN maps (one each way), with an autoencoder loss.
Mutual information is approximated with standard variational approximations. Optimize $f$ to minimize it.
(the interpretation here being—we’re optimizing our ‘stance’ towards the world in a way that best views the world as a boundary-possessing system)
After you train your ‘stance’ $f$ using the above setup, learn the policy $\pi$ using an NN with standard SGD, with $f$ fixed.
A very basic experiment would look something like:
Test the above setup on two cellular automata (e.g., GoL, Lenia, etc) systems, one containing just random ash, and the other some boundary-like structure like noise-resistant glider structures found via optimization (there are a lot of such examples in the Lenia literature).[2]
Then (1) check if the infiltration/exfiltration values are lower for the latter system, and (2) do some interp to see if the V/A/P/E features or the learned policy NN have any interesting structures.
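To make step (1) concrete: the infiltration/exfiltration criteria are mutual-information quantities between feature groups, and on a small discrete toy you can compute them with a plug-in estimator rather than a variational NN bound. This is my own illustrative stand-in (the “shielded” and “leaky” streams are made up), showing the quantity the trained stance would be minimizing.

```python
import math
from collections import Counter

# Plug-in mutual information between two discrete feature streams --
# the kind of quantity the infiltration/exfiltration criteria measure.
# (Real setups over NN features would use variational MI bounds instead.)

def mutual_information(pairs):
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log2((c * n) / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# "boundary-like": viscera shielded from environment -> zero MI
shielded = [(i % 2, (i // 2) % 2) for i in range(400)]
# "leaky": environment copied straight into viscera -> 1 bit of MI
leaky = [((i // 2) % 2, (i // 2) % 2) for i in range(400)]

print(round(mutual_information(shielded), 3))  # 0.0
print(round(mutual_information(leaky), 3))     # 1.0
```

A boundary-possessing system, under the right stance, should look like the first stream: what crosses between environment and viscera is mediated by the boundary features, not flowing directly.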
I’m not sure if I’d be working on this any time soon, but posting the idea here just in case people have feedback.
- ^
I think research on boundaries—both conceptual work and developing practical algorithms for approximating them & schemes involving them—are quite important for alignment for reasons discussed earlier in my shortform.
- ^
Ultimately we want our setup to detect boundaries that aren’t just physically contiguous chunks of matter, like informational boundaries, so we want to make sure our algorithm isn’t just always exploiting basic locality heuristics.
I can’t think of a good toy testbed (ideas appreciated!), but one easy thing to try is to just destroy all locality by mapping the automata lattice (which we were feeding as input) with the output of a complicated fixed bijective map over it, so that our system will have to learn locality if it turns out to be a useful notion in its attempt at viewing the system as a boundary.
The critical insight is that this is not always the case!
Let’s call two graphs I-equivalent if their sets of independencies (implied by d-separation) are identical. A theorem about Bayes Nets says that two graphs are I-equivalent if and only if they have the same skeleton and the same set of immoralities.
This last constraint, plus the constraint that the graph must be acyclic, allows some arrow directions to be identified—namely, across all I-equivalent graphs that are the perfect map of a distribution, some of the edges have identical directions assigned to them.
The IC algorithm (Verma & Pearl, 1990) for finding perfect maps (hence temporal direction) is exactly about exploiting these conditions to orient as many of the edges as possible:
More intuitively, (Verma & Pearl, 1992) and (Meek, 1995) together show that the following four rules are necessary and sufficient operations to maximally orient the graph according to the I-equivalence (+ acyclicity) constraint:
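The skeleton-plus-immoralities criterion is easy to check mechanically. Here is a small sketch of my own (stdlib only, three-node examples): the chain and the fork share both skeleton and immoralities, so their arrow directions cannot be recovered from the distribution, while the collider’s immorality leaves an orientable signature.

```python
from itertools import combinations

# Two DAGs are I-equivalent iff they have the same skeleton and the same
# immoralities (v-structures a -> c <- b with a, b non-adjacent).

def skeleton(dag):
    return {frozenset(e) for e in dag}

def immoralities(dag):
    parents = {}
    for a, b in dag:
        parents.setdefault(b, set()).add(a)
    skel = skeleton(dag)
    return {(frozenset({a, b}), child)
            for child, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset({a, b}) not in skel}   # parents must be non-adjacent

def i_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain    = [("A", "B"), ("B", "C")]   # A -> B -> C
fork     = [("B", "A"), ("B", "C")]   # A <- B -> C
collider = [("A", "B"), ("C", "B")]   # A -> B <- C

print(i_equivalent(chain, fork))      # True: direction not identifiable
print(i_equivalent(chain, collider))  # False: the collider is identifiable
```

The IC algorithm’s edge-orientation phase is, at heart, a repeated application of this check plus the acyclicity constraint.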
Anyone interested in further detail should consult Pearl’s Causality, Ch. 2. Note that, for some reason, Ch. 2 is the only chapter in the book where Pearl discusses Causal Discovery (i.e. inferring time from the observational distribution); the rest of the book is about Causal Inference (i.e. inferring causal effects from a (partially) known causal structure).