Well, imagine we have three boolean random variables. In “general position” there are no independence relations between them, so we can’t say much. Constrain them so two of the variables are independent, that’s a bit less “general”, and we still can’t say much. Constrain some more so the xor of all three variables is always 1, that’s even less “general”, now we can use your method to figure out that the third variable is downstream of the first two. Constrain some more so that some of the probabilities are 1/2, and the method stops working. What I’d like to understand is the intuition, which real world cases have the particular “general position” where the method works.
Ok, makes sense. I think you are just pointing out that when I am saying “general position,” that is relative to a given structure, like FFS or DAG or symmetric FFS.
If you have a probability distribution, it might be well modeled by a DAG, or a weaker condition is that it is well modeled by a FFS, or an even weaker condition is that it is well modeled by a SFFS (symmetric finite factored set).
We have a version of the fundamental theorem for DAGs and d-separation, we have a version of the fundamental theorem for FFS and conditional orthogonality, and we might get a version of the fundamental theorem for SFFS and whatever corresponds to conditional independence in that world.
However, I claim that even if we can extend to a fundamental theorem for SFFS, I still want to think of the independences in a SFFS as having different sources. There are the independences coming from orthogonality, and there are the independences coming from symmetry (or symmetry together with orthogonality).
In this world, orthogonality won’t be as inferable because it will only be a subset of independence, but it will still be an important concept. This is similar to what I think will happen when we go to the infinite dimensional factored sets case.
Can you give some more examples to motivate your method? Like the smoking/tar/cancer example for Pearl’s causality, or Newcomb’s problem and counterfactual mugging for UDT.
Hmm, first I want to point out that the talk here sort of has natural boundaries around inference, but I also want to work in a larger frame that uses FFS for stuff other than inference.
If I focus on the inference question, one of the natural questions is the grue/bleen problem, which I address in the talk.
I think for inference, it makes the most sense to think about FFS relative to Pearl. We have this problem when looking at smoking/tar/cancer: what if we carved the world into variables the wrong way? What if instead of tar/cancer, we had a variable for “How much bad stuff is in your body?” and “What is the ratio of tar to cancer?” The point is that our choice of variables both matters for the Pearlian framework and is not empirically observable. I am trying to do all the good stuff in Pearl without the dependence on the variables.
Indeed, I think the largest crux between FFS and Pearl is something about variable realism. To FFS, there is no realism to a variable beyond its information content, so it doesn’t make sense to have two variables X, X’ with the same information, but different temporal properties. Pearl’s ontology, on the other hand, has these graphs with variables and edges that say “who listens to whom,” which sets us up to be able to have e.g. a copy function from X to X’, and an arrow from X to Y, which makes us want to say X is before Y, but X’ is not.
For the more general uses of FFS, which are not about inference, my answer is something like “the same kind of stuff as Cartesian frames.” e.g. specifying embedded observations. (A partition A observes a subset E relative to a high level world model W if A⊥{E,S∖E} and A⊥W|(S∖E). Notice the first condition is violated by transparent Newcomb, and the second condition is violated by counterfactual mugging. (The symbols here should be read as combinatorial properties, there are no probabilities involved.))
I want to be able to tell stories like the ones in the Saving Time post, where there are abstract versions of things that are temporally related.
Here is a more concrete example of me using FFS the way I intend them to be used outside of the inference problem. (This is one specific application, but maybe it shows how I intend the concepts to be manipulated).
I can give an example of embedded observation maybe, but it will have to come after a more formal definition of observation (This is observation of a variable, rather than the observation of an event above):
Definition: Given a FFS F=(S,B), and A, W, X, which are partitions of S, where X={x1,…,xn}, we say A observes X relative to W if:
1) A⊥X,
2) A can be expressed in the form A=A1∨S⋯∨SAn, and
3) Ai⊥W|(S∖xi) for each i.
(This should all be interpreted combinatorially, not probabilistically.)
The intuition of what is going on here is that to observe an event, you are being promised that you 1) do not change whether the event holds, and 3) do not change anything that matters in the case where that event does not hold. Then, to observe a variable, you can basically 2) split yourself up into different fragments of your policy, where each policy fragment observes a different value of that variable. (This whole thing is very updateless.)
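Since everything here is combinatorial, the definition can be checked by brute force on small examples. Below is a minimal Python sketch, under my own representational assumptions: S is the full product of its factors (so a factor is just a coordinate index), partitions are labeling functions, and the two conditions inside conditional_history are my reading of the conditional-history definition; the function names are mine.

```python
from itertools import combinations

def conditional_history(S, n_factors, X, E):
    """Brute-force conditional history h_F(X|E) in a finite factored set.

    S is a list of points, each a tuple of factor values (factor b is
    coordinate b). X is a function from points to labels (a partition).
    E is a set of points of S. Returns the smallest H (a set of factor
    indices) such that:
      (1) on E, agreeing on the factors in H determines X, and
      (2) combining the H-coordinates of one point of E with the
          remaining coordinates of another point of E stays inside E.
    """
    for size in range(n_factors + 1):            # smallest valid H wins
        for H in combinations(range(n_factors), size):
            ok1 = all(X(s) == X(t)
                      for s in E for t in E
                      if all(s[b] == t[b] for b in H))
            ok2 = all(tuple(s[b] if b in H else t[b]
                            for b in range(n_factors)) in E
                      for s in E for t in E)
            if ok1 and ok2:
                return set(H)

def orthogonal(S, n_factors, X, Y, E):
    """X is orthogonal to Y given E iff their conditional histories are disjoint."""
    return conditional_history(S, n_factors, X, E).isdisjoint(
        conditional_history(S, n_factors, Y, E))
```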
Example 1: (non-observation)
An agent A={L,R} does not observe a coinflip X={H,T}, and chooses to raise either his left or right hand. Our FFS F=(S,B) is given by S=A×X, and B={A,X}. (I am abusing notation here slightly by conflating A with the partition you get on A×X by projecting onto the A coordinate.) Then W is the discrete partition on A×X.
In this example, we do not have observation. Proof: A only has two parts, so if we express A as a common refinement of 2 partitions, at least one of these two partitions must be equal to A. However, A is not orthogonal to W given H and A is not orthogonal to W given T. (hF(A|H)=hF(W|H)=hF(A|T)=hF(W|T)={A}). Thus we must violate condition 3.
Example 2: (observation)
An agent A={LL,LR,RL,RR} does observe a coinflip X={H,T}, and chooses to raise either his left or right hand. We can think of A as actually choosing a policy that is a function from X to {L,R}, where the two-character strings in the parts of A are the response to H followed by the response to T.
Our FFS F=(S,B) is given by S=X×AH×AT, and B={X,AH,AT}, where AH={LH,RH} represents what the agent would do upon seeing heads, and AT={LT,RT} represents what the agent would do upon seeing tails. A=AH∨SAT. We also have a partition representing what the agent actually does, Y={L,R}, where L and R are each four element sets in the obvious way. We will then say W=X∨SY, so W does not get to see what A would have done; it only gets to see the coin flip and what A actually did.
Now I will prove that A observes X relative to W in this example. First, hF(A)={AH,AT}, and hF(X)={X}, so we get the first condition, A⊥X. We will break up A for condition 2 in the obvious way set up in the problem, so it suffices now to show that AH⊥W|T (and it will follow symmetrically that AT⊥W|H).
I’m not going to go through the details, but hF(AH|T)={AH}, while hF(W|T)={AT}, which are disjoint. The important thing here is that W doesn’t care about AH in worlds in which T holds.
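For concreteness, here is how both examples run through the sketch above; the coordinate encodings and variable names are my own.

```python
# Example 1: S = A x X, factor 0 is A, factor 1 is X.
S1 = [(a, x) for a in "LR" for x in "HT"]
A1 = lambda s: s[0]                        # the agent's action
W1 = lambda s: s                           # discrete partition: W sees everything
T1 = {s for s in S1 if s[1] == "T"}        # the event "coin came up tails"
print(conditional_history(S1, 2, A1, T1))  # {0}
print(conditional_history(S1, 2, W1, T1))  # {0}  (both histories are {A})
print(orthogonal(S1, 2, A1, W1, T1))       # False: condition 3 fails

# Example 2: S = X x AH x AT; factor 0 is the coin, 1 is AH, 2 is AT.
S2 = [(x, ah, at) for x in "HT" for ah in "LR" for at in "LR"]
X2 = lambda s: s[0]
AH = lambda s: s[1]
A2 = lambda s: (s[1], s[2])                    # A = AH join AT
Y2 = lambda s: s[1] if s[0] == "H" else s[2]   # what the agent actually does
W2 = lambda s: (s[0], Y2(s))                   # W = X join Y
T2 = {s for s in S2 if s[0] == "T"}
print(orthogonal(S2, 3, A2, X2, set(S2)))   # True: condition 1, A ⊥ X
print(conditional_history(S2, 3, AH, T2))   # {1}, i.e. {AH}
print(conditional_history(S2, 3, W2, T2))   # {2}, i.e. {AT}
print(orthogonal(S2, 3, AH, W2, T2))        # True: AH ⊥ W | T, condition 3
```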
Discussion:
So largely I am sharing this to give an example of how you can manipulate FFS combinatorially, and how you can use this to say things that you might otherwise want to say using graphs. Granted, you could also say the above things using graphs, but now you can say more things, because you are not restricted to the nodes you chose: you can ask the same combinatorial question about any of the other partitions. The benefit is largely about not being dependent on our choice of variables.
It is interesting to try to translate this definition of observation to transparent Newcomb or counterfactual mugging, and see how some of the orthogonalities are violated, and thus it does not count as an observation.
I’ll try. My way of thinking doesn’t use the examples, so I have to generate them for communication.
I’m confused about what necessary work the factorisation is doing in these temporal examples: in your example A and B are independent and C is related to both, so the only assignment of “upstream/downstream” relations that makes sense is that C is downstream of both.
Is the idea that factorisation is what carves your massive set of possible worlds up into these variables in the first place? I feel like I’m in a weird position where the math makes sense but I’m missing the motivational intuition for why we want to switch to this framework in the first place.
I am not sure what you are asking (indeed, I am not sure if you are responding to me or to cousin_it).
One thing that I think is going on is that I use “factorization” in two places. Once when I say Pearl is using factorization data, and once where I say we are inferring a FFS. I think this is a coincidence. “Factorization” is just a really general and useful concept.
So the carving into A and B and C is a factorization of the world into variables, but it is not the kind of factorization that shows up in the FFS, because disjoint factors should be independent in the FFS.
As for why to switch to this framework, the main reason (to me) is that it has many of the advantages of Pearl while also being able to talk about some variables being coarse abstract versions of other variables. This is largely because I am interested in embedded agency applications.
Another reason is that we can’t tell a compelling story about where the variables came from in the Pearlian story. Another reason is that sometimes we can infer time where Pearl cannot.
What would such a distribution look like? The version where X XOR Y is independent of both X and Y makes sense but I’m struggling to envisage a case where it’s independent of only 1 variable.
It looks like X and V are independent binary variables with different probabilities in general position, and Y is defined to be X XOR V. (and thus V=X XOR Y).
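Here is a minimal numerical check of this construction; the values p = 0.3 and q = 0.7 are arbitrary stand-ins for probabilities in general position.

```python
from itertools import product

p, q = 0.3, 0.7          # general position: neither probability is 1/2

def joint(x, y):
    """P(X=x, Y=y) where Y = X XOR V, with X ~ Bern(p), V ~ Bern(q) independent."""
    v = x ^ y            # V = X XOR Y
    return (p if x else 1 - p) * (q if v else 1 - q)

def indep(f, g):
    """Check whether f(X, Y) and g(X, Y) are independent under the joint above."""
    pts = list(product([0, 1], repeat=2))
    pf = lambda a: sum(joint(x, y) for x, y in pts if f(x, y) == a)
    pg = lambda b: sum(joint(x, y) for x, y in pts if g(x, y) == b)
    return all(abs(sum(joint(x, y) for x, y in pts
                       if f(x, y) == a and g(x, y) == b) - pf(a) * pg(b)) < 1e-12
               for a in (0, 1) for b in (0, 1))

X = lambda x, y: x
Y = lambda x, y: y
V = lambda x, y: x ^ y   # X XOR Y
print(indep(X, V))       # True:  X XOR Y is independent of X
print(indep(Y, V))       # False: X XOR Y is not independent of Y (it would be iff p = 1/2)
print(indep(X, Y))       # False: X and Y are dependent (they would be independent iff q = 1/2)
```

Setting p = 1/2 flips the second check to True, which is the degenerate case mentioned above where the method stops working.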
I don’t understand what conspiracy is required here.
X being orthogonal to X XOR Y implies X is before Y, we don’t get the converse.