The following is a critique of the idea of logical counterfactuals. The idea of logical counterfactuals has appeared in previous agent foundations research (especially at MIRI): here, here. “Impossible possible worlds” have been considered elsewhere in the literature; see the SEP article for a summary.
I will start by motivating the problem, which also gives an account of what a logical counterfactual is meant to be.
Suppose you learn about physics and find that you are a robot. You learn that your source code is “A”. You also believe that you have free will; in particular, you may decide to take either action X or action Y. In fact, you take action X. Later, you simulate “A” and find, unsurprisingly, that when you give it the observations you saw up to deciding to take action X or Y, it outputs action X. However, you, at the time, had the sense that you could have taken action Y instead. You want to be consistent with your past self, so you want to, at this later time, believe that you could have taken action Y at the time. If you could have taken Y, then you do take Y in some possible world (which still satisfies the same laws of physics). In this possible world, it is the case that “A” returns Y upon being given those same observations. But, the output of “A” when given those observations is a fixed computation, so you now need to reason about a possible world that is logically incoherent, given your knowledge that “A” in fact returns X. This possible world is, then, a logical counterfactual: a “possible world” that is logically incoherent.
To summarize: a logical counterfactual is a notion of “what would have happened” had you taken a different action after seeing your source code, and in that “what would have happened”, the source code must output a different action than what you actually took; hence, this “what would have happened” world is logically incoherent.
It is easy to see that this idea of logical counterfactuals is unsatisfactory. For one, no good account of them has yet been given. For two, there is a sense in which no account could be given; reasoning about logically incoherent worlds can only be so extensive before running into logical contradiction.
To extensively refute the idea, it is necessary to provide an alternative account of the motivating problem(s) which dispenses with the idea. Even if logical counterfactuals are unsatisfactory, the motivating problem(s) remain.
I now present two alternative accounts: counterfactual nonrealism, and policy-dependent source code.
Counterfactual nonrealism
According to counterfactual nonrealism, there is no fact of the matter about what “would have happened” had a different action been taken. There is, simply, the sequence of actions you take, and the sequence of observations you get. At the time of taking an action, you are uncertain about what that action is; hence, from your perspective, there are multiple possibilities.
Given this uncertainty, you may consider material conditionals: if I take action X, will consequence Q necessarily follow? An action may be selected on the basis of these conditionals, such as by determining which action results in the highest guaranteed expected utility if that action is taken.
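As a minimal sketch of selecting an action from material conditionals, the code below assumes the agent's uncertainty can be written as a finite list of worlds it still considers possible. The names `World`, `guaranteed_utility`, and `choose` are hypothetical, and the sketch uses guaranteed lower bounds rather than full expected utilities to keep it short.

```python
from typing import List, Tuple

# Hypothetical representation: each world the agent still considers possible
# is a pair (action taken in that world, utility obtained in that world).
World = Tuple[str, float]


def implies(p: bool, q: bool) -> bool:
    """Material conditional: false only when p is true and q is false."""
    return (not p) or q


def guaranteed_utility(worlds: List[World], action: str) -> float:
    """Largest bound u such that 'I take `action` -> utility >= u' holds in every possible world."""
    candidate_bounds = sorted({utility for _, utility in worlds})
    holding = [
        u for u in candidate_bounds
        if all(implies(taken == action, utility >= u) for taken, utility in worlds)
    ]
    return max(holding)


def choose(worlds: List[World], actions: List[str]) -> str:
    """Pick the action whose material conditionals guarantee the most utility."""
    return max(actions, key=lambda a: guaranteed_utility(worlds, a))


# Illustrative numbers only: two possible worlds where X is taken, one where Y is taken.
worlds = [("X", 10.0), ("X", 8.0), ("Y", 5.0)]
print(guaranteed_utility(worlds, "X"), guaranteed_utility(worlds, "Y"))
print(choose(worlds, ["X", "Y"]))
```

Note that if an action appears in no possible world, every conditional about it holds vacuously, so this rule would assign it the highest bound on offer. That is one way to see why certainty about one's own action causes trouble for this kind of agent.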
This is basically the approach taken in my post on subjective implication decision theory. It is also the approach taken by proof-based UDT.
The material conditionals are ephemeral, in that at a later time, the agent will know that they could only have taken a certain action (assuming they knew their source code before taking the action), due to having had longer to think by then; hence, all the original material conditionals will be vacuously true. The apparent nondeterminism is, then, only due to the epistemic limitation of the agent at the time of making the decision, a limitation not faced by a later version of the agent (or an outside agent) with more computation power.
This leads to a sort of relativism: what is undetermined from one perspective may be determined from another. This makes global accounting difficult: it’s hard for one agent to evaluate whether another agent’s action is any good, because the two agents have different epistemic states, resulting in different judgments on material conditionals.
A problem that comes up is that of “spurious counterfactuals” (analyzed in the linked paper on proof-based UDT). An agent may become sure of its own action before that action is taken. Upon being sure of that action, the agent will know the material implication that, if they take a different action, something terrible will happen (this material implication is vacuously true). Hence the agent may take the action they were sure they would take, making the original certainty self-fulfilling. (There are technical details with how the agent becomes certain having to do with Löb’s theorem).
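The vacuous-truth step can be illustrated with a small sketch. It assumes, purely for illustration, that the agent's knowledge is a finite list of possible worlds, and that certainty about taking X means no world with action Y remains. The helper `knows_conditional` is hypothetical, and the sketch does not model the Löbian reasoning by which the agent becomes certain.

```python
from typing import Any, Callable, Dict, List

World = Dict[str, Any]


def knows_conditional(
    worlds: List[World],
    antecedent: Callable[[World], bool],
    consequent: Callable[[World], bool],
) -> bool:
    """True if 'antecedent -> consequent' holds in every world the agent considers possible."""
    return all((not antecedent(w)) or consequent(w) for w in worlds)


# Assumption for illustration: the agent has become certain it takes X,
# so no world in which it takes Y remains epistemically possible.
worlds_after_certainty = [{"action": "X", "utility": 5}]


def takes_y(w: World) -> bool:
    return w["action"] == "Y"


y_is_terrible = knows_conditional(
    worlds_after_certainty, takes_y, lambda w: w["utility"] <= -1000
)
y_is_wonderful = knows_conditional(
    worlds_after_certainty, takes_y, lambda w: w["utility"] >= 1000
)

# Both conditionals hold vacuously, since no possible world has action Y.
print(y_is_terrible, y_is_wonderful)
```

Because "if I take Y, the result is terrible" and "if I take Y, the result is wonderful" are both known, the conditionals carry no real information about Y; an agent that happens to act on the first one confirms its prior certainty.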
The most natural decision theory resulting from this framework is timeless decision theory (rather than updateless decision theory). This is because the agent updates on what they know about the world so far, and considers the material implications of themselves taking a certain action; these implications include logical implications if the agent knows their source code. Note that timeless decision theory is dynamically inconsistent in the counterfactual mugging problem.
Policy-dependent source code
A second approach is to assert that one’s source code depends on one’s entire policy, rather than only one’s actions up to seeing one’s source code.
Formally, a policy is a function mapping an observation history to an action. It is distinct from source code, in that the source code specifies the implementation of the policy in some programming language, rather than itself being a policy function.
Logically, it is impossible for the same source code to generate two different policies. There is a fact of the matter about what action the source code outputs given an observation history (assuming the program halts). Hence there is no way for two different policies to be compatible with the same source code.
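The distinction between source code and policy can be sketched as follows. The three source strings and the helper `policy_of` are hypothetical, and the policy is only tabulated over a small finite set of observation histories.

```python
from typing import Any, Dict, List, Tuple

History = Tuple[str, ...]

# Hypothetical source codes, written as Python text for illustration.
SOURCE_A = "def act(history):\n    return 'X'\n"
SOURCE_A_VARIANT = "def act(history):\n    choice = 'X'\n    return choice\n"
SOURCE_B = "def act(history):\n    return 'Y' if history else 'X'\n"

HISTORIES: List[History] = [(), ("saw source code",)]


def policy_of(source: str, histories: List[History]) -> Dict[History, str]:
    """Run the source code and tabulate the policy (history -> action) it implements."""
    namespace: Dict[str, Any] = {}
    exec(source, namespace)  # defines `act` inside the namespace
    act = namespace["act"]
    return {h: act(h) for h in histories}


# Different source code can implement the same policy...
print(policy_of(SOURCE_A, HISTORIES) == policy_of(SOURCE_A_VARIANT, HISTORIES))
# ...but one fixed source code always yields one policy.
print(policy_of(SOURCE_A, HISTORIES) == policy_of(SOURCE_A, HISTORIES))
# A different policy requires different source code.
print(policy_of(SOURCE_A, HISTORIES) == policy_of(SOURCE_B, HISTORIES))
```

The mapping from source code to policy is many-to-one: several programs may implement one policy, but no program implements two.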
Let’s return to the robot thought experiment and re-analyze it in light of this. After the robot has seen that their source code is “A” and taken action X, the robot considers what would have happened if they had taken action Y instead. However, if they had taken action Y instead, then their policy would, trivially, have to be different from their actual policy, which takes action X. Hence, their source code would be different. Hence, they would not have seen that their source code is “A”.
Instead, if the agent were to take action Y upon seeing that their source code is “A”, their source code must be something else, perhaps “B”. Hence, which action the agent would have taken depends directly on their policy’s behavior upon seeing that the source code is “B”, and indirectly on the entire policy (as source code depends on policy).
We see, then, that the original thought experiment encodes a reasoning error. The later agent wants to ask what would have happened if they had taken a different action after knowing their source code; however, the agent neglects that such a policy change would have resulted in seeing different source code! Hence, there is no need to posit a logically incoherent possible world.
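A toy version of this re-analysis, under loudly stated assumptions: policies map the source-code label the agent sees to an action, and a hypothetical world model `SOURCE_OF_POLICY` says which source code implements each policy. The labels "B" and "C" and the particular assignment are invented for illustration; the only constraint taken from the text is that distinct policies get distinct source code.

```python
from typing import Dict, List, Tuple

# A policy maps the source-code label the agent sees to an action.
Policy = Tuple[Tuple[str, str], ...]  # ((label_seen, action), ...)


def make_policy(on_a: str, on_b: str, on_c: str) -> Policy:
    return (("A", on_a), ("B", on_b), ("C", on_c))


ACTUAL = make_policy("X", "X", "X")
ALWAYS_Y = make_policy("Y", "Y", "Y")
Y_ONLY_ON_A = make_policy("Y", "X", "X")

# Hypothetical world model: which source code implements which policy.
# Distinct policies must get distinct source code.
SOURCE_OF_POLICY: Dict[Policy, str] = {
    ACTUAL: "A",
    ALWAYS_Y: "B",
    Y_ONLY_ON_A: "C",
}


def outcome(policy: Policy) -> Tuple[str, str]:
    """Return (source label the agent sees, action it then takes) under this policy."""
    seen = SOURCE_OF_POLICY[policy]
    action = dict(policy)[seen]
    return seen, action


outcomes: List[Tuple[str, str]] = [outcome(p) for p in SOURCE_OF_POLICY]
print(outcomes)
# No policy yields the incoherent combination "sees A, takes Y".
print(("A", "Y") in outcomes)
```

In this toy model, a policy that would output Y upon seeing "A" never actually sees "A"; what it does depends on its behavior at the label it does see.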
The reasoning error came about due to using a conventional, linear notion of interactive causality. Intuitively, what you see up to time t depends only on your actions before time t. However, policy-dependent source code breaks this condition. What source code you see that you have depends on your entire policy, not just what actions you took up to seeing your source code. Hence, reasoning under policy-dependent source code requires abandoning linear interactive causality.
The most natural decision theory resulting from this approach is updateless decision theory, rather than timeless decision theory, as it is the entire policy that the counterfactual is on.
Conclusion
Until very recently, my philosophical approach had been counterfactual nonrealism. However, I am now more compelled by policy-dependent source code, after having analyzed it. I believe this approach fixes the main problem of counterfactual nonrealism, namely relativism making global accounting difficult. It also fixes the inherent dynamic inconsistency problems that TDT has relative to UDT (which are related to the relativism).
I believe the re-analysis I have provided of the thought experiment motivating logical counterfactuals is sufficient to refute the original interpretation, and thus to de-motivate logical counterfactuals.
The main problem with policy-dependent source code is that, since it violates linear interactive causality, analysis is correspondingly more difficult. Hence, there is further work to be done in considering simplified environment classes where possible simplifying assumptions (including linear interactive causality) can be made. It is critical, though, that the linear interactive causality assumption not be used in analyzing cases of an agent learning their source code, as this results in logical incoherence.
I too have recently updated (somewhat) away from counterfactual non-realism. I have a lot of stuff I need to work out and write about it.
I seem to have a lot of disagreements with your post.
I don’t think material conditionals are the best way to cash out counterfactual non-realism.
The basic reason I think it’s bad is the happy dance problem. This makes it seem clear that the sentence to condition on should not be obs→act.
If the action can be viewed as a function of observations, conditioning on A(obs)=act makes sense. But this is sort of like already having counterfactuals, or at least, being realist that there are counterfactuals about what the agent would do if it observed different things. So this response can be seen as abandoning counterfactual non-realism.
A different approach is to consider conditional beliefs rather than material implications. I think this is more true to counterfactual non-realism. In the simplest form, this means you just condition on actions (rather than trying to condition on something like obs→act or A(obs)=act). However, in order to reason updatelessly, you need something like conditioning on conditionals, which complicates matters.
Another reason to think it’s bad is Troll Bridge.
Again if the agent thinks there are basic counterfactual facts, A⇒B (required to respect A,A⇒B⊢B but little else—ie entirely determined by subjective beliefs), then the agent can escape Troll Bridge by disagreeing with the relevant inference. But this, of course, rejects the kind of counterfactual non-realism you intend.
To be more in line with counterfactual non-realism, we would like to use conditional probabilities instead. However, conditional probability behaves too much like material implication to block the Troll Bridge argument. However, I believe that there is an account of conditional probability which avoids this by rejecting the ratio analysis of conditional probability—ie Bayes’ definition—and instead regards conditional probability as a basic entity. (Along the lines of what Alan Hájek goes on and on about.) Thus an EDT-like procedure can be immune to both 5-and-10 and Troll Bridge. (I claim.)
As for policy-dependent source code, I find myself quite unsympathetic to this view.
If the agent is updateful, this is just saying that in counterfactuals where the agent does something else, it might have different source code. Which seems fine, but does it really solve anything? Why is this much better than counterfactuals which keep the source code fixed but imagine the execution trace being different? This seems to only push the rough spots further back—there can still be contradictions, e.g. between the source code and the process by which programmers wrote the source code. Do you imagine it is possible to entirely remove such rough spots from the counterfactuals?
So it seems you intend the agent to be updateless instead. But then we have all the usual issues with logical updatelessness. If the agent is logically updateless, there is absolutely no reason to think that its beliefs about the connections between source code and actual policy behavior is any good. Making those connections requires actual reasoning, not simply a good enough prior—which means being logically updateful. So it’s unclear what to do.
Perhaps logically-updateful policy-dependent-source-code is the most reasonable version of the idea. But then we are faced with the usual questions about spurious counterfactuals, chicken rule, exploration, and Troll Bridge. So we still have to make choices about those things.
In the happy dance problem, when the agent is considering doing a happy dance, the agent should have already updated on M. This is more like timeless decision theory than updateless decision theory.
Conditioning on ‘A(obs) = act’ is still a conditional, not a counterfactual. The difference between conditionals and counterfactuals is the difference between “If Oswald didn’t kill Kennedy, then someone else did” and “If Oswald didn’t kill Kennedy, then someone else would have”.
Indeed, troll bridge will present a problem for “playing chicken” approaches, which are probably necessary in counterfactual nonrealism.
For policy-dependent source code, I intend for the agent to be logically updateful, while updateless about observations.
Because it doesn’t lead to logical incoherence, so reasoning about counterfactuals doesn’t have to be limited.
If you see your source code is B instead of A, you should anticipate learning that the programmers programmed B instead of A, which means something was different in the process. So the counterfactual has implications backwards in physical time.
At some point it will ground out in: different indexical facts, different laws of physics, different initial conditions, different random events...
This theory isn’t worked out yet but it doesn’t yet seem that it will run into logical incoherence, the way logical counterfactuals do.
Maybe some of these.
Spurious counterfactuals require getting a proof of “I will take action X”. The proof proceeds by showing “source code A outputs action X”. But an agent who accepts policy-dependent source code will believe they have source code other than A if they don’t take action X. So the spurious proof doesn’t prevent the counterfactual from being evaluated.
Chicken rule is hence unnecessary.
Exploration is a matter of whether the world model is any good; the world model may, for example, map a policy to a distribution of expected observations. (That is, the world model already has policy counterfactuals as part of it; theories such as physics provide constraints on the world model rather than fully determining it). Learning a good world model is of course a problem in any approach.
Whether troll bridge is a problem depends on how the source code counterfactual is evaluated. Indeed, many ways of running this counterfactual (e.g. inserting special cases into the source code) are “stupid” and could be punished in a troll bridge problem.
I by no means think “policy-dependent source code” is presently a well worked-out theory; the advantage relative to logical counterfactuals is that in the latter case, there is a strong theoretical obstacle to ever having a well worked-out theory, namely logical incoherence of the counterfactuals. Hence, coming up with a theory of policy-dependent source code seems more likely to succeed than coming up with a theory of logical counterfactuals.
I’m not sure how you are thinking about this. It seems to me like this will imply really radical changes to the universe. Suppose the agent is choosing between a left path and a right path. Its actual programming will go left. It has to come up with alternate programming which would make it go right, in order to consider that scenario. The most probable universe in which its programming would make it go right is potentially really different from our own. In particular, it is a universe where it would go right despite everything it has observed, a lifetime of (updateless) learning, which in the real universe, has taught it that it should go left in situations like this.
EG, perhaps it has faced an iterated 5&10 problem, where left always yields 10. It has to consider alternate selves who, faced with that history, go right.
It just seems implausible that thinking about universes like that will result in systematically good decisions. In the iterated 5&10 example, perhaps universes where its programming fails iterated 5&10 are universes where iterated 5&10 is an exceedingly unlikely situation; so in fact, the reward for going right is quite unlikely to be 5, and very likely to be 100. Then the AI would choose to go right.
Obviously, this is not necessarily how you are thinking about it at all—as you said, you haven’t given an actual decision procedure. But the idea of considering only really consistent counterfactual worlds seems quite problematic.
I agree this is a problem, but isn’t this a problem for logical counterfactual approaches as well? Isn’t it also weird for a known fixed optimizer source code to produce a different result on this decision where it’s obvious that ‘left’ is the best decision?
If you assume that the agent chose ‘right’, it’s more reasonable to think it’s because it’s not a pure optimizer than that a pure optimizer would have chosen ‘right’, in my view.
If you form the intent to, as a policy, go ‘right’ on the 100th turn, you should anticipate learning that your source code is not the code of a pure optimizer.
I’m left with the feeling that you don’t see the problem I’m pointing at.
My concern is that the most plausible world where you aren’t a pure optimizer might look very very different, and whether this very very different world looks better or worse than the normal-looking world does not seem very relevant to the current decision.
Consider the “special exception selves” you mention: the Nth exception-self has a hard-coded exception “go right if it’s been at least N turns and you’ve gone right at most 1/N of the time”.
Now let’s suppose that the worlds which give rise to exception-selves are a bit wild. That is to say, the rewards in those worlds have pretty high variance. So a significant fraction of them have quite high reward—let’s just say 10% of them have value much higher than is achievable in the real world.
So we expect that by around N=10, there will be an exception-self living in a world that looks really good.
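The arithmetic behind the N=10 estimate can be checked with a short sketch. It assumes, beyond what is stated above, that each exception-self's world is independently "really good" with probability 0.1; the function names are hypothetical.

```python
def expected_good_worlds(n: int, p_good: float = 0.1) -> float:
    """Expected number of really-good worlds among the first n exception-selves."""
    return n * p_good


def prob_at_least_one_good(n: int, p_good: float = 0.1) -> float:
    """Probability that at least one of the first n exception-worlds is really good,
    assuming the worlds are independent (an assumption made for this sketch)."""
    return 1.0 - (1.0 - p_good) ** n


for n in (1, 5, 10, 20):
    print(n, expected_good_worlds(n), round(prob_at_least_one_good(n), 3))
```

Under these assumptions the expected number of really-good exception-worlds reaches 1 at N=10, and the chance that at least one exists by then is roughly 65%.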
This suggests to me that the policy-dependent-source agent cannot learn to go left > 90% of the time, because once it crosses that threshold, the exception-self in the really good looking world is ready to trigger its exception, so going right starts to appear really good. The agent goes right until it is under the threshold again.
If that’s true, then it seems to me rather bad: the agent ends up repeatedly going right in a situation where it should be able to learn to go left easily. Its reason for repeatedly going right? There is one enticing world, which looks much like the real world, except that in that world the agent definitely goes right. Because that agent is a lucky agent who gets a lot of utility, the actual agent has decided to copy its behavior exactly—anything else would prove the real agent unlucky, which would be sad.
Of course, this outcome is far from obvious; I’m playing fast and loose with how this sort of agent might reason.
I think it’s worth examining more closely what it means to be “not a pure optimizer”. Formally, a VNM utility function is a rationalization of a coherent policy. Say that you have some idea about what your utility function is, U. Suppose you then decide to follow a policy that does not maximize U. Logically, it follows that U is not really your utility function; either your policy doesn’t coherently maximize any utility function, or it maximizes some other utility function. (Because the utility function is, by definition, a rationalization of the policy)
Failing to disambiguate these two notions of “the agent’s utility function” is a map-territory error.
Decision theories require, as input, a utility function to maximize, and output a policy. If a decision theory is adopted by an agent who is using it to determine their policy (rather than already knowing their policy), then they are operating on some preliminary idea about what their utility function is. Their “actual” utility function is dependent on their policy; it need not match up with their idea.
So, it is very much possible for an agent who is operating on an idea U of their utility function, to evaluate counterfactuals in which their true behavioral utility function is not U. Indeed, this is implied by the fact that utility functions are rationalizations of policies.
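A deliberately simplified sketch of "a utility function rationalizes a policy" follows. It uses deterministic choices rather than VNM lotteries, and the names `rationalizes`, `u_idea`, and `u_other` are hypothetical; it only illustrates that a policy departing from the agent's idea U cannot have U as its behavioral utility function.

```python
from typing import Callable, Dict, List

Policy = Dict[str, str]  # situation -> chosen option
Utility = Callable[[str, str], float]  # (situation, option) -> utility


def rationalizes(utility: Utility, policy: Policy, options: List[str]) -> bool:
    """True if, in every situation, the policy's choice has maximal utility among the options."""
    return all(
        utility(s, policy[s]) == max(utility(s, o) for o in options)
        for s in policy
    )


OPTIONS = ["left", "right"]
TURNS = [f"turn {i}" for i in range(1, 11)]


def u_idea(situation: str, option: str) -> float:
    """The agent's idea U of its utility function: turning left is always better."""
    return 1.0 if option == "left" else 0.0


def u_other(situation: str, option: str) -> float:
    """A different utility function: prefers right on turn 10, left otherwise."""
    return 1.0 if (option == "right") == (situation == "turn 10") else 0.0


always_left: Policy = {t: "left" for t in TURNS}
right_on_tenth: Policy = {**always_left, "turn 10": "right"}

print(rationalizes(u_idea, always_left, OPTIONS))      # U fits the always-left policy
print(rationalizes(u_idea, right_on_tenth, OPTIONS))   # U does not fit the deviating policy
print(rationalizes(u_other, right_on_tenth, OPTIONS))  # some other utility function does
```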
Let’s look at the “turn left/right” example. The agent is operating on a utility function idea U, which is higher the more the agent turns left. When they evaluate the policy of turning “right” on the 10th time, they must conclude that, in this hypothetical, either (a) “right” maximizes U, (b) they are maximizing some utility function other than U, or (c) they aren’t a maximizer at all.
The logical counterfactual framework says the answer is (a): that the fixed computation of U-maximization results in turning right, not left. But, this is actually the weirdest of the three worlds. It is hard to imagine ways that “right” maximizes U, whereas it is easy to imagine that the agent is maximizing a utility function other than U, or is not a maximizer.
Yes, the (b) and (c) worlds may be weird in a problematic way. However, it is hard to imagine these being nearly as weird as (a).
One way they could be weird is that an agent having a complex utility function is likely to have been produced by a different process than an agent with a simple utility function. So the more weird exceptional decisions you make, the greater the evidence is that you were produced by the sort of process that produces complex utility functions.
This is pretty similar to the smoking lesion problem, then. I expect that policy-dependent source code will have a lot in common with EDT, as they both consider “what sort of agent I am” to be a consequence of one’s policy. (However, as you’ve pointed out, there are important complications with the framing of the smoking lesion problem)
I think further disambiguation on this could benefit from re-analyzing the smoking lesion problem (or a similar problem), but I’m not sure if I have the right set of concepts for this yet.
OK, all of that made sense to me. I find the direction more plausible than when I first read your post, although it still seems like it’ll fall to the problem I sketched.
I both like and hate that it treats logical uncertainty in a radically different way from empirical uncertainty—like, because we have so far failed to find any way to treat the two uniformly (besides being entirely updateful that is); and hate, because it still feels so wrong for the two to be very different.
I still disagree. We need a counterfactual structure in order to consider the agent as a function A(obs). EG, if the agent is a computer program, the function A() would contain all the counterfactual information about what the agent would do if it observed different things. Hence, considering the agent’s computer program as such a function leverages an ontological commitment to those counterfactuals.
To illustrate this, consider counterfactual mugging where we already see that the coin is heads—so, there is nothing we can do, we are at the mercy of our counterfactual partner. But suppose we haven’t yet observed whether Omega gives us the money.
A “real counterfactual” is one which can be true or false independently of whether its condition is met. In this case, if we believe in real counterfactuals, we believe that there is a fact of the matter about what we do in the coin=tails case, even though the coin came up heads. If we don’t believe in real counterfactuals, we instead think only that there is a fact of how Omega is computing “what I would have done if the coin had been tails”—but we do not believe there is any “correct” way for Omega to compute that.
The obs→act representation and the P(act|obs) representation both appear to satisfy this test of non-realism. The first is always true if the observation is false, so, lacks the ability to vary independently of the observation. The second is undefined when the observation is false, which is perhaps even more appealing for the non-realist.
Now consider the A(obs)=act representation. A(tails)=pay can still vary even when we know coin=heads. So it fails this test: it is a realist representation!
Putting something into functional form imputes a causal/counterfactual structure.
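The test can be made concrete with a small sketch comparing the three representations in the heads-observed counterfactual mugging case. The belief table, the action names "wait", "pay", and "refuse", and the two agent functions are assumptions introduced for illustration.

```python
from typing import Dict, Optional, Tuple


def material(obs_holds: bool, act_holds: bool) -> bool:
    """obs -> act as a material conditional."""
    return (not obs_holds) or act_holds


def conditional(joint: Dict[Tuple[str, str], float], obs: str, act: str) -> Optional[float]:
    """P(act | obs) by the ratio formula; None when P(obs) = 0 (undefined)."""
    p_obs = sum(p for (o, _), p in joint.items() if o == obs)
    if p_obs == 0:
        return None
    return joint.get((obs, act), 0.0) / p_obs


# Beliefs after seeing heads: all probability mass is on coin = heads.
beliefs = {("heads", "wait"): 1.0}


# Two hypothetical agents as functions A(obs); they agree on heads, differ on tails.
def payer(coin: str) -> str:
    return "pay" if coin == "tails" else "wait"


def refuser(coin: str) -> str:
    return "refuse" if coin == "tails" else "wait"


# Material conditional: true whatever the action, because coin = tails is false.
print(material(False, True), material(False, False))
# Conditional probability: undefined, because P(tails) = 0.
print(conditional(beliefs, "tails", "pay"))
# Functional form: A(tails) still distinguishes the two agents.
print(payer("tails"), refuser("tails"))
```

Only the functional form carries information that varies independently of the observation, which is the sense in which it is the realist representation.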
This indeed makes sense when “obs” is itself a logical fact. If obs is a sensory input, though, ‘A(obs) = act’ is a logical fact, not a logical counterfactual. (I’m not trying to avoid causal interpretations of source code interpreters here, just logical counterfactuals)
Ahhh ok.
I agree that this gets around the problem, but to me the happy dance problem is still suggestive—it looks like the material conditional is the wrong representation of the thing we want to condition on.
Also—if the agent has already updated on observations, then updating on obs→act is just the same as updating on act. So this difference only matters in the updateless case, where it seems to cause us trouble.
I’ve been doing some work on this topic, and I am seeing two schools of thought on how to deal with the problem of logical contradictions you mention. To explain these, I’ll use an example counterfactual not involving agents and free will. Consider the counterfactual sentence: ‘if the vase had not been broken, the floor would not have been wet’. Now, how can we compute a truth value for this sentence?
School of thought 1 proceeds as follows: we know various facts about the world, like that the vase is broken and that the floor is wet. We also know general facts about vases, breaking, water, and floors. Now we add the extra fact that the vase is not broken to our knowledge base. Based on this extended body of knowledge, we compute the truth value of the claim ‘the floor is not wet’. Clearly, we are dealing with a knowledge base that contains mutually contradictory facts: the vase is both broken and not broken. Under normal mathematical systems of reasoning, this will allow us to prove any claim we like: the truth value of any sentence becomes 1, which is not what we want. Now, school 1 tries to solve this by coming up with new systems of reasoning that are tolerant of such internal contradictions: systems that produce only the ‘obviously true’ conclusions, or that derive the ‘obviously true’ conclusions before deriving the ‘obviously false’ ones, or that compute probabilistic truth values in such a way that those of the ‘obviously true’ conclusions are higher. In MIRI terminology, I believe this approach goes under the heading ‘decision theory’. I also interpret the two alternative solutions you mention above as following this school of thought. Personally, I find this solution approach not very promising or compelling.
School of thought 2, which includes Pearl’s version of counterfactual reasoning, says that if you want to reason (or if you want a machine to reason) in a counterfactual way, you should not just add facts to the body of knowledge you use. You need to delete or edit other facts in the knowledge base too, before you supply it to the reasoning engine, exactly to avoid inputting a knowledge base that has internal contradictions. For example, if you want to reason about ‘if the vase had not been broken’, one thing you definitely need to do is to first remove the statement (or any information leading to the conclusion that) ‘the vase is broken’ from the knowledge base that goes into your reasoning engine. You have to do this even though the fact that the vase is broken is obviously true for the current world you are in.
So school 2 avoids the problem of having to somehow build a reasoning engine that does the right thing even when a contradictory knowledge base is input. But it trades this for the problem of having to decide exactly what edits will be made to the knowledge base to eliminate the possibility of having such contradictions. In other words, if you want a machine to reason in a counterfactual way, you have to make choices about the specific edits you will make. Often, there are many possible choices, and different choices may lead to different probability distributions in the outcomes computed. This choice problem does not bother me that much; I see it as having design freedom. But if you are a philosopher of language trying to find a single obvious system of meaning for natural language counterfactual sentences, this choice problem might bother you a lot, and you might be tempted to find some kind of representation-independent Occam’s razor that can be used to decide between counterfactual edits.
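The contrast between the two schools can be sketched on the vase example. This is a toy model under stated assumptions: the knowledge base is a set of (name, value) facts, the only general law is "the floor is wet exactly when the vase is broken", and `evaluate` is a hypothetical stand-in for an intervention-style edit, not an implementation of Pearl's full framework.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Values = Dict[str, bool]
Mechanism = Callable[[Values], bool]

# Toy general knowledge, listed in causal order (assumed for illustration).
MECHANISMS: Dict[str, Mechanism] = {
    "vase_broken": lambda v: True,           # in the actual world the vase is broken
    "floor_wet": lambda v: v["vase_broken"],  # the floor is wet exactly when the vase broke
}


def is_consistent(facts: Set[Tuple[str, bool]]) -> bool:
    """A fact set is inconsistent if some name is asserted both true and false."""
    names = [name for name, _ in facts]
    return len(names) == len(set(names))


def evaluate(mechanisms: Dict[str, Mechanism], edits: Optional[Values] = None) -> Values:
    """School 2: edited variables replace their mechanisms; the rest are recomputed."""
    values: Values = dict(edits or {})
    for name, mechanism in mechanisms.items():
        if name not in values:
            values[name] = mechanism(values)
    return values


# School 1: add 'vase not broken' to the existing facts, yielding a contradiction.
actual_facts = {("vase_broken", True), ("floor_wet", True)}
print(is_consistent(actual_facts | {("vase_broken", False)}))

# School 2: overwrite the fact first, then reason from the edited knowledge base.
print(evaluate(MECHANISMS))
print(evaluate(MECHANISMS, {"vase_broken": False}))
```

A different edit, such as also fixing `floor_wet` to true because something else was spilled, would give a different answer, which is the choice problem described above.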
Overall, my feeling is that school 2 gives an account of logical counterfactuals that is good enough for my purposes in AGI safety work.
As a trivial school 1 edge case, one could design a reasoning engine that can deal with contradictory facts in its input knowledge base as follows: the engine first makes some school 2 edits on its input to remove the contradictions, and then proceeds calculating the requested truth value. So one could argue that the schools are not fundamentally different, though I do feel they are different in outlook, especially in their outlook on how necessary or useful it will be for AGI safety to resolve certain puzzles.
What about school 3, the one that solves the problem with compartmentalisation/sandboxing?
I was hoping somebody would come up with more schools… I think I could interpret the techniques of school 3 as a particular way to implement the ‘make some edits before you input it into the reasoning engine’ prescription of school 2, but maybe school 3 is different from school 2 in how it would describe its solution direction.
There is definitely also a school 4 (or maybe you would say this is the same one as school 3) which considers it to be an obvious truth that when you run simulations or start up a sandbox, you can supply any starting world state that you like, and there is nothing strange or paradoxical about this. Specifically, if you are an agent considering a choice between taking actions A, B, and C as the next action, you can run different simulations to extrapolate the results of each. If a self-aware agent inside the simulation for action B computes that the action an optimal agent would have taken at the point in time where its simulation started was A, this agent cannot conclude there is a contradiction: such a conclusion would rest on making a category error. (See my answer in this post for a longer discussion of the topic.)
My motivation for talking about logical counterfactuals has little to do with free will, even if the philosophical analysis of logical counterfactuals does.
The reason I want to talk about logical counterfactuals is as follows: suppose as above that I learn that I am a robot, and that my source code is “A” (which is presumed to be deterministic in this scenario), and that I have a decision to make between action X and action Y. In order to make that decision, I want to know which decision has better expected utility. The problem is that, in fact, I will either choose X or Y. Suppose without loss of generality that I will end up choosing action X. Then worlds in which I choose Y are logically incoherent, so how am I supposed to reason about the expected utility of choosing Y?
I’m not using “free will” to mean something distinct from “the ability of an agent, from its perspective, to choose one of multiple possible actions”. Maybe this usage is nonstandard but find/replace yields the right meaning.
I think using the term in that way, without explicitly defining it, makes the discussion more confused.
From an omniscient point of view, or from your point of view? The typical agent has imperfect knowledge of both the inputs to their decision procedure and the procedure itself. So long as an agent treats what it thinks is happening as only one possibility, there is no contradiction, because possibly-X is always compatible with possibly-not-X.
From an omniscient point of view, yes. From my point of view, probably not, but there are still problems that arise relating to this, that can cause logic-based agents to get very confused.
Let A be an agent, considering options X and not-X. Suppose A |- Action=not-X → Utility=0. The naive approach to this would be to say: if A |- Action=X → Utility<0, A will do not-X, and if A |- Action=X → Utility>0, A will do X. Suppose further that A knows its source code, so it knows this is the case.
Consider the statement G=(A |- G) → (Action=X → Utility<0). It can be constructed by using Gödel numbering and quines. Present A with the following argument:
Suppose for the sake of argument that A |- G. Then A |- (A |- G), since A knows its source code. Also, by definition of G, A |- (A |- G) → (Action=X → Utility<0). By modus ponens, A |- (Action=X → Utility<0). Therefore, by our assumption about A, A will do not-X: Action!=X. But, vacuously, this means that (Action=X → Utility<0). Since we have proved this by assuming A |- G, we know that (A |- G) → (Action=X → Utility<0), in other words, we know G.
The argument then goes, similarly to above:
A |- G
A |- (A |- G)
A |- (A |- G) → (Action=X → Utility<0)
A |- (Action=X → Utility<0)
Action=Not-X
We proved this without knowing anything about X. This shows that naive logical implication can easily lead one astray. The standard solution to this problem is the chicken rule, making it so that if A ever proves which action it will take, it will immediately take the opposite action, which avoids the argument presented above, but is defeated by Troll Bridge, even when the agent has good logical uncertainty.
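The step that does the damage is the vacuous truth of the material conditional. A minimal Python sketch of just that step (the helper names `implies` and `conditional_holds` are hypothetical, and this does not model provability or the Gödel sentence itself):

```python
def implies(p: bool, q: bool) -> bool:
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


def conditional_holds(action: str, utility: float) -> bool:
    """Truth value of (Action = X -> Utility < 0) in a world with this action and utility."""
    return implies(action == "X", utility < 0)


# If the agent takes not-X, the conditional is true whatever the utility is.
vacuous = [conditional_holds("not-X", u) for u in (-10.0, 0.0, 1000.0)]

# If the agent takes X, the conditional depends on the actual utility.
informative = [conditional_holds("X", u) for u in (-10.0, 0.0, 1000.0)]
```

Here `vacuous` is true in every case, so an agent that avoids X whenever it proves the conditional can make the conditional true simply by avoiding X.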
These problems seem to me to show that logical uncertainty about the action one will take, paired with logical implications about what the result will be if you take a particular action, are insufficient to describe a good decision theory.
I am not aware of a good reason to believe that a perfect decision theory is even possible, or that counterfactuals of any sort are the main obstacle.
Simpler solution: in that world, your code is instead A’, which is exactly like A, except that it returns Y in this situation. This is the more general solution derived from Pearl’s account of counterfactuals in domains with a finite number of variables (the “twin network construction”).
Last year, my colleagues and I published a paper on Turing-complete counterfactual models (“causal probabilistic programming”), which details how to do this, and even gives executable code to play with, as well as a formal semantics. Have a look at our predator-prey example, a fully worked example of how to do this “counterfactual world is same except blah” construction.
http://www.zenna.org/publications/causal.pdf
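A minimal Python sketch of the “A’ is exactly like A, except in this situation” idea follows. It is not the API from the linked paper and not Pearl’s twin network construction; `agent_a`, `patched`, and the observation strings are hypothetical stand-ins.

```python
from typing import Callable

Policy = Callable[[str], str]


def agent_a(observation: str) -> str:
    """Stand-in for source code "A" (hypothetical): a fixed deterministic policy."""
    return "X" if observation == "decision point" else "wait"


def patched(policy: Policy, situation: str, new_action: str) -> Policy:
    """Build A': identical to `policy` except that it returns new_action in `situation`."""

    def policy_prime(observation: str) -> str:
        if observation == situation:
            return new_action
        return policy(observation)

    return policy_prime


agent_a_prime = patched(agent_a, "decision point", "Y")
```

Calling `agent_a_prime("decision point")` gives Y while every other observation is passed through to the original policy, so no world ever contains the claim that A itself returns Y.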
Yes, this is a specific way of doing policy-dependent source code, which minimizes how much the source code has to change to handle the counterfactual.
Haven’t looked deeply into the paper yet but the basic idea seems sound.
If the agent is ‘caused’ then in order for its source code to be different, something about the process that produced it must be different. (I haven’t seen this addressed.)
I found parts of your framing quite original and I’m still trying to understand all the consequences.
Firstly, I’m also opposed to characterising the problem in terms of logical counterfactuals. I’ve argued before that Counterfactuals are an Answer Not a Question, although maybe it would have been clearer to say that they are a Tool Not a Question instead. If we’re talking strictly, it doesn’t make sense to ask what maths would be like if 1+1=3 as it doesn’t, but we can construct a para-consistent logic where it makes sense to do something analogous to pretending 1+1=3. And so maybe one form of “logical counterfactual” could be useful for solving these problems, but that doesn’t mean asking what logical counterfactuals are, as though they were ontologically basic, as though they were in the map not the territory, as though they were a single unified concept, makes sense.
Secondly, “free will” is such a loaded word that using it in a non-standard fashion simply obscures and confuses the discussion. Nonetheless, I think you are touching upon an important point here. I have a framing which I believe helps clarify the situation. If there’s only one possible decision, this gives us a Trivial Decision Problem. So to have a non-trivial decision problem, we’d need a model containing at least two decisions. If we actually did have libertarian free will, then our decision problems would always be non-trivial. However, in the absence of this, the only way to avoid triviality would be to augment the factual with at least one counterfactual.
Counterfactual non-realism: Hmm… I see how this could be a useful concept, but the definition given feels a bit vague. For example, recently I’ve been arguing in favour of what counts as a valid counterfactual being at least partially a matter of social convention. Is that counterfactual non-realism?
Further, it seems a bit strange to associate material conditionals with counterfactual non-realism. Material conditionals only provide the outcome when we have a consistent counterfactual. So, either a) we believe in libertarian free will, or b) we use something like the erasure approach to remove information such that we have multiple consistent possibilities (see https://www.lesswrong.com/posts/BRuWm4GxcTNPn4XDX/deconfusing-logical-counterfactuals). Proof-based UDT doesn’t quite use material conditionals; it uses a paraconsistent version of them instead. Although, maybe I’m just being too pedantic here. In any case, we can find ways of making paraconsistent logic behave as expected in any scenario, but doing so would require a separate ground. That is, it isn’t enough that the logic merely seems to work; we should be able to provide a separate reason for why using a paraconsistent logic in that way is good.
Also, another approach which kind of aligns with counterfactual non-realism is to say that given the state of the universe at any particular time we can determine the past and future and that there are no counterfactuals beyond those we generate by imagining state Y at time T instead of state X. So, to imagine counterfactually taking action Y we replace the agent doing X with another agent doing Y and flow causation both forwards and backwards. (See this post for more detail). It could be argued that these count as counterfactuals, but I’d align it with counterfactual non-realism as it doesn’t have decision counterfactuals as separate ontological elements.
Policy-dependent source code—this is actually a pretty interesting framing. I’ve always defaulted to thinking about counterfactuals in terms of actions, but when we’re talking about things in terms of problems like Counterfactual Mugging, characterising counterfactuals in terms of policy might be more natural. It’s strange why this feels fresh to me—I mean UDT takes this approach—but I never considered the possibility of non-UDT policy counterfactuals. I guess from a philosophical perspective it makes sense to first consider whether policy-dependent source code makes sense and then if it does further ask whether UDT makes sense.
As a side note—one thing I don’t understand is why more people don’t seem to want to use just the word “will” without the “free” part in front of it.
It seems like a much more straightforward and less fraught term, and something that we obviously have. Do we have a “will”? Obviously yes—we want things, we choose things, etc. Is that will “free”? Well what does that mean?
EDIT: I feel like this is a case of philosophers baking in a confusion into their standard term. It’d be like if instead of space we always talked about “absolute space”. And then post-Einstein people argued about whether “absolute space” existed or not, without ever just using the term “space” just by itself.
Philosophers talk about free will because it is contentious and therefore worth discussing philosophically, whereas will, qua wants and desires, isn’t.
cf, the silly physicists who insist on talking about dark matter, when anyone can see that ordinary matter exists.
Fair point. But then why do so many (including philosophers) make statements like, “we seem to have free will”, or “this experience of apparent free will that we have requires explanation”?
If ‘free will’ in those statements means something different from ‘will’, then it seems like they’re assuming the (wrong) explanation.
If physicists often used the term “dark matter” in ways that suggested it’s the same thing as people’s folk concept of matter, then I’d agree that they were silly.
Why specific philosophers say specific things is usually explained by the philosophers themselves, since it is hard to gain a reputation in the field by making unsupported assertions. But you seem to be making the point that it is strange that any philosopher argues in favour of free will, since, according to you, it is obviously non-existent. The answer to that is that you are not capable of reproducing all the arguments for or against a claim yourself, so your personal guesswork is not a good guide to how plausible something is.
Doesn’t everything require explanation? Even your man Yudkowsky offers an explanation of the feeling of free will.
Physicists do use the word “matter” in a sense that departs from folk usage. For instance, they assert that it is mostly nothingness, and that it is equivalent to energy.
I didn’t mean that just the philosophers who believe in (libertarian, contra-causal) free will make statements like “we seem to have free will”, or “this experience of apparent free will that we have requires explanation”. I’ve heard those statements even from those questioning such free will.
They’ll say, “we seem to have free will, but actually it’s an illusion”.
What I do not see is proponents of determinism saying that “free will” is the wrong term, that most of the intuitive properties that our wants and choices seem to have are satisfied by the idea of a “will” plain and simple. And then starting the argument from there about whether there are additional properties that that will has or seems to have s.t. it’s reasonable to append the term “free” to the front.
Maybe it’s popularizers that I have to blame, rather than philosophers. I’m not sure. My complaint is that somehow the standard sides of the debate came to be labeled “free will” vs “determinism” rather than “uncaused will” vs “determined will”.
I think the “free will” vs “determinism” framing unfairly makes it seem like whether any wanting or choosing is happening is at stake, such that people had to come up with the special term “compatibilism” for the position that “no no, there’s still wanting and choosing going on”.
If you started the debate with everyone agreeing, “obviously there’s some form of wanting and choosing happening,” and then asking, “but what form does it take and where does it come from? Can it be said to be caused by anything?” then I think the natural terms for the two camps would be something like “uncaused will” and “determined will”.
I think those terms accurately describe the major sides of the popular debate and are less likely to prejudice people’s intuitions in favor of the free/uncaused will side.
So what I don’t understand is: why don’t proponents of determinism push that framing?
Proponents of determinism tend to say that libertarian free will doesn’t exist, but compatibilist free will might. It is likely that they are expressing the same idea as you, but in different language.
That’s an interesting point
Wikipedia says “Free will is the ability to choose between different possible courses of action unimpeded.” SEP says “The term “free will” has emerged over the past two millennia as the canonical designator for a significant kind of control over one’s actions.” So my usage seems pretty standard.
All word definitions are determined in large part by social convention. The question is whether the social convention corresponds to a definition (e.g. with truth conditions) or not. If it does, then the social convention is realist, if not, it’s nonrealist (perhaps emotivist, etc).
Not necessarily. An agent may be uncertain over its own action, and thus have uncertainty about material conditionals involving its action. The “possible worlds” represented by this uncertainty may be logically inconsistent, in ways the agent can’t determine before making the decision.
I don’t understand this? I thought it searched for proofs of the form “if I take this action, then I get at least this much utility”, which is a material conditional.
Policy-dependent source code does this; one’s source code depends on one’s policy.
I think UDT makes sense in “dualistic” decision problems that are already factorized as “this policy leads to these consequences”. Extending it to a nondualist case brings up difficulties, including the free will / determinism issue. Policy-dependent source code is a way of interpreting UDT in a setting with deterministic, knowable physics.
Not quite. The way you are using it doesn’t necessarily imply real control, it may be imaginary control.
True. Maybe I should clarify what I’m suggesting. My current theory is that there are multiple reasonable definitions of counterfactual and it comes down to social norms as to what we accept as a valid counterfactual. However, it is still very much a work in progress, so I wouldn’t be able to provide more than vague details.
I guess my point was that this notion of counterfactual isn’t strictly a material conditional due to the principle of explosion. It’s a “para-consistent material conditional” by which I mean the algorithm is limited in such a way as to prevent this explosion.
Hmm… good point. However, were you flowing this all the way back in time? Such as if you change someone’s source code, you’d also have to change the person who programmed them.
What do you mean by dualistic?
I’m discussing a hypothetical agent who believes itself to have control. So its beliefs include “I have free will”. Its belief isn’t “I believe that I have free will”.
Yes, that makes sense.
Yes (see thread with Abram Demski).
Already factorized as an agent interacting with an environment.
Hmm, yeah this could be a viable theory. Anyway, to summarise the argument I make in Is Backwards Causation Necessarily Absurd?, I point out that since physics is pretty much reversible, instead of A causing B, it seems as though we could also imagine B causing A and time going backwards. In this view, it would be reasonable to say that one-boxing (backwards-)caused the box to be full in Newcomb’s problem. I only sketched the theory because I don’t have enough physics knowledge to evaluate it. But the point is that we can give justification for a non-standard model of causality.
On my current understanding of this post, I think I have a criticism. But I’m not sure if I properly understand the post, so tell me if I’m wrong in my following summary. I take the post to be saying something like the following:
If my summary is right, I’m not sure how policy-dependent source code is a solution to the global accounting problem. This is because the agent, when asking what would have happened if I had done Y, still faces a global accounting problem. This is because the agent must then assume they have some different source code B, and it seems like choosing an appropriate B will be underdetermined. That is, there is no unique source code B to give you a determinate answer about what would have happened if you performed A*. I can see why thinking in terms of policy-dependent source code would be attractive if you were a nonrealist about specifically logical counterfactuals, and a realist about different kinds of counterfactuals. But that’s not what I took you to be saying.
The summary is correct.
Indeed, it is underdetermined what the alternative source code is. Sometimes it doesn’t matter (this is the case in most decision problems), and sometimes there is a family of programs that can be assumed. But this still presents theoretical problems.
The motivation is to be a nonrealist about logical counterfactuals while being a realist about some counterfactuals.
I see the problem of counterfactuals as essentially solved by quasi-Bayesianism, which behaves like UDT in all Newcomb-like situations. The source code in your presentation of the problem is more or less equivalent to Omega in Newcomb-like problems. A TRL agent can also reason about arbitrary programs, and learn that a certain program acts as a predictor for its own actions.
This approach has some similarity with material implication and proof-based decision theory, in the sense that out of several hypothesis about counterfactuals that are consistent with observations, the decisive role is played by the most optimistic hypothesis (the one that can be exploited for the most expected utility). However, it has no problem with global accounting and indeed it solves counterfactual mugging successfully.
It seems the approaches we’re using are similar, in that they both are starting from observation/action history with posited falsifiable laws, with the agent’s source code not known a priori, and the agent considering different policies.
Learning “my source code is A” is quite similar to learning “Omega predicts my action is equal to A()”, so these would lead to similar results.
Policy-dependent source code, then, corresponds to Omega making different predictions depending on the agent’s intended policy, such that when comparing policies, the agent has to imagine Omega predicting differently (as it would imagine learning different source code under policy-dependent source code).
Well, in quasi-Bayesianism for each policy you have to consider the worst-case environment in your belief set, which depends on the policy. I guess that in this sense it is analogous.
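A heavily simplified Python sketch of “for each policy, consider the worst-case environment in your belief set” is given below. It is plain maximin over a finite set, not the actual quasi-Bayesian formalism, and the two environments and their payoff numbers are assumptions chosen to resemble Newcomb’s problem.

```python
from typing import Callable, Dict, List

Policy = str
Environment = Callable[[Policy], float]  # maps a policy to its expected utility


def predictor_accurate(policy: Policy) -> float:
    # Assumed payoffs: a perfect predictor fills the big box only for one-boxers.
    return 1_000_000.0 if policy == "one-box" else 1_000.0


def predictor_slightly_noisy(policy: Policy) -> float:
    # Assumed payoffs: same game with a predictor that is right 99% of the time.
    return 990_000.0 if policy == "one-box" else 11_000.0


belief_set: List[Environment] = [predictor_accurate, predictor_slightly_noisy]


def worst_case_value(policy: Policy, environments: List[Environment]) -> float:
    """Value of a policy under the least favourable environment in the belief set."""
    return min(env(policy) for env in environments)


policies = ["one-box", "two-box"]
values: Dict[Policy, float] = {p: worst_case_value(p, belief_set) for p in policies}
chosen = max(values, key=values.get)
```

Note that the worst-case environment differs by policy: it is the noisy predictor for one-boxing and the accurate predictor for two-boxing, which is the policy dependence the analogy relies on.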
Short:
Unless something interfered with what they saw—there need not be pure/true observations.
And something might have incentive to do so if the agent were to do X if it “saw its source code was A” and were to do Y if it “saw its source code was B”. While A and B may be mutually exclusive, the actual policy “might” be dependent on observations of either.
Long:
[1] If a program takes long enough to run, it may never be found that it does halt. In a sense, the fact that its output is determined does not mean it can (or will) be deduced.
And set of inputs.
Overall take:
Dynamic versus static:
Consider the numbers 3, 1, 2, 4.
There exists more than one set of actions that ‘transforms’ the above into: 1, 2, 3, 4.
(It can also be transformed into a sorted list by deleting the 3...)
A sorting method however, does not always take a list and move the first element to the third position, or even necessarily do so in every case where the first element is three.
While deterministic, its behavior depends upon an input. Given the input, the actions it will take are known (or follow from the source code in principle[1]).
This can be generalized further, in the case of a sorting program that takes both a set of objects, and a way of ordering. Perhaps a program can even be written that reasons about some policy, and based on the results, makes an output conditional on what it finds. Thus the “logical counterfactual” does not exist per se, but is a way of thinking used in order to handle the different cases, as it is not clear which one is the case, though only one may be possible.
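A minimal Python sketch of the sorting example, with `sort_with` as a hypothetical helper name: the program is deterministic, but which moves it makes depends on both the input list and the ordering it is given.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def sort_with(items: Iterable[T], key: Callable[[T], object]) -> List[T]:
    """Deterministic sort whose behavior depends on the input and on the ordering supplied."""
    return sorted(items, key=key)


ascending = sort_with([3, 1, 2, 4], key=lambda n: n)
descending = sort_with([3, 1, 2, 4], key=lambda n: -n)
other_input = sort_with([2, 1], key=lambda n: n)
```

The same source code places the 3 third in one call and second in another, so reasoning about “what it would do on a different input” never requires supposing that the code itself behaves differently.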
More specific:
Though a policy may include/specify (simpler) policies, and thus by extension, a source code may as well, though the different threads will probably be weaved together.
I’m trying to understand where exactly in your approach you sneak in the free will...
For counterfactual nonrealism, it’s simply the uncertainty an agent has about their own action, while believing themselves to control their action.
For policy-dependent source code, the “different possibilities” correspond to different source code. An agent with fixed source code can only take one possible action (from a logically omniscient perspective), but the counterfactuals change the agent’s source code, getting around this constraint.
I think
when modeling a complex/not entirely understood system, probabilities may be a more effective framework.
Just as, if the output of a program were known before it was run, it probably wouldn’t need to be run, we don’t know what we’ll decide before we decide, though we do after, and we’re not sure how we could have predicted the outcome in advance.
What is “linear interactive causality”?
Basically, the assumption that you’re participating in a POMDP. The idea is that there’s some hidden state that your actions interact with in a temporally linear fashion (i.e. action 1 affects state 2), such that your late actions can’t affect early states/observations.
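A toy Python sketch of that temporal structure, with made-up dynamics and hypothetical function names: the hidden state is updated by each action only after the observation for that step has been produced, so a late action cannot change an earlier observation.

```python
import random
from typing import List, Tuple


def transition(state: int, action: int) -> int:
    """Hidden-state update (toy dynamics, assumed): action t affects state t+1 only."""
    return (state + action) % 5


def observe(state: int, rng: random.Random) -> int:
    """Noisy observation of the current hidden state."""
    return state if rng.random() < 0.8 else (state + 1) % 5


def run_episode(actions: List[int], seed: int = 0) -> List[Tuple[int, int]]:
    rng = random.Random(seed)
    state = 0
    history: List[Tuple[int, int]] = []
    for action in actions:
        obs = observe(state, rng)          # depends on the current state only
        history.append((obs, action))
        state = transition(state, action)  # influences later states only
    return history


episode_1 = run_episode([1, 1, 1])
episode_2 = run_episode([1, 1, 4])  # differs only in the last action
```

With the same seed, the two episodes produce identical observation sequences, since they differ only in the final action and that action comes after every recorded observation.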
OK, so no “backwards causation” ? (not sure if that’s a technical term and/or if I’m using it right...)
Is there a word we could use instead of “linear”, which to an ML person sounds like “as in linear algebra”?
Yes, it’s about there being no backwards causation. Linear has lots of meanings, and I’m not concerned about this getting confused with linear algebra, but you can suggest a better term if you have one.
There is a logical contradiction between the idea that your actions are determined, and the idea that you could have acted differently under the exact same circumstances. There is no such problem if you do not assume determinism, meaning that the “problem” of logical counterfactuals is neither unavoidable nor purely logical—it is not purely logical because a metaphysical assumption, an assumption about the way reality works is involved.
The assumption of determinism is implicit in talking of yourself as a computer programme, and the assumption of indeterminism is implicit in talking about yourself as nonetheless having free will.
A purely logical counterfactual, a logical counterfactual properly so-called, is a hypothetical state of affairs, where a different input or set of preconditions is supposed, and a different, also hypothetical output or result obtains. Such a counterfactual is logically consistent; it just isn’t consistent with what actually occurred.
People calculate logical counterfactuals all the time. You can figure out what output a programme will give in response to an input it has never received by looking at the code. But note that that is a purely epistemological issue. There may be a separate, ontological, not epistemological issue about real counterfactuals. If you have good reason to believe in determinism, which you don’t, you should disbelieve in real counterfactuals. But that says nothing about logical counterfactuals. So long as some hygiene is exercised about the epistemological/ontological distinction and the logical/real distinction, there is no problem.
Note that problems agents have in introspecting their own decision making are not problems with counterfactuals (real or logical) per se.
It doesn’t lead to serious relativism, because the perspectives are asymmetrical. The agent that knows more is more right.
A “spurious” counterfactual is just a logical, as opposed to real, counterfactual. The fact that it could never have occurred means it was never a real counterfactual.
This was interesting to read, but I’m not entirely sure what it means to me, and I’m still thinking it through.
As I was reading I started to think along the lines of your policy side: perhaps the question is not about how to twist A’s code into outputting Y but rather why not just consider that the agent runs some other code. (The immediate problem with my thought is that of the infinite regress.) But also when thinking about counterfactuals that is in a sense what I am exploring. But I would express that more as what if action/path Y were taken, where does that lead? Is that a better result? If so then the response is about updating some priors related to the input to A or updating A. In this sense the question is not about the logical problems of getting A to output Y when we know it has to output X but about the internal logic and decision making performance of A and if we need to update to A’.
I am also wondering if including the whole free-will aspect adds value. If you just took that aspect out, what changes for your thinking? Or is the whole question of free-will part of the philosophical question you need to address? If so, your post did prompt a thought on my thinking about free-will, particularly in the context of rational mindsets. I don’t know if someone has already followed this line of thinking (but would certainly think so), but I don’t think free-will can be rationally explored and explained within the confines of pure logical consistency.
Without some assumption similar to “free will” it is hard to do any decision theory at all, as you can’t compare different actions; there is only one possible action.
The counterfactual nonrealist position is closer to determinism than the policy-dependent source code position. This assumes that the algorithm controls the decision while the output of the algorithm is unknown.
Under determinism, there is only one actually possible action, and that doesn’t stop you comparing hypothetical actions. Logical possibility =/= real possibility. Since logical possibilities are only logical possibilities, no sincere assumption of real free will is required.
Since you are invariably in a far from omniscient state about both the world and your own inner workings, you are pretty much always dealing with hypotheses, not direct insight into reality.
This is exactly what is described in the counterfactual nonrealism section.
Under determinism, you should be a nonrealist about real counterfactuals, but there is still no problem with logical counterfactuals. So what is “the problem of logical counterfactuals”?
They’re logically incoherent so your reasoning about them is limited. If you gain in computing power then you need to stop being a realist about them or else your reasoning explodes.
They are not logically incoherent in themselves. They are inconsistent with what actually happened. That means that if you try to bundle the hypothetical, the logical counterfactual, in with your model of reality, the resulting mish mash will be inconsistent. But the resulting mish mash isn’t the logical counterfactual per se.
We can think about counterfactuals without our heads exploding. That is the correct starting point. How is that possible? The obvious answer is that consideration of hypothetical scenarios takes place in a sandbox.
They are logically incoherent in themselves though. Suppose the agent’s source code is “A”. Suppose that in fact, A returns action X. Consider a logical counterfactual “possible world” where A returns action Y. In this logical counterfactual, it is possible to deduce a contradiction: A returns X (by computation/logic) and returns Y (by assumption) and X is not equal to Y. Hence by the principle of explosion, everything is true.
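The explosion step can be checked mechanically at the propositional level. The Python sketch below is only a truth-table check under assumed encodings (`returns_x` for “A returns X”, `returns_y` for “A returns Y”, and mutual exclusion standing in for X ≠ Y); it does not model the computation of A.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    """Material implication."""
    return (not p) or q


def premises(returns_x: bool, returns_y: bool) -> bool:
    """A returns X (by computation), A returns Y (by assumption), and not both (X != Y)."""
    return returns_x and returns_y and not (returns_x and returns_y)


# For every truth assignment, the premises imply an arbitrary claim q.
explosion_is_valid = all(
    implies(premises(rx, ry), q)
    for rx, ry, q in product([False, True], repeat=3)
)

# No truth assignment makes the premises true together.
premises_satisfiable = any(
    premises(rx, ry) for rx, ry in product([False, True], repeat=2)
)
```

The implication holds in every row precisely because no row satisfies the premises, which is the sense in which “everything is true” in such a world.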
It isn’t necessary to observe that A returns X in real life, it can be deduced from logic.
(Note that this doesn’t exclude the logical material conditionals described in the post, only logical counterfactuals)
Source code doesn’t entirely determine the result, inputs are also required.* Thus “logical counterfactuals” -reasoning about what a program will return if I input y? This can be done by asking ‘if I had input y instead of x’ or ‘if I input y’ even if I later decide to input x.
While it can be said that such considerations render one’s “output” conditional on logic, they remain entirely conditional on reasoning about a model, which may be incorrect. It seems more useful to refer to such a relation as conditional on one’s models/reasoning, or even processes in the world. A calculator may be misused—a 2 instead of a 3 here, hitting “=” one too many times, there, etc.
(Saying it is impossible for a rational agent that knows X to do Y, and agent A is not doing Y, does not establish that A is irrational—even if the premises are true, what follows is that A is not rational or does not know X.)
*Unless source code is defined as including the inputs.
You are assuming a very strong set of conditions: that determinism holds, that the agent has perfect knowledge of its source code, and that it is compelled to consider hypothetical situations in maximum resolution.
Those are the conditions in which logical counterfactuals are most well-motivated. If there isn’t determinism or known source code then there isn’t an obvious reason to be considering impossible possible worlds.
Those are the conditions under which counterfactuals are flat out impossible. But we have plenty of motivation to consider hypotheticals, and we don’t generally know how possible they are.