I’ll focus on the gears-level elaboration of why all those computational mechanisms are equivalent. In short: If you want to actually get anything done, Turing machines suck! They’re painful to use, and that makes it hard to get insight by experimenting with them. Lambda calculus / combinator calculi are way better for actually seeing why this general equivalence is probably true, and then afterwards you can link that back to TMs. (But, if at all possible, don’t start with them!)
Sure, in some sense Turing machines are “natural” or “obvious”: If you start from finite state machines and add a stack, you get a push-down automaton. Do it again, and you essentially have a Turing machine. (Each stack is one end of the infinite tape.) There’s more bookkeeping involved to properly handle “underflows” (when you reach the bottom of a stack and need to extend it) yadda yadda, so every “logical state” would require a whole group of states to actually implement, but that’s a boring mechanical transformation. So just slap another layer of abstraction on it and don’t actually work with 2-stack PDAs, and you’re golden, right? Right…?
Well… actually the same problem repeats again and again. TMs are just extremely low-level and not amenable to abstraction. By default, you get two tape symbols. If you need more, you can emulate those, which will require multiple “implementation states” to implement a single “logical state”, so you add another layer of abstraction… Same with multiple tapes or whatever else you need. But you’re still stuck in ugly implementation details, because you need to move the head left and right to skip across stuff, and you can’t really abstract that away. (Likewise, chaining TMs or using one as a “subroutine” from another isn’t easily possible and involves lots of manual “rewiring”.) So except for trivial examples, actually working with Turing machines is extremely painful. It’s not surprising that it’s really hard to see why they should be able to run all effectively computable functions.
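(To be fair, the two-stack tape itself is the easy, mechanical part. Here's a minimal Haskell sketch of just that piece, with made-up names; the "underflow" bookkeeping is nothing more than "popping an empty stack conjures up a blank cell". It's everything above this level that refuses to abstract away.)

    -- A tape as two stacks (a sketch; all names here are made up).
    -- 'left' holds the cells to the left of the head, nearest first;
    -- 'right' holds the scanned cell and everything to its right.
    data Tape a = Tape { left :: [a], right :: [a] }

    -- Reading and writing only touch the top of the 'right' stack.
    readHead :: a -> Tape a -> a
    readHead blank (Tape _ [])    = blank          -- underflow: conjure a blank cell
    readHead _     (Tape _ (x:_)) = x

    writeHead :: a -> Tape a -> Tape a
    writeHead x (Tape l [])    = Tape l [x]
    writeHead x (Tape l (_:r)) = Tape l (x:r)

    -- Moving the head pops one stack and pushes the other; popping an
    -- empty stack is exactly the "extend the tape with blanks" bookkeeping.
    moveLeft, moveRight :: a -> Tape a -> Tape a
    moveLeft  blank (Tape []    r) = Tape [] (blank : r)
    moveLeft  _     (Tape (x:l) r) = Tape l  (x : r)
    moveRight blank (Tape l [])    = Tape (blank : l) []
    moveRight _     (Tape l (x:r)) = Tape (x : l) r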
Enter lambda calculus. It’s basically just functions, on a purely syntactic level. You have binders or (anonymous) function definitions – the eponymous lambdas. You have variables, and you have nested application of variables and terms (just like you’re used to from normal functions.) There’s exactly one rule: If a function is applied to an argument, you substitute that argument into the body wherever the placeholder variable was used. So e.g. (λx.((fx)y)(gx))z → ((fz)y)(gz) by substituting z for x. (Actual use of lambda calculus needs a few more rules: You have to make sure that you’re not aliasing variables and changing the meaning, but that’s a relatively minor (if tricky) bookkeeping problem that does abstract away fairly cleanly.)
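To see that "exactly one rule" really is all there is, here's a tiny Haskell sketch of my own (it dodges the aliasing problem by simply assuming all bound variables already have distinct names):

    -- Untyped lambda terms; substitution is naive because we assume
    -- bound variables are already named apart (no capture handling).
    data Term = Var String | Lam String Term | App Term Term
      deriving Show

    -- subst x s t: replace the free occurrences of x in t by s.
    subst :: String -> Term -> Term -> Term
    subst x s (Var y)   | x == y    = s
                        | otherwise = Var y
    subst x s (Lam y b) | x == y    = Lam y b               -- x is shadowed here
                        | otherwise = Lam y (subst x s b)
    subst x s (App f a) = App (subst x s f) (subst x s a)

    -- The one rule (beta reduction), applied at the root if it fits:
    beta :: Term -> Maybe Term
    beta (App (Lam x body) arg) = Just (subst x arg body)
    beta _                      = Nothing

    -- The example above: (λx.((f x) y)(g x)) z  →  ((f z) y)(g z)
    example :: Maybe Term
    example = beta (App (Lam "x" (App (App (App (Var "f") (Var "x")) (Var "y"))
                                      (App (Var "g") (Var "x"))))
                        (Var "z"))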
Now, on top of that, you can quickly build a comfortable working environment. Want to actually name things? let x := y in t is just a ((λx.t)y). (The assignment being split around the whole code is a bit ugly, but you don’t even need “actual” extensions, it’s just syntax sugar.) Data can also be emulated quite comfortably, if you understand one key idea: Data is only useful because you eventually do something with it. If all you have is abstraction, just abstract out the “actually doing” part with a placeholder in the most general scheme that you could possibly need. (Generally an induction scheme or “fold”.) [Note: I’ll mostly use actual names instead of one-letter variables going forward, and may drop extra parentheses: f x y z = (((f x) y) z)]
In a way, Booleans are binary decisions. true := λtt.λff.tt, false := λtt.λff.ff (Now your if c then x else y just becomes a c x y… or you add some syntax sugar.)
In a way, natural numbers are iteration. zero := λze.λsu.ze, succ := λn.λze.λsu.su(n ze su). (If you look closely, that’s just the induction principle. And all that the number “does” is: (succ (succ (succ zero)) x f) = f (f (f x)))
Lists are just natural numbers “with ornaments”, a value dangling on every node. nil := λni.λco.ni, cons := λx.λxs.λni.λco.co x (xs ni co). The usual list functions are easy too: map := λf.λxs.(xs nil (λx.λxs.cons (f x) xs)), fold := λxs.xs (yep, the list – as represented – is the fold).
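In Haskell these encodings typecheck almost literally; here's a sketch of my own (the c- prefixes and the little decoders are just invented for poking at things in GHCi):

    -- Booleans as binary decisions; "if c then x else y" is just  c x y.
    ctrue, cfalse :: a -> a -> a
    ctrue  tt _ = tt
    cfalse _ ff = ff

    -- Naturals as iteration:  csucc (csucc (csucc czero)) x f  =  f (f (f x))
    czero :: a -> (a -> a) -> a
    czero ze _ = ze

    csucc :: (a -> (a -> a) -> a) -> a -> (a -> a) -> a
    csucc n ze su = su (n ze su)

    toInt :: (Int -> (Int -> Int) -> Int) -> Int
    toInt n = n 0 (+ 1)

    -- Lists as naturals "with ornaments"; the list, as represented, is its own fold.
    cnil :: b -> (a -> b -> b) -> b
    cnil ni _ = ni

    ccons :: a -> (b -> (a -> b -> b) -> b) -> b -> (a -> b -> b) -> b
    ccons x xs ni co = co x (xs ni co)

    cmap :: (a -> c) -> (b -> (a -> b -> b) -> b) -> b -> (c -> b -> b) -> b
    cmap f xs ni co = xs ni (\x rest -> co (f x) rest)

    toList :: ([a] -> (a -> [a] -> [a]) -> [a]) -> [a]
    toList xs = xs [] (:)

    -- e.g.  toList (cmap (+ 1) (ccons 1 (ccons 2 cnil)))  ==  [2,3]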
And so on… so I think you’ll believe me that you basically have a full functional programming language, and that lambda calculus is equivalent to all those more usual computational substrates. (…and maybe Turing machines too?)
Still… even if lambda calculus is extremely simple in some ways, it’s also disturbingly complex in others. Substitution everywhere inside a term is “spooky action at a distance”; a priori it’s quite plausible that this does a lot of “heavy lifting”. So: Let’s turn to combinator calculi. Let’s keep the term structure and application of things, but get rid of all (object-level) variables and binders. Instead, we define three substitution rules (using variables only at the meta-level):
I x = x
K v x = v (or, fully parenthesized: (K v) x = v)
S a b x = (a x) (b x) (or, again more precisely: ((S a) b) x = (a x) (b x))
(Strictly speaking, we don’t even need I, as SKKx = (Kx)(Kx) = x, so if we say I := SKK that’s equivalent.) The interesting thing about these rules is that all lambda-terms can be translated into SKI-terms. (To get rid of a binder / λ, you recursively transform the body: Pretend the intended argument x is already applied to the body; how do you get it where it needs to go? If you have an application, you prefix it with S so that the argument is passed down into both branches. If a branch doesn’t use the argument (that includes other variables), prefix it with K so it remains unchanged. Otherwise, it is the bound variable that should be substituted, so put an I and you get the value. Done.) Worst case, this will cause an exponential(!) increase in term size, but it’ll get the job done. After that, there’s no more “looking inside terms” or dealing with variables, just extremely simple rewrite rules. To revisit the examples above:
true := λtt.λff.tt → λtt.(K tt) → S(KK)I, false := λtt.λff.ff → λtt.(I) → KI
zero := λze.λsu.ze → S(KK)I, succ := λn.λze.λsu.su(n ze su) → λn.λze.SI(S(K(n ze))I) → λn.S(K(SI))(S(S(KS)(S(KK)(S(KK)(S(K n)I)))(KI)) → (S(K(S(K(SI))))(S(S(KS)(S(K(S(KS)))(S(K(S(KK)))(S(S(KS)(S(KK)I))(KI)))))(K(KI)))) […except I think I messed up somewhere.]
(And yes, I did that by hand… so I hope you’ll accept that I stop there and don’t translate the other examples.)
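If you'd rather let a machine do that by-hand work, the translation procedure above fits in a few lines of Haskell. This is a sketch of my own (all names invented, no optimizations, hence the blow-up), together with the rewrite rules as a one-step reducer:

    -- Lambda terms extended with the S, K, I constants.
    data Expr = V String | A Expr Expr | L String Expr | S | K | I
      deriving Show

    -- Remove all binders, innermost first; afterwards only V/A/S/K/I remain.
    compile :: Expr -> Expr
    compile (A f a) = A (compile f) (compile a)
    compile (L x b) = abstract x (compile b)
    compile e       = e

    -- Does x occur in a lambda-free expression?
    occurs :: String -> Expr -> Bool
    occurs x (V y)   = x == y
    occurs x (A f a) = occurs x f || occurs x a
    occurs _ _       = False

    -- abstract x e  builds a term e' such that  e' N  reduces to  e[x := N].
    abstract :: String -> Expr -> Expr
    abstract x e | not (occurs x e) = A K e    -- branch doesn't use the argument: K
    abstract _ (V _)   = I                     -- it's the bound variable itself: I
    abstract x (A f a) = A (A S (abstract x f)) (abstract x a)  -- application: S passes the argument into both branches
    abstract _ e       = A K e                 -- unreachable for lambda-free input

    -- The rewrite rules, as one reduction step at the root (if one applies):
    step :: Expr -> Maybe Expr
    step (A I x)             = Just x
    step (A (A K v) _)       = Just v
    step (A (A (A S a) b) x) = Just (A (A a x) (A b x))
    step _                   = Nothing

    -- compile (L "tt" (L "ff" (V "tt")))  evaluates to  A (A S (A K K)) I,
    -- i.e. S(KK)I, the translation of "true" shown above.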
So actually you don’t need lambda calculus (or variables, for that matter), just these two rules. You can compile your Haskell to this SKI-stuff.
Take a moment to process that.
Now I hope it’s also pretty obvious that you can implement a Turing machine in Haskell? Same for imperative programming languages – look at e.g. the Whitespace interpreter (Haskell version in the archive, Idris version on GitHub). So at this point I think you have a pretty good intuition as to why all existing computational mechanisms can be reduced down to SKI-calculus.
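(To back that first claim up: a single-tape Turing-machine interpreter really is just a handful of lines. A sketch of my own, with the transition function and alphabet left abstract and the two-stack tape from before inlined:)

    -- A minimal single-tape Turing machine interpreter (all names mine).
    data Move = MoveLeft | MoveRight

    -- Transition function: current state and scanned symbol in; new state,
    -- written symbol and head movement out. Nothing means "halt".
    type Delta state sym = state -> sym -> Maybe (state, sym, Move)

    -- Configuration: state, cells left of the head, scanned cell, cells right.
    data Config state sym = Config state [sym] sym [sym]

    stepTM :: sym -> Delta state sym -> Config state sym -> Maybe (Config state sym)
    stepTM blank delta (Config q ls h rs) = do
      (q', h', m) <- delta q h
      pure $ case m of
        MoveLeft  -> case ls of
          []    -> Config q' []  blank (h' : rs)
          x : l -> Config q' l   x     (h' : rs)
        MoveRight -> case rs of
          []    -> Config q' (h' : ls) blank []
          x : r -> Config q' (h' : ls) x     r

    -- Run until delta halts, returning the final configuration.
    runTM :: sym -> Delta state sym -> Config state sym -> Config state sym
    runTM blank delta c = maybe c (runTM blank delta) (stepTM blank delta c)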
As for the other direction… with a basic TM model, dealing with all those dynamically resizing parts would be extremely painful, but if you use an extended Turing machine that has lots of different symbols and also more than one tape, it’s actually not that hard? Detecting expandable locations isn’t that hard if you use one tape as scratch space for counting nested parentheses. Doing the “reductions” (which can actually greatly expand the space needed, because S duplicates its argument) can also happen on a separate tape, and afterwards you copy the result back onto the “main tape”, after adjusting the placement of the unchanged stuff on the side. (I hope that’s sufficiently clear? This is basically using multiple tapes as a fixed number of temporary helper variables.)
And that’s it. Now you can run your Turing machine in your Haskell on your Turing machine, or whatever… Hope you’re having fun!
Thanks for your reply! I had never heard of SKI-Calculus!
In short: If you want to actually get anything done, Turing machines suck! They’re painful to use, and that makes it hard to get insight by experimenting with them.
I feel like one advantage Turing machines have is that they are in some sense low level (actually quite close to how we build our computers)? They do seem to have a nice abstraction for memory. I would have no idea how to characterize NSPACE with lambda calculus, for example (though of course googling reveals that someone seems to have figured it out).
They’re actually quite different from how our computers work (just on the surface already, the program is unmodifiable and separate from the data, and the whole walking around is also a very important difference[1]), but I agree that they feel “electronics-y” / like a big ball of wires.
Don’t forget that most of the complexity theoretic definitions are based on Turing machines, because that happened to become the dominant model for the study of these kinds of things. Similar-but-slightly-different constructions would perhaps be more “natural” on lambda calculus, so if that had been the dominant model, then the Turing machine versions would have been slightly awkward instead.
But memory isn’t actually a problem – see Lisp. Represent everything as a cons cell (a storage unit with two slots) or a “symbol” (basically an unmodifiable string that compares in O(1)), where cons cells nest arbitrarily. (Simplest cell representation is two pointers and maybe a few tag bits to decide what it represents… optionally plus garbage collector stuff. Very close to how our computers operate!) Variable names become symbols, application becomes a cell, and lambdas can either become a specially tagged cell, or a nested construction like (lambda ('var body)) (doesn’t really make a difference memory-wise, just a constant factor.) Time is also easy – just count the number of reductions, or maybe include the number of substitutions if you think that’s important. (But again, constant factor, doesn’t actually matter.)
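(A sketch of my own of that memory story; the encoding details are just one arbitrary choice, as noted, and only cost a constant factor either way:)

    -- Everything is a symbol (O(1) comparison) or a cons cell with two slots.
    data Cell = Symb String | Cons Cell Cell

    -- One possible encoding of lambda terms into cells:
    var :: String -> Cell
    var = Symb

    app :: Cell -> Cell -> Cell
    app = Cons

    lam :: String -> Cell -> Cell
    lam x body = Cons (Symb "lambda") (Cons (Symb x) body)

    -- Space is just the number of cells/symbols allocated; time would be a
    -- counter bumped once per reduction (plus per substitution, if you care).
    size :: Cell -> Int
    size (Symb _)   = 1
    size (Cons a b) = 1 + size a + size b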
You could even argue that the problem goes in the other direction: For Turing machines, you largely don’t care about the states (often you don’t even explicitly construct them, just argue that they could be constructed), whereas lambda calculus programs are routinely written out in full. So had lambda calculus become the basis, then program size / “problem description complexity”, the “cost of abstractions”, reusability and other related things might have been considered more important. (You get a bit of that now, under the name of “code quality”, but it could have happened much earlier and way more thoroughly.)
An average normal program is extremely pointer-heavy, but you don’t even have the “alphabet” to efficiently represent pointers. (Unless you make “bytes” from bits and then multiply your state count to deal with that construction.) You also don’t have absolute locations, you can’t really do variables (especially not arbitrary numbers of them), and so on. So if you look deeper, both have a very different approach to state and, by extension, functions.
These are good points. From my understanding of how processors work, it seems one would get most of the benefits you mention by having addresses/absolute locations (and thus being able to use pointers [by using addresses instead of left/right operations]). Does that ring true to you?
I think I now get something of what you said, that in a certain sense we don’t care about state with a Turing machine. It just occurred to me that the only reason we can use these more convenient designs in a processor is that its memory is limited. (Not sure if this is true? There is probably a really awkward way you could encode addresses even if your memory were infinite? You’d probably do some tree-like thing, and figuring out where you actually want to put stuff would take some time.)