Explaining a Math Magic Trick

Introduction

A recent popular tweet did a “math magic trick”, and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question:

[Image of the tweet]

This is a cute magic trick, and like any good trick they nonchalantly gloss over the most important step. Did you spot it? Did you notice your confusion?

Here’s the key question: Why did they switch from a differential equation to an integral equation? If you can use $\frac{1}{1-x} = 1 + x + x^2 + \cdots$ when $x$ is the integral operator, why not use it when $x$ is the derivative operator?

Well, let’s try it, writing $D$ for the derivative:

$$f' = f \iff (1 - D)f = 0 \iff f = \frac{1}{1-D}\,0 = (1 + D + D^2 + \cdots)\,0 = 0$$

So now you may be disappointed, but relieved: yes, this version fails, but at least it fails safe, giving you only the trivial solution $f = 0$, right?

But no, $\frac{1}{1-D}$ can actually fail catastrophically, which we can see if we try a nonhomogeneous equation like $f - f' = e^{2x}$ (which you may recall has solution $f(x) = Ce^x - e^{2x}$):

$$f = \frac{1}{1-D}\,e^{2x} = (1 + D + D^2 + \cdots)\,e^{2x} = (1 + 2 + 4 + \cdots)\,e^{2x} = \infty$$
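To make the failure concrete, here is a small Python sketch of my own (using the example $g(x) = e^{2x}$): applying the partial sums $1 + D + \cdots + D^N$ to $g$ multiplies $g$ by $2^n$ at each order, so the partial sums blow up even at the single point $x = 0$.

```python
# Sketch: apply the partial sums 1 + D + ... + D^N to g(x) = e^{2x}.
# Since D^n g = 2^n * e^{2x}, the terms grow instead of shrinking,
# and the partial sums diverge even at a single point.

def derivative_series_at_zero(N):
    """Value at x = 0 of (1 + D + ... + D^N) g for g(x) = e^{2x}."""
    # At x = 0 we have D^n g = 2^n * e^0 = 2^n.
    return sum(2 ** n for n in range(N + 1))

for N in (5, 10, 20):
    print(N, derivative_series_at_zero(N))  # 63, 2047, 2097151
```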

However, the integral version still works. To formalize the original approach: we define the function $I$ (for integral) to take in a function $f$ and produce the function $If$ defined by $(If)(x) = \int_0^x f(t)\,dt$. The equation $f' = f$ with $f(0) = 1$ then becomes $f = 1 + If$, and

$$f = \frac{1}{1-I}\,1 = (1 + I + I^2 + \cdots)\,1 = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = e^x$$

This rigorizes the original trick, elegantly incorporates the initial conditions of the differential equation, and fully generalizes to solving nonhomogeneous versions like $f - f' = e^{2x}$ (left as an exercise to the reader, of course).
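To watch the integral version succeed numerically, here is a Python sketch of my own (a simple trapezoid rule on a grid, purely illustrative): iterating $f \mapsto 1 + If$ produces exactly the partial sums $(1 + I + \cdots + I^k)\,\mathbf{1}$, and they converge to $e^x$ on $[0, 1]$.

```python
import math

# Sketch: solve f = 1 + If on [0, 1] by iterating f <- 1 + If,
# which produces the partial sums (1 + I + ... + I^k) 1.
M = 1000                          # grid intervals
h = 1.0 / M
xs = [i * h for i in range(M + 1)]

def integrate(f):
    """Apply I: (If)(x) = integral of f from 0 to x, via trapezoids."""
    out = [0.0]
    for i in range(1, len(f)):
        out.append(out[-1] + 0.5 * h * (f[i - 1] + f[i]))
    return out

f = [1.0] * (M + 1)               # start from the constant function 1
for _ in range(30):
    f = [1.0 + v for v in integrate(f)]

max_err = max(abs(v - math.exp(x)) for v, x in zip(f, xs))
print(max_err)                    # small discretization error
```

Unlike the derivative series, each application of $I$ shrinks the new term, so the iteration settles down instead of blowing up.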

So why does $\frac{1}{1-D}$ fail, but $\frac{1}{1-I}$ work robustly? The answer is functional analysis!

Functional Analysis

Savvy readers may already be screaming that the trick $\frac{1}{1-x} = 1 + x + x^2 + \cdots$ for numbers only holds true for $|x| < 1$, and this is indeed the key to explaining what happens with $D$ and $I$! But how can we define the “absolute value” of “the derivative function” $D$ or “the integral function” $I$?

What we’re looking for is a norm, a function that generalizes absolute values. A norm on a real vector space $V$ is a function $\|\cdot\| : V \to \mathbb{R}$ satisfying these properties:

  1. $\|v\| \ge 0$ for all $v \in V$ (positivity), and $\|v\| = 0$ if and only if $v = 0$ (positive-definite)

  2. $\|v + w\| \le \|v\| + \|w\|$ for all $v \in V$ and $w \in V$ (triangle inequality)

  3. $\|cv\| = |c|\,\|v\|$ for all $v \in V$ and real numbers $c$, where $|c|$ denotes the usual absolute value (absolute homogeneity)

Here’s an important example of a norm: fix some compact subset $X$ of $\mathbb{R}$, say $X = [0, 1]$, and for a continuous function $f : X \to \mathbb{R}$ define $\|f\|_\infty = \max_{x \in X} |f(x)|$, which would commonly be called the $L^\infty$-norm of $f$. (We may use a maximum here due to the Extreme Value Theorem. In general you would use a supremum instead.) Again I shall leave it to the reader to check that this is a norm.
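As a concrete illustration (a Python sketch of my own; the grid-based maximum is a stand-in for the true maximum over $X$), here is the $L^\infty$-norm on $X = [0, 1]$:

```python
import math

# Sketch: approximate the L-infinity norm on X = [0, 1]
# by taking the max of |f| over a fine grid.
def sup_norm(f, n=10_001):
    return max(abs(f(i / (n - 1))) for i in range(n))

print(sup_norm(lambda x: x * (1 - x)))  # 0.25, attained at x = 1/2
print(sup_norm(math.sin))               # sin is increasing on [0, 1], so sin(1)
```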

This example takes us halfway to our goal: we can now talk about the “absolute value” of a continuous function that takes in a real number and spits out a real number, but $D$ and $I$ take in functions and spit out functions (what we usually call an operator, so what we need is an operator norm).

Put another way, the $L^\infty$-norm is “the largest output of the function”, and this will serve as the inspiration for our operator norm. Making the minimal changes possible, we might try to define $\|I\| = \max_f \|If\|_\infty$. There are two problems with this:

  1. First, since $I$ is linear, you can make $\|If\|_\infty$ arbitrarily large by scaling $f$ by 10x, or 100x, etc. We can fix this by restricting the set of valid $f$ for these purposes, just like how the example $\|f\|_\infty$ restricted the inputs of $f$ to the compact set $X = [0, 1]$. Unsurprisingly, a nice choice of set to restrict to is the “unit ball” of functions, the set of functions $f$ with $\|f\|_\infty \le 1$.

  2. Second, we must bid a tearful farewell to the innocent childhood of maxima, and enter the liberating adulthood of suprema. This is necessary since $f$ ranges over the infinite-dimensional vector space of continuous functions, so the Heine–Borel theorem no longer guarantees that the unit ball is compact, and therefore the Extreme Value Theorem no longer guarantees that we will attain a maximum.

So the proper definitions of the norms of $D$ and $I$ are:

$$\|D\| = \sup_{\|f\|_\infty \le 1} \|Df\|_\infty, \qquad \|I\| = \sup_{\|f\|_\infty \le 1} \|If\|_\infty$$

(and you can define similar norms for any linear operator, including $1 - I$, $\frac{1}{1-I}$, etc.) A good exercise is to show these equivalent definitions of the operator norm for any linear operator $L$:

$$\|L\| = \sup_{\|f\|_\infty \le 1} \|Lf\|_\infty = \sup_{f \neq 0} \frac{\|Lf\|_\infty}{\|f\|_\infty} = \inf\{\, c \ge 0 : \|Lf\|_\infty \le c\,\|f\|_\infty \text{ for all } f \,\}$$

So another way of thinking of the operator norm is as the maximum stretching factor of the linear operator. The third definition also motivates the terminology of bounded linear operators: each such $c$ is a bound on the operator $L$, and the least such bound is the norm. Fun exercise: show that a linear operator is bounded if and only if it is continuous (with respect to the correct topologies). Hint: you’ll need to work in infinite-dimensional spaces here, because any linear operator on a finite-dimensional space must be bounded.

Now let’s actually compute these norms! For $I$, remember that our $L^\infty$-norm is defined over the interval $X = [0, 1]$. First observe that for the constant function $\mathbf{1}$ (defined by $\mathbf{1}(x) = 1$), we have $(I\mathbf{1})(x) = \int_0^x 1\,dt = x$, so $\|I\mathbf{1}\|_\infty = \max_{x \in [0,1]} |x| = 1$. Thus $\|I\| \ge 1$. To show that this is indeed the maximum we use the triangle inequality for integrals: for any $f$ with $\|f\|_\infty \le 1$,

$$\|If\|_\infty = \max_{x \in [0,1]} \left| \int_0^x f(t)\,dt \right| \le \max_{x \in [0,1]} \int_0^x |f(t)|\,dt \le \int_0^1 \|f\|_\infty\,dt = \|f\|_\infty \le 1$$

So we have shown $\|I\| = 1$! Put a pin in that while we check $D$.
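Here is a quick numerical spot check of $\|I\| = 1$ (a Python sketch of my own; the sample functions are arbitrary choices from the unit ball):

```python
import math

# Sketch: the stretch factor ||If|| / ||f|| never exceeds 1 on [0, 1],
# and the constant function 1 attains it.
M = 2000
h = 1.0 / M
xs = [i * h for i in range(M + 1)]

def I(f):
    """(If)(x) = integral of f from 0 to x, via the trapezoid rule."""
    out = [0.0]
    for i in range(1, len(f)):
        out.append(out[-1] + 0.5 * h * (f[i - 1] + f[i]))
    return out

def sup_norm(vals):
    return max(abs(v) for v in vals)

samples = {
    "1":        lambda x: 1.0,
    "sin(20x)": lambda x: math.sin(20 * x),
    "x^5":      lambda x: x ** 5,
}
ratios = {}
for name, g in samples.items():
    f = [g(x) for x in xs]
    ratios[name] = sup_norm(I(f)) / sup_norm(f)
print(ratios)  # the constant function gives 1.0; the others give less
```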

For $D$, we have a problem: for any positive number $c$, $D\,e^{cx} = c\,e^{cx}$, so $\|D\,e^{cx}\|_\infty = c\,\|e^{cx}\|_\infty$. In other words, $D$ can stretch functions by any amount, so it has no norm, or we’d write $\|D\| = \infty$ (and I promise this is a failure of $D$, not of our definitions). Put another way, $D$ is not bounded as a linear operator, since it can stretch functions by an arbitrary amount.
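Numerically (again a sketch of my own), the stretch factor of $D$ on $e^{cx}$ over $[0, 1]$ really is $c$, since both maxima occur at $x = 1$, so no single bound can work:

```python
import math

# Sketch: on [0, 1], ||D e^{cx}|| / ||e^{cx}|| = c, since both maxima
# occur at x = 1.  The stretch factor grows without bound as c grows.
def stretch_factor(c, n=10_001):
    xs = [i / (n - 1) for i in range(n)]
    f  = [math.exp(c * x) for x in xs]   # e^{cx}
    df = [c * v for v in f]              # its derivative, c e^{cx}
    return max(abs(v) for v in df) / max(abs(v) for v in f)

print([stretch_factor(c) for c in (1, 10, 100)])  # approximately [1, 10, 100]
```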

But now let’s return to $I$. We said that $\|I\| = 1$ (if we’re defining it relative to the $L^\infty$-norm on $X = [0, 1]$), but isn’t $\frac{1}{1-x} = 1 + x + x^2 + \cdots$ only true when $|x| < 1$? For real numbers, yes, but for operators, something magical happens: $\|I^n\| = \frac{1}{n!}$! (It’s like there’s a whole algebra of these operators...)

In fact, you can show that $\|I^n f\|_\infty$ assumes its maximum value over the unit ball when applied to the constant function $\mathbf{1}$, and hence $\|I^n\| = \|I^n \mathbf{1}\|_\infty = \left\| \frac{x^n}{n!} \right\|_\infty = \frac{1}{n!}$. Since $n!$ grows faster than exponential functions, $\|I^n\|$ converges to 0 quickly, so $1 + I + I^2 + \cdots$ is a Cauchy sum, and it is then straightforward to show that the limit is the multiplicative inverse of $1 - I$. Thus, $\frac{1}{1-I} = 1 + I + I^2 + \cdots$ is a valid expression that you can apply to any continuous (or bounded) function on any compact set $X$. This convergence happens regardless of the choice of the compact set, though it will happen at different rates, analogous to uniform convergence on compact sets.
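The key fact $\|I^n\| = \frac{1}{n!}$ can be spot-checked numerically (a Python sketch of my own): applying $I$ repeatedly to the constant function $\mathbf{1}$ yields $\frac{x^n}{n!}$, whose maximum on $[0, 1]$ is $\frac{1}{n!}$.

```python
import math

# Sketch: I^n applied to the constant function 1 gives x^n / n!,
# so its sup norm on [0, 1] is 1/n!.
M = 2000
h = 1.0 / M

def I(f):
    """(If)(x) = integral of f from 0 to x, via the trapezoid rule."""
    out = [0.0]
    for i in range(1, len(f)):
        out.append(out[-1] + 0.5 * h * (f[i - 1] + f[i]))
    return out

f = [1.0] * (M + 1)
norms = []
for n in range(1, 6):
    f = I(f)
    norms.append(max(abs(v) for v in f))

expected = [1 / math.factorial(n) for n in range(1, 6)]
print([round(v, 6) for v in norms])     # close to [1, 1/2, 1/6, 1/24, 1/120]
print([round(v, 6) for v in expected])
```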

Summary

  • Writing $D$ for derivative and $I$ for integral, we showed that $\frac{1}{1-D} = 1 + D + D^2 + \cdots$ can fail, even though $\frac{1}{1-I} = 1 + I + I^2 + \cdots$ is always true.

  • To explain this, we have to show that $I$ is fundamentally better behaved than $D$, in a way analogous to $|x| < 1$ versus $|x| \ge 1$ for the geometric series.

  • We built this up in two steps. First, we defined the $L^\infty$-norm for real-valued functions, which lets you say how “large” those functions are. Then, we extended this to function-valued functions (operators), having to make two slight modifications along the way.

  • With this machinery in place, we could show that $\|I\| = 1$, or we can say that $I$ is bounded. The resulting norm depends on the domain of the functions under consideration, but any compact domain is allowable. Also, since $\|I^n\| \to 0$ on any such domain, the exact value of $\|I\|$ doesn’t matter, since the norm of each term goes to 0.

  • Since $\|I^n\| \to 0$ sufficiently quickly, we can say that $1 + I + I^2 + \cdots$ is Cauchy as a sequence of operators. In other words, if you apply the partial sums $1 + I + \cdots + I^n$ as operators to any function $g$, the resulting functions will converge with respect to the $L^\infty$-norm. Writing $f$ for the function they converge to, it follows that $(1 - I)f = g$, so we may write $\frac{1}{1-I} = 1 + I + I^2 + \cdots$ as a statement about linear operators.

  • In contrast, $D$ is unbounded as an operator, meaning $\|D\| = \infty$. Thus algebra tricks like $\frac{1}{1-D} = 1 + D + D^2 + \cdots$ will break down if you put in the wrong function $g$.