### An optimization problem: theme

Let’s start simple:

Question 1: What is the linear function $\ell(x)$ that minimizes the integral $\int_{-1}^1 |x^2 + x - \ell(x)| ~\mathrm{d}x$? In other words, what is the best linear approximation of $x^2 + x$ in the $L^1([-1,1])$ sense?

This is something that can in principle be solved by a high schooler with some calculus training. Here’s how one solution may go:

Solution: All linear functions take the form $\ell(x) = ax + b$. The integrand is equal to $x^2 + x - \ell(x)$ when $x^2 + x \geq \ell(x)$, and $\ell(x) - x - x^2$ otherwise. So we need to find the points of intersection. This requires solving $x^2 + x - ax - b = 0$, which we can do with the quadratic formula. In the case where $x^2 + x - a x - b$ is signed, we see that by changing $b$ we can make the integrand pointwise strictly smaller, and hence such an $\ell$ cannot be the minimizer. So we know that at the minimizer there must exist at least one root.

Consider the case where there is only one root in the interval (counted with multiplicity), call the root $x_0$. We have that the integral to minimize is equal to
$\displaystyle \left| \int_{-1}^{x_0} x^2 + x - ax - b~\mathrm{d}x \right| + \left| \int_{x_0}^1 x^2 + x - a x - b ~\mathrm{d}x \right|$
each part of which can be computed explicitly to obtain
$\displaystyle \left| \frac13 x_0^3 + \frac13 + \frac{1-a}{2} x_0^2 - \frac{1-a}{2} - b x_0 - b \right| + \left| \frac13 - \frac13 x_0^3 + \frac{1-a}{2} - \frac{1-a}{2} x_0^2 - b + b x_0\right|$
Since we know that the two terms come from integrands with opposite signs, we can combine them to get
$\displaystyle \left| \frac23 x_0^3 + (1-a) x_0^2 - (1-a) - 2b x_0 \right|$
as the quantity to minimize. Now, we cannot simply take the partial derivatives of the above expression with respect to $a,b$, set them to zero, and see what we get: the root $x_0$ also depends on the parameters. So what we would do then is plug in the expression for $x_0$ derived from the quadratic formula, $x_0 = \frac{1}{2} \left( a - 1 \pm \sqrt{ (1-a)^2 + 4b}\right)$, and then take the partial derivatives. Before that, though, we can simplify a little: since $x_0^3 + (1-a) x_0^2 - b x_0 = 0$ (the root equation multiplied by $x_0$), the quantity to minimize is now
$\displaystyle \left| - \frac13 x_0^3 - (1-a) - b x_0 \right|$
A long computation taking $\partial_b$ now shows that necessarily $x_0 = 0$, which implies that $b = 0$. But over the range of $a$ for which there is only one root in the interval, the quantity does not achieve a minimum. (The formal minimizer occurs at $a = 1$, but in that case the integrand of the original cost function is signed.)

So we are down to the case where there are two roots in the interval. Now we call the roots $x_+$ and $x_-$, and split the integral into
$\displaystyle \left| \int_{-1}^{x_-} x^2 + x - ax - b~\mathrm{d}x - \int_{x_-}^{x_+} x^2 + x - ax - b~\mathrm{d}x + \int_{x_+}^1 x^2 + x - ax - b~\mathrm{d}x \right|$
and proceed as before. (The remainder of the proof is omitted; the reader is encouraged to carry this computation out by hand to see how tedious it is.)
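Before moving on, the answer can be sanity-checked numerically. The brute-force search below is a sketch (not part of the proof; the grid resolution and the midpoint quadrature are arbitrary choices of mine): it minimizes the $L^1$ cost for $f(x) = x^2 + x$ over lines $ax + b$ and locates the minimizer near $a = 1$, $b = \frac14$, i.e. $\ell(x) = x + \frac14$, the secant line through $x = \pm\frac12$ consistent with the theorem proved below.

```python
import math

def l1_cost(a, b, f, n=800):
    """Midpoint-rule approximation of the integral of |f(x) - (a x + b)| over [-1, 1]."""
    h = 2.0 / n
    return sum(abs(f(-1.0 + (i + 0.5) * h) - (a * (-1.0 + (i + 0.5) * h) + b)) * h
               for i in range(n))

f = lambda x: x * x + x

# Coarse grid search over the slope a and the intercept b.
best = min(
    ((a0, b0)
     for a0 in [0.5 + 0.025 * k for k in range(41)]
     for b0 in [-0.25 + 0.025 * k for k in range(41)]),
    key=lambda ab: l1_cost(ab[0], ab[1], f),
)
print(best)  # close to (1.0, 0.25)
```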

Suppose now after this long computation, you are given the question:

Question 2: What is the linear function $\ell(x)$ that minimizes the integral $\int_{-1}^1 |e^x - \ell(x)| ~\mathrm{d}x$?

You say: can’t we just do what we did before? And then you realize that there is a problem.

In the “solution” to Question 1 given above, crucially we were able to have closed form solutions for $x^2 + x - ax - b = 0$ via the quadratic formula. To replicate the same proof we would need closed form solutions to $e^x - ax - b = 0$ as a function of $a$ and $b$. This is not easy to do.

At this point, however, we realize something important:

The formula $\ell(x) = ax + b$, aka the slope-intercept form, is not the only representation of linear functions.

In particular, we saw in the “solution” to Question 1 above that it is important to consider the roots. Can we parametrize the linear functions in terms of the roots?

The answer, it turns out, is yes. And it depends on convexity.

First, observe that whenever $f(x)$ is a convex function (in our context we take $f$ to be twice continuously differentiable with positive second derivative), there can be at most 2 roots to the equation $f(x) - \ell(x) = 0$, where $\ell(x)$ is linear. (Suppose there are three or more roots. Between two of the roots there must be a local maximum of $f - \ell$. But at a local maximum the second derivative is non-positive, whereas here the second derivative $f''(x) - \ell''(x) = f''(x) > 0$ by assumption.)
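This observation is easy to illustrate numerically (a sketch; the test function $e^x$, the sample lines, and the sampling grid are my own choices): counting sign changes of $e^x - (ax + b)$ on a fine grid never yields more than two.

```python
import math

def sign_changes(g, n=10000):
    """Count sign changes of g on a fine grid over [-1, 1], skipping exact zeros."""
    changes = 0
    prev = 0
    for i in range(n + 1):
        v = g(-1.0 + 2.0 * i / n)
        s = (v > 0) - (v < 0)
        if s != 0:
            if prev != 0 and s != prev:
                changes += 1
            prev = s
    return changes

# e^x minus various lines: convexity forces at most two sign changes.
for a, b in [(0.0, 1.0), (1.0, 1.0), (2.0, 0.5), (-1.0, 2.0), (1.2, 1.1)]:
    assert sign_changes(lambda x: math.exp(x) - (a * x + b)) <= 2
```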

I will rule out the cases of zero roots and exactly one root by pictures.

A zero-crossing approximant (solid green and blue) can always be improved (dashed green and blue) by changing the intercept.

A single crossing approximant (solid blue) can always be improved (dashed blue) by changing the slope.

Therefore whenever the function we want to approximate is convex (or concave, for that matter), the minimizing linear function must intersect it exactly twice on the interval. Now, a linear function is completely determined by its values at two points. So instead of using the slope-intercept form, it is better to use a two-point form
$\displaystyle \ell(x) = (x - x_-) \cdot \frac{ f(x_+) - f(x_-)}{x_+ - x_-} + f(x_-)$
parametrizing the linear approximation using its two intersections with our function.

In other words, rather than computing the intersections $x_+,x_-$ from the parameters $a,b$, we express the slope and intercept as functions of the intersections $x_+, x_-$. And in this form we can prove a more general theorem.
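As code, this parametrization is one line (a minimal sketch; the name `secant_line` is mine): build the line through $(x_-, f(x_-))$ and $(x_+, f(x_+))$, and check that it interpolates at the two intended intersections.

```python
import math

def secant_line(f, x_minus, x_plus):
    """Return the line through (x_minus, f(x_minus)) and (x_plus, f(x_plus))."""
    slope = (f(x_plus) - f(x_minus)) / (x_plus - x_minus)
    return lambda x: (x - x_minus) * slope + f(x_minus)

# By construction the "boundary terms" of the upcoming proof vanish:
ell = secant_line(math.exp, -0.5, 0.5)
assert abs(ell(-0.5) - math.exp(-0.5)) < 1e-12
assert abs(ell(0.5) - math.exp(0.5)) < 1e-12
```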

Theorem. Let $f(x)$ be a strictly convex function on $[-1,1]$. Then the minimizer of the cost function $\int_{-1}^1 |f(x) - \ell(x)| ~\mathrm{d}x$ among linear functions $\ell(x)$ is achieved by
$\displaystyle \ell(x) = (x + \frac12) \cdot (f(\frac12) - f(-\frac12)) + f(-\frac12)$
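In particular, the theorem answers Question 2: for $f(x) = e^x$ the minimizer is $\ell(x) = (x + \frac12)(e^{1/2} - e^{-1/2}) + e^{-1/2}$. A quick numerical check (a sketch; the quadrature resolution and the perturbation sizes are arbitrary) confirms that no nearby line does better.

```python
import math

def l1_cost(ell, f=math.exp, n=4000):
    """Midpoint-rule approximation of the integral of |f(x) - ell(x)| over [-1, 1]."""
    h = 2.0 / n
    return sum(abs(f(-1.0 + (i + 0.5) * h) - ell(-1.0 + (i + 0.5) * h)) * h
               for i in range(n))

slope = math.exp(0.5) - math.exp(-0.5)  # f(1/2) - f(-1/2); here x_+ - x_- = 1
theorem_line = lambda x: (x + 0.5) * slope + math.exp(-0.5)
c0 = l1_cost(theorem_line)

# Perturb slope and intercept; the theorem's line should win every time.
for da in [-0.1, -0.01, 0.01, 0.1]:
    for db in [-0.1, -0.01, 0.01, 0.1]:
        assert c0 < l1_cost(lambda x: (x + 0.5) * (slope + da) + math.exp(-0.5) + db)
```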

I will give two proofs. The first is the computational one.

Proof 1: Since $f(x)$ is convex, we parametrize the space of linear functions (having two intersections with $f(x)$ in the interval $[-1,1]$) as described above, with $x_- \leq x_+$. Then the sign changes of the integrand imply that the cost function can be expressed as
$\displaystyle C(x_-, x_+; f) = \int_{-1}^{x_-} f(x) - \ell(x; x_-, x_+) ~\mathrm{d}x + \int_{x_+}^1 f(x) - \ell(x; x_-, x_+) ~\mathrm{d}x - \int_{x_-}^{x_+} \ell(x;x_-,x_+) - f(x) ~\mathrm{d}x$
Taking the derivatives with respect to $x_\pm$ and using the fundamental theorem of calculus, we have
$\displaystyle \partial_{x_-} C = 2[f(x_-) - \ell(x_-; x_-, x_+)] - \int_{-1}^{x_-} \partial_{x_-} \ell(x; x_-, x_+) ~\mathrm{d}x - \int_{x_+}^1 \partial_{x_-} \ell(x; x_-, x_+)~\mathrm{d}x + \int_{x_-}^{x_+} \partial_{x_-} \ell(x; x_-, x_+) ~\mathrm{d}x$
and a similar one for $x_+$. Here we see why it is so nice to use the parametrization we chose: the “boundary terms” that appear from the fundamental theorem of calculus evaluate to exactly 0 by construction.

Computing the $\partial_{x_-} \ell(x; x_-, x_+)$ gives us
$\displaystyle \partial_{x_-} \ell = - \frac{f(x_+) - f(x_-)}{x_+-x_-} + (x - x_-) \cdot \frac{- f'(x_-)}{x_+ - x_-} + (x - x_-) \cdot \frac{f(x_+) - f(x_-)}{x_+ - x_-} \frac{1}{x_+ - x_-} + f'(x_-)$
which we simplify to
$\displaystyle \partial_{x_-} \ell = \left[ \frac{x - x_-}{x_+ - x_-} - 1\right] \left[ \frac{f(x_+) - f(x_-)}{x_+ - x_-} - f'(x_-) \right]$
and similarly
$\displaystyle \partial_{x_+} \ell = \frac{x - x_-}{x_+ - x_-} \left[ f'(x_+) - \frac{f(x_+) - f(x_-)}{x_+ - x_-} \right]$
By strict convexity the terms in the brackets involving $f'$ are non-zero constants. So we see that the problem reduces to choosing $x_+, x_-$ such that
$\displaystyle \left( \int_{-1}^{x_-} + \int_{x_+}^1 - \int_{x_-}^{x_+} \right) (x - x_-) ~\mathrm{d}x = 0 = \left( \int_{-1}^{x_-} + \int_{x_+}^1 - \int_{x_-}^{x_+} \right) (x - x_+) ~\mathrm{d}x$
Directly evaluating the resulting integrals we see that this is achieved when $x_+ = \frac12$ and $x_- = -\frac12$. Q.E.D.
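The final evaluation can be double-checked in closed form (a sketch; the helper names are mine): over the three pieces, $\int \mathrm{d}x$ and $\int x ~\mathrm{d}x$ have elementary antiderivatives, and the two stationarity conditions vanish exactly at $(x_-, x_+) = (-\frac12, \frac12)$.

```python
def mass(xm, xp):
    """(int_{-1}^{xm} + int_{xp}^{1} - int_{xm}^{xp}) 1 dx, in closed form."""
    return (xm + 1.0) + (1.0 - xp) - (xp - xm)

def moment(xm, xp):
    """(int_{-1}^{xm} + int_{xp}^{1} - int_{xm}^{xp}) x dx, in closed form."""
    return (xm**2 - 1.0) / 2 + (1.0 - xp**2) / 2 - (xp**2 - xm**2) / 2

def condition(xm, xp, c):
    """(int_{-1}^{xm} + int_{xp}^{1} - int_{xm}^{xp}) (x - c) dx."""
    return moment(xm, xp) - c * mass(xm, xp)

# Both stationarity conditions hold at (x_-, x_+) = (-1/2, 1/2)...
assert condition(-0.5, 0.5, -0.5) == 0.0
assert condition(-0.5, 0.5, 0.5) == 0.0
# ...and fail at a nearby pair.
assert condition(-0.4, 0.5, 0.5) != 0.0
```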

A second proof, which involves much less computation, can be found using the intuition we derived from the two pictures above and some knowledge of the philosophy of Lebesgue integration.

Proof 2: From the discussion above we know to expect there to be exactly two intersections between $\ell(x)$ and $f(x)$, and by convexity we know that $(f(x) - \ell(x))(x - x_-)(x - x_+) \geq 0$. Now consider the cost function for two different lines $\ell_1$ and $\ell_2$ differing by a small vertical translation.

The difference in cost function between the blue and green lines is the signed area of the shaded region, where the sign is whether the point is above or below the red curve.

Taking the vertical difference to be infinitesimal, we see that the variation $\delta C$ is precisely
$\delta C = (x_+ - x_-) - (1 - x_+) - (x_- + 1)$
So for $\delta C$ to be zero, it is necessary that $x_+ - x_-$ is exactly 1; that is, the function $f(x)$ spends exactly half the time above $\ell(x)$, and half the time below.

Using again that $f$ is convex, we see that having fixed $x_+ - x_-$, to fix the “position” of this interval it remains to fix the slope. The analogous picture to above for slope variation is as follows.

The variation in the cost function for the blue and green lines, which differ by a change of slope (rotation around the intersection point), is the signed area of the shaded region, where the sign is determined by whether the red curve is above or below it.

Rotating a line $\ell$ about center $x_0$ with a small angle incurs a variation of the cost
$\displaystyle \delta C = \left( \int_{-1}^{x_-} + \int_{x_+}^1 - \int_{x_-}^{x_+} \right) (x - x_0) ~\mathrm{d}x$
This $\delta C = 0$ needs to hold for every center $x_0$ at the critical parameter, and in particular must hold for $x_0 = \frac12 (x_- + x_+)$. By symmetry, the contribution of the integral over $(x_-, x_+)$ always vanishes for this choice. But the integrals over the two external regions cancel only if $x_- + 1 = 1 - x_+$, which now implies the desired result. Q.E.D.
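Both variational formulas from this proof can be checked by direct arithmetic (a sketch; I evaluate the displayed integrals in closed form by hand): at $x_\pm = \pm\frac12$ the translation variation vanishes, and the rotation variation vanishes for every center $x_0$.

```python
def translation_variation(xm, xp):
    """delta C for a small vertical shift: (x_+ - x_-) - (1 - x_+) - (x_- + 1)."""
    return (xp - xm) - (1.0 - xp) - (xm + 1.0)

def rotation_variation(xm, xp, x0):
    """(int_{-1}^{xm} + int_{xp}^{1} - int_{xm}^{xp}) (x - x0) dx, in closed form."""
    moment = (xm**2 - 1.0) / 2 + (1.0 - xp**2) / 2 - (xp**2 - xm**2) / 2
    mass = (xm + 1.0) + (1.0 - xp) - (xp - xm)
    return moment - x0 * mass

assert translation_variation(-0.5, 0.5) == 0.0
for x0 in [-1.0, -0.5, 0.0, 0.25, 1.0]:
    assert rotation_variation(-0.5, 0.5, x0) == 0.0
```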