An optimization problem: theme

by Willie Wong

Let’s start simple:

Question 1: What is the linear function \ell(x) that minimizes the integral \int_{-1}^1 |x^2  + x - \ell(x)| ~\mathrm{d}x? In other words, what is the best linear approximation of x^2 + x in the L^1([-1,1]) sense?

This is something that can in principle be solved by a high schooler with some calculus training. Here’s how one solution may go:

Solution: All linear functions take the form \ell(x) = ax + b. The integrand is equal to x^2 + x - \ell(x) when x^2 + x \geq \ell(x), and \ell(x) - x - x^2 otherwise. So we need to find the points of intersection. This requires solving x^2 + x - ax - b = 0, which we can do by the quadratic formula. In the case where x^2 + x - a x - b is signed, we see that by changing b we can make the integrand strictly smaller, and hence such an \ell cannot be the minimizer. So we know that at the minimizer there must exist at least one root.

Consider the case where there is only one root in the interval (counted with multiplicity); call the root x_0. We have that the integral to minimize is equal to
\displaystyle \left| \int_{-1}^{x_0} x^2 + x - ax - b~\mathrm{d}x \right| + \left| \int_{x_0}^1 x^2 + x - a x - b ~\mathrm{d}x \right|
each part of which can be computed explicitly to obtain
\displaystyle \left| \frac13 x_0^3  + \frac13 + \frac{1-a}{2} x_0^2 - \frac{1-a}{2} - b x_0 - b \right| + \left| \frac13 - \frac13 x_0^3 + \frac{1-a}{2} - \frac{1-a}{2} x_0^2 - b + b x_0\right|
Since we know that the two terms come from integrands with different signs, we can combine them to get
\displaystyle \left| \frac23 x_0^3 + (1-a) x_0^2 - (1-a) - 2b x_0 \right|
as the quantity to minimize. Now, we cannot just take the partial derivatives of the above expression with respect to a, b, set them to zero, and see what we get: the root x_0 also depends on the parameters. So what we would do is plug in x_0 using the expression derived from the quadratic formula, x_0 = \frac{1}{2} \left( a - 1 \pm \sqrt{ (1-a)^2 + 4b}\right), and then take the partial derivatives. Before that, though, we can simplify a little: since x_0^3 + (1-a) x_0^2 - b x_0 = 0 from the constraint, the quantity to minimize is now
\displaystyle \left| - \frac13 x_0^3 - (1-a) - b x_0 \right|
A long computation taking \partial_b now shows that necessarily x_0 = 0, which implies that b = 0. But for the range of a where there is only one root in the interval, the quantity does not achieve a minimum. (The formal minimizer happens at a = 1, but in that case the integrand of the original cost function is signed.)

So we are down to the case where there are two roots in the interval. Now we call the roots x_+ and x_-, and split the integral into
\displaystyle \left| \int_{-1}^{x_-} x^2 + x - ax - b~\mathrm{d}x  - \int_{x_-}^{x_+} x^2 + x - ax - b~\mathrm{d}x + \int_{x_+}^1 x^2 + x - ax - b~\mathrm{d}x \right|
and proceed as before. (The remainder of the proof is omitted; the reader is encouraged to carry this computation out by hand to see how tedious it is.)
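
Incidentally, the final answer is easy to corroborate numerically. The following is a minimal sketch (assuming scipy is available; the helper cost is just for illustration) that minimizes the cost over the slope-intercept parameters by quadrature and a derivative-free method; it lands near a = 1, b = 1/4, matching the theorem proved further below.

```python
# Numerical sanity check for Question 1: minimize the L^1 cost over (a, b).
# A sketch only; assumes scipy is available.
from scipy.integrate import quad
from scipy.optimize import minimize

def cost(params):
    a, b = params
    # integral of |x^2 + x - (a x + b)| over [-1, 1], by adaptive quadrature
    integrand = lambda x: abs(x**2 + x - (a * x + b))
    return quad(integrand, -1, 1, limit=200)[0]

# Nelder-Mead avoids differentiating the non-smooth cost.
result = minimize(cost, x0=[0.0, 0.0], method="Nelder-Mead")
print(result.x)  # approximately [1.0, 0.25], i.e. ell(x) = x + 1/4
```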

Suppose now after this long computation, you are given the question:

Question 2: What is the linear function \ell(x) that minimizes the integral \int_{-1}^1 |e^x - \ell(x)| ~\mathrm{d}x?

You say: can’t we just do what we did before? And then you realize that there is a problem.

In the “solution” to Question 1 given above, we crucially relied on having closed-form solutions of x^2 + x - ax - b = 0 via the quadratic formula. To replicate the same proof we would need closed-form solutions of e^x - ax - b = 0 as functions of a and b. This is not easy to do.
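
Concretely, the best one can do is find the intersections numerically for each fixed a and b. Here is a minimal sketch (assuming numpy and scipy are available; the line a = 1, b = 1.2 is a hypothetical example chosen so that it crosses e^x twice on [-1,1]):

```python
# Finding the intersections of e^x with a line numerically (a sketch).
import numpy as np
from scipy.optimize import brentq

a, b = 1.0, 1.2  # a hypothetical line that crosses e^x twice on [-1, 1]
g = lambda x: np.exp(x) - a * x - b

# g is convex with its minimum at x = log(a) = 0, so there is at most one
# root on each side of the minimum; bracket and solve each numerically.
x_min = np.log(a)
x_minus = brentq(g, -1, x_min)
x_plus = brentq(g, x_min, 1)
print(x_minus, x_plus)  # two numbers, but no formula in terms of a and b
```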

At this point, however, we realize something important:

The formula \ell(x) = ax + b, aka the slope-intercept form, is not the only representation of linear functions.

In particular, we saw in the “solution” to Question 1 above that it is important to consider the roots. Can we parametrize the linear functions in terms of the roots?

The answer, it turns out, is yes. And it depends on convexity.

First, observe that whenever f(x) is a convex function (in our context we take f to be twice continuously differentiable with positive second derivative), there can be at most 2 roots to the equation f(x) - \ell(x) = 0, where \ell(x) is linear. (Suppose there are three or more roots; then between two of the roots there must be a local maximum of f(x) - \ell(x). But at a local maximum the second derivative is non-positive, whereas f''(x) - \ell''(x) = f''(x) > 0 by assumption.)

I will rule out the cases where there are zero roots or only one root using pictures.

A zero-crossing approximant (solid green and blue) can always be improved (dashed green and blue) by changing the intercept.

A single crossing approximant (solid blue) can always be improved (dashed blue) by changing the slope.

Therefore whenever the function we want to approximate is convex (or concave for that matter), our approximating linear function must intersect it exactly twice on the interval. Now, a linear function is precisely described by its values at two points. So instead of using the slope-intercept form, it is better to use a point-slope form
\displaystyle \ell(x) = (x - x_-) \cdot \frac{ f(x_+) - f(x_-)}{x_+ - x_-} + f(x_-)
parametrizing the linear approximation using its two intersections with our function.

In other words, rather than computing the intersections x_+,x_- from the parameters a,b, we use the explicit expression of a,b in slope-intercept form as functions of the intersections x_+, x_-. And in this form we can prove a more general theorem.
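
In code, the new parametrization is nothing more than the secant line of f through the two prescribed intersections; a minimal sketch (the helper name ell is just for illustration):

```python
# The root parametrization: the line through (x_minus, f(x_minus)) and
# (x_plus, f(x_plus)), written in point-slope form as in the text.
def ell(x, x_minus, x_plus, f):
    slope = (f(x_plus) - f(x_minus)) / (x_plus - x_minus)
    return (x - x_minus) * slope + f(x_minus)

# By construction ell agrees with f at the two roots, e.g.
# ell(x_minus, x_minus, x_plus, f) == f(x_minus).
```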

Theorem. Let f(x) be a strictly convex function on [-1,1]. Then the minimizer of the cost function \int_{-1}^1 |f(x) - \ell(x)| ~\mathrm{d}x among linear functions \ell(x) is achieved by
\displaystyle \ell(x) = (x + \frac12) \cdot (f(\frac12) - f(-\frac12)) + f(-\frac12)
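
Concretely, the theorem says the optimal line is the secant line of f through x = -\frac12 and x = \frac12, which answers both Questions at once. A minimal sketch (best_line is an illustrative helper): for Question 1 it returns \ell(x) = x + \frac14, and for Question 2 the slope is e^{1/2} - e^{-1/2}.

```python
# The theorem in action (a sketch): the minimizer is the secant line of f
# through the points x = -1/2 and x = 1/2.
import math

def best_line(f):
    slope = f(0.5) - f(-0.5)           # secant slope over an interval of length 1
    intercept = f(-0.5) + 0.5 * slope  # value of the line at x = 0
    return slope, intercept

print(best_line(lambda x: x**2 + x))   # (1.0, 0.25): Question 1, ell(x) = x + 1/4
print(best_line(math.exp))             # about (1.0422, 1.1276): Question 2
```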

I will give two proofs. The first is the computational one.

Proof 1: Since f(x) is convex, we parametrize the space of linear functions (having two intersections with f(x) in the interval [-1,1]) as described above, with x_- < x_+. Then the sign changes of the integrand imply that the cost function can be expressed as
\displaystyle C(x_-, x_+; f) = \int_{-1}^{x_-} f(x) - \ell(x; x_-, x_+) ~\mathrm{d}x + \int_{x_+}^1 f(x) - \ell(x; x_-, x_+) ~\mathrm{d}x + \int_{x_-}^{x_+} \ell(x;x_-,x_+) - f(x) ~\mathrm{d}x
Taking the derivatives relative to x_\pm and using the fundamental theorem of calculus we have
\displaystyle \partial_{x_-} C =  2[f(x_-) - \ell(x_-; x_-, x_+)] - \int_{-1}^{x_-} \partial_{x_-} \ell(x; x_-, x_+) ~\mathrm{d}x - \int_{x_+}^1 \partial_{x_-} \ell(x; x_-, x_+)~\mathrm{d}x + \int_{x_-}^{x_+} \partial_{x_-} \ell(x; x_-, x_+) ~\mathrm{d}x
and a similar one for x_+. Here we see why it is so nice to use the parametrization we chose: the “boundary terms” that appear under the fundamental theorem of calculus evaluate to exactly 0 by construction, since f(x_\pm) = \ell(x_\pm; x_-, x_+).

Computing \partial_{x_-} \ell(x; x_-, x_+) gives us
\displaystyle \partial_{x_-} \ell = - \frac{f(x_+) - f(x_-)}{x_+-x_-} + (x - x_-) \cdot \frac{- f'(x_-)}{x_+ - x_-} + (x - x_-) \cdot \frac{f(x_+) - f(x_-)}{x_+ - x_-} \frac{1}{x_+ - x_-} + f'(x_-)
which we simplify to
\displaystyle \partial_{x_-} \ell = \left[ \frac{x - x_-}{x_+ - x_-} - 1\right] \left[ \frac{f(x_+) - f(x_-)}{x_+ - x_-} - f'(x_-) \right]
and similarly
\displaystyle \partial_{x_+} \ell = \frac{x - x_-}{x_+ - x_-} \left[ f'(x_+) - \frac{f(x_+) - f(x_-)}{x_+ - x_-} \right]
By strict convexity the bracketed terms involving f' are non-zero constants: the secant slope \frac{f(x_+) - f(x_-)}{x_+ - x_-} lies strictly between f'(x_-) and f'(x_+). So we see that the problem reduces to choosing x_+, x_- such that
\displaystyle \left( \int_{-1}^{x_-} + \int_{x_+}^1 - \int_{x_-}^{x_+} \right) x - x_- ~\mathrm{d}x = 0 = \left( \int_{-1}^{x_-} + \int_{x_+}^1 - \int_{x_-}^{x_+} \right) x - x_+ ~\mathrm{d}x
Directly evaluating the resulting integrals (subtracting the two conditions forces x_+ - x_- = 1, and substituting back forces x_+ = -x_-) we see that this is achieved when x_+ = \frac12 and x_- = -\frac12. Q.E.D.
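
For the skeptical reader, the final system can also be handed to a computer algebra system. A minimal sketch (assuming sympy is available; signed_integral is an illustrative helper):

```python
# Symbolic check of the final system in Proof 1 (a sketch; assumes sympy).
import sympy as sp

x, xm, xp = sp.symbols("x x_m x_p", real=True)

def signed_integral(expr):
    # (int_{-1}^{x_-} + int_{x_+}^{1} - int_{x_-}^{x_+}) expr dx
    return (sp.integrate(expr, (x, -1, xm))
            + sp.integrate(expr, (x, xp, 1))
            - sp.integrate(expr, (x, xm, xp)))

solutions = sp.solve([signed_integral(x - xm), signed_integral(x - xp)], [xm, xp])
print(solutions)  # the admissible solution with x_m < x_p is (-1/2, 1/2)
```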

A second proof which involves much less computation can be found using the intuition we derived from the two pictures above and some knowledge of the philosophy of Lebesgue integration.

Proof 2: From the discussion above we expect exactly two intersections between \ell(x) and f(x), and by convexity we know that (f(x) - \ell(x))(x - x_-)(x - x_+) \geq 0. Now consider the cost function for two different lines \ell_1 and \ell_2 differing by a small vertical translation.

The difference in cost function between the blue and green lines is the signed area of the shaded region, where the sign is whether the point is above or below the red curve.

Taking the vertical difference to be infinitesimal, we see that the variation \delta C is precisely
\displaystyle \delta C = (x_+ - x_-) - (1 - x_+) - (x_- + 1)
So for \delta C to be zero, it is necessary that x_+ - x_- is exactly 1; in other words, the function f(x) spends exactly half the time above \ell(x), and half the time below.

Using again that f is convex, we see that having fixed x_+ - x_-, to fix the “position” of this interval it remains to fix the slope. The analogous picture to above for slope variation is as follows.

The variation in the cost function for the blue and green lines, which differ by a change of slope (rotation around the intersection point), is the signed area of the shaded region, where the sign is determined by whether the red curve is above or below it.

Rotating a line \ell about center x_0 with a small angle incurs a variation of the cost
\displaystyle \delta C = \left( \int_{-1}^{x_-} + \int_{x_+}^1 - \int_{x_-}^{x_+} \right)  x - x_0 ~\mathrm{d}x
At the critical parameter this \delta C = 0 needs to hold for every x_0, and in particular for x_0 = \frac12 (x_- + x_+). By symmetry the contribution of the integral between (x_-, x_+) always vanishes for this choice of x_0. But the integrals over the two external regions cancel only if x_- + 1 = 1 - x_+, which now implies the desired result. Q.E.D.
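
As a last sanity check, one can minimize the cost numerically in the root parametrization, say for f(x) = e^x from Question 2. A minimal sketch (assuming numpy and scipy are available):

```python
# Numerical corroboration for f(x) = exp(x): minimize the cost over the
# roots (x_minus, x_plus) directly. A sketch; assumes numpy and scipy.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

def cost(params):
    xm, xp = params
    slope = (np.exp(xp) - np.exp(xm)) / (xp - xm)
    line = lambda x: (x - xm) * slope + np.exp(xm)
    return quad(lambda x: abs(np.exp(x) - line(x)), -1, 1, limit=200)[0]

result = minimize(cost, x0=[-0.3, 0.6], method="Nelder-Mead")
print(result.x)  # approximately [-0.5, 0.5], as the theorem predicts
```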
