### An optimization problem: theme

#### by Willie Wong

Let’s start simple:

Question 1: What is the linear function $\ell(x) = ax + b$ that minimizes the integral $\int_{-1}^{1} |x^2 - ax - b| \, dx$? In other words, what is the best linear approximation of $x^2$ on $[-1,1]$ in the $L^1$ sense?

This is something that can in principle be solved by a high schooler with some calculus training. Here’s how one solution may go:

*Solution*: All linear functions take the form $\ell(x) = ax + b$. The integrand $|x^2 - ax - b|$ is equal to $x^2 - ax - b$ when $x^2 - ax - b \geq 0$, and $-(x^2 - ax - b)$ otherwise. So we need to find the points of intersection. This requires solving $x^2 - ax - b = 0$, which we can do by the quadratic formula: $x = \frac{a \pm \sqrt{a^2 + 4b}}{2}$. In the case where $x^2 - ax - b$ is signed on $[-1,1]$, we see that by changing $b$ we can make the integrand strictly smaller, and hence we cannot be at a minimizer. So we know that at the minimizer there must exist at least one root in the interval.
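As a quick numerical sanity check of the signedness argument (a sketch, not part of the proof; the values $a = 0$, $b = -1$ are chosen purely for illustration so that the integrand $x^2 + 1$ is strictly positive), approximating the cost by a Riemann sum shows that raising a line lying entirely below the parabola strictly decreases the cost:

```python
def l1_cost(a, b, n=20000):
    """Midpoint Riemann sum for the cost  ∫_{-1}^{1} |x^2 - a*x - b| dx."""
    h = 2.0 / n
    total = 0.0
    for k in range(n):
        x = -1.0 + (k + 0.5) * h
        total += abs(x * x - a * x - b) * h
    return total

# With a = 0, b = -1 the integrand x^2 + 1 is strictly positive (signed),
# so raising the line (increasing b) strictly decreases the cost.
assert l1_cost(0.0, -0.5) < l1_cost(0.0, -1.0)
```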

Consider the case where there is only one root in the interval (counted with multiplicity); call the root $x_0$. Say the parabola lies below the line to the left of $x_0$ and above it to the right (the other case is symmetric). We have that the integral to minimize is equal to

$$ -\int_{-1}^{x_0} (x^2 - ax - b)\,dx + \int_{x_0}^{1} (x^2 - ax - b)\,dx, $$

each part of which can be computed explicitly to obtain

$$ -\left( \frac{x_0^3}{3} - \frac{a x_0^2}{2} - b x_0 \right) + \left( -\frac13 - \frac a2 + b \right) + \left( \frac13 - \frac a2 - b \right) - \left( \frac{x_0^3}{3} - \frac{a x_0^2}{2} - b x_0 \right). $$

Since we know that the two terms come from integrands with different signs, we can combine to get

$$ C = -\frac{2 x_0^3}{3} + a x_0^2 + 2 b x_0 - a $$

as the quantity to be minimized. Now, we cannot just take the partial derivatives of the above expression with respect to $a$ and $b$, set them to zero, and see what we get: the root $x_0$ depends also on the parameters. So what we would do then is to plug in $x_0$ using the expression derived from the quadratic formula, $x_0 = \frac{a \pm \sqrt{a^2 + 4b}}{2}$, and then take the partial derivatives. Before that, though, we can simplify a little bit: since $b = x_0^2 - a x_0$ from the constraint, the quantity to minimize is now

$$ C = \frac{4 x_0^3}{3} - a x_0^2 - a. $$

A long computation taking the partial derivatives now shows that necessarily $x_0 = a/2$, which implies that $a^2 + 4b = 0$. But for the range of $(a, b)$ where there is only one root in the interval, the quantity does not achieve a minimum. (The formal minimizer happens at the double root $x_0 = a/2$, but we see that for this case the integrand of the original cost function is signed.)

So we are down to the case where there are two roots in the interval. Now we call the roots $x_-$ and $x_+$ (with $x_- < x_+$), and split the integral into

$$ \int_{-1}^{x_-} (x^2 - ax - b)\,dx - \int_{x_-}^{x_+} (x^2 - ax - b)\,dx + \int_{x_+}^{1} (x^2 - ax - b)\,dx $$

and proceed as before. (The remainder of the proof is omitted; the reader is encouraged to carry this computation out by hand to see how tedious it is.)
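Though tedious by hand, the answer is easy to check numerically. Here is a brute-force sketch (a coarse grid search over slope-intercept parameters, with grid ranges chosen by hand); it lands on $a = 0$, $b = 1/4$, i.e. the line interpolating $x^2$ at $x = \pm 1/2$, consistent with the theorem proved below:

```python
def l1_cost(a, b, n=400):
    """Midpoint Riemann sum for  ∫_{-1}^{1} |x^2 - a*x - b| dx."""
    h = 2.0 / n
    total = 0.0
    for k in range(n):
        x = -1.0 + (k + 0.5) * h
        total += abs(x * x - a * x - b) * h
    return total

# brute-force search over a coarse grid of slope-intercept parameters
candidates = [(i / 20, j / 20) for i in range(-20, 21) for j in range(-20, 21)]
a_best, b_best = min(candidates, key=lambda p: l1_cost(*p))
# the grid minimizer is a = 0, b = 1/4 (the constant line y = 1/4)
```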

Suppose now after this long computation, you are given the question:

Question 2: What is the linear function that minimizes the integral $\int_{-1}^{1} |e^x - ax - b| \, dx$?

You say: can’t we just do what we did before? And then you realize that there is a problem.

In the “solution” to Question 1 given above, crucially we were able to have closed form solutions for the intersection points via the quadratic formula. To replicate the same proof we would need closed form solutions to $e^x - ax - b = 0$ as a function of $a$ and $b$. This is not easy to do.
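Note that the difficulty is only with *closed forms*: numerically the intersections are easy to find. A sketch using bisection (the line coefficients $a = 1.5$, $b = 1.2$ are picked arbitrarily so that the line crosses $e^x$ twice on $[-1, 1]$):

```python
import math

def bisect_root(f, lo, hi, steps=100):
    """Bisection for a root of f in [lo, hi], assuming f changes sign there."""
    f_lo = f(lo)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if f_lo * f(mid) <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f(mid)
    return 0.5 * (lo + hi)

# e^x - (1.5 x + 1.2) is positive at x = -1, negative at x = 0, positive at
# x = 1, so there is one intersection in each of (-1, 0) and (0, 1).
h = lambda x: math.exp(x) - (1.5 * x + 1.2)
x_minus = bisect_root(h, -1.0, 0.0)
x_plus = bisect_root(h, 0.0, 1.0)
assert abs(h(x_minus)) < 1e-9 and abs(h(x_plus)) < 1e-9
```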

At this point, however, we realize something important:

> The formula $\ell(x) = ax + b$, a.k.a. the slope-intercept form, is not the only representation of linear functions.

In particular, we saw in the “solution” to Question 1 above that it is important to consider the roots. Can we parametrize the functions in terms of their roots?

The answer, it turns out, is yes. And it depends on *convexity*.

First, observe that whenever $g$ is a convex function (in our context we take $g$ to be twice continuously differentiable with positive second derivative), there can be *at most* 2 roots to the equation $g(x) - \ell(x) = 0$, where $\ell$ is linear. (Suppose there are three or more roots; between two of the roots there must be a local maximum of $g - \ell$. But we know that at local maxima the second derivative is non-positive, whereas the second derivative $(g - \ell)'' = g'' > 0$ by assumption.)

I will rule out the case where there are only zero or one root by pictures.

Therefore whenever the function we want to approximate is convex (or concave for that matter), our approximating linear function must intersect it exactly twice on the interval. Now, a linear function is precisely described by its values at two points. So instead of using the slope-intercept form, it is better to use the two-point form

$$ \ell_{x_-, x_+}(x) = \frac{(x_+ - x)\, g(x_-) + (x - x_-)\, g(x_+)}{x_+ - x_-}, $$

parametrizing the linear approximation using its two intersections $x_\pm$ with our function $g$.

In other words, rather than computing the intersections $x_\pm$ from the parameters $(a, b)$, we use the explicit expressions of $a$ and $b$ in slope-intercept form as functions of the intersections $x_\pm$. And in this form we can prove a more general theorem.
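Concretely, the change of parametrization is just two-point interpolation; a minimal sketch (the helper name `line_through` is mine):

```python
import math

def line_through(g, x_minus, x_plus):
    """Slope-intercept coefficients (a, b) of the line agreeing with g
    at the two points x_minus and x_plus."""
    a = (g(x_plus) - g(x_minus)) / (x_plus - x_minus)  # slope of the chord
    b = g(x_minus) - a * x_minus                       # intercept
    return a, b

# the line meets g at the chosen intersections, by construction
a, b = line_through(math.exp, -0.25, 0.75)
assert abs(a * (-0.25) + b - math.exp(-0.25)) < 1e-12
assert abs(a * 0.75 + b - math.exp(0.75)) < 1e-12
```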

**Theorem.** Let $g$ be a strictly convex function on $[-1, 1]$. Then the minimizer of the cost function $C(\ell) = \int_{-1}^{1} |g(x) - \ell(x)| \, dx$ among linear functions $\ell$ is achieved by $\ell = \ell_{-1/2, 1/2}$, the line interpolating $g$ at the two points $x = \pm 1/2$.

I will give two proofs. The first is the computational one.

*Proof 1*: Since $g$ is convex, we parametrize the space of linear functions (having two intersections with $g$ in the interval $[-1, 1]$) as described above, with $\ell = \ell_{x_-, x_+}$. Then the sign changes of the integrand imply that the cost function can be expressed as

$$ C(x_-, x_+) = \int_{-1}^{x_-} (g - \ell)\,dx - \int_{x_-}^{x_+} (g - \ell)\,dx + \int_{x_+}^{1} (g - \ell)\,dx. $$

Taking the derivative relative to $x_-$ (say) and using the fundamental theorem of calculus we have

$$ \frac{\partial C}{\partial x_-} = -\int_{-1}^{x_-} \frac{\partial \ell}{\partial x_-}\,dx + \int_{x_-}^{x_+} \frac{\partial \ell}{\partial x_-}\,dx - \int_{x_+}^{1} \frac{\partial \ell}{\partial x_-}\,dx $$

and a similar one for $x_+$. Here we see why it is so nice to use the parametrization we chose: the “boundary terms” that appear under the fundamental theorem of calculus evaluate to exactly 0 by construction, since $g - \ell$ vanishes at $x_\pm$.

Computing the partial derivative $\partial \ell / \partial x_-$ gives us

$$ \frac{\partial \ell}{\partial x_-}(x) = \left( g'(x_-) - \frac{g(x_+) - g(x_-)}{x_+ - x_-} \right) \frac{x_+ - x}{x_+ - x_-}, $$

which we simplify to

$$ \frac{\partial C}{\partial x_-} = \left( g'(x_-) - \frac{g(x_+) - g(x_-)}{x_+ - x_-} \right) \frac{1}{x_+ - x_-} \left( -\int_{-1}^{x_-} (x_+ - x)\,dx + \int_{x_-}^{x_+} (x_+ - x)\,dx - \int_{x_+}^{1} (x_+ - x)\,dx \right) $$

and similarly

$$ \frac{\partial C}{\partial x_+} = \left( g'(x_+) - \frac{g(x_+) - g(x_-)}{x_+ - x_-} \right) \frac{1}{x_+ - x_-} \left( -\int_{-1}^{x_-} (x - x_-)\,dx + \int_{x_-}^{x_+} (x - x_-)\,dx - \int_{x_+}^{1} (x - x_-)\,dx \right). $$

By strict convexity the terms in the brackets involving $g'$ are non-zero constants: the slope of the chord of a strictly convex function lies strictly between the slopes at its two endpoints. So we see that the problem reduces to choosing $x_\pm$ such that

$$ -\int_{-1}^{x_-} (x_+ - x)\,dx + \int_{x_-}^{x_+} (x_+ - x)\,dx - \int_{x_+}^{1} (x_+ - x)\,dx = 0 $$

and the analogous expression with weight $(x - x_-)$ also vanishes.

Directly evaluating the resulting integrals we see that this is achieved when $x_- = -1/2$ and $x_+ = 1/2$. Q.E.D.
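Before moving on, a numerical spot check of the theorem for the concrete choice $g(x) = e^x$ (a sketch; the perturbation sizes are arbitrary): the interpolant at $x = \pm 1/2$ should beat all nearby lines.

```python
import math

def l1_cost(a, b, n=2000):
    """Midpoint Riemann sum for  ∫_{-1}^{1} |e^x - a*x - b| dx."""
    h = 2.0 / n
    total = 0.0
    for k in range(n):
        x = -1.0 + (k + 0.5) * h
        total += abs(math.exp(x) - a * x - b) * h
    return total

# the line interpolating e^x at x = -1/2 and x = 1/2
a_star = math.exp(0.5) - math.exp(-0.5)        # chord slope over an interval of length 1
b_star = (math.exp(0.5) + math.exp(-0.5)) / 2  # its value at the midpoint x = 0
c_star = l1_cost(a_star, b_star)

# every nearby perturbed line costs at least as much
for da in (-0.05, 0.0, 0.05):
    for db in (-0.05, 0.0, 0.05):
        assert l1_cost(a_star + da, b_star + db) >= c_star
```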

A second proof which involves much less computation can be found using the intuition we derived from the two above pictures and some knowledge of the philosophy of Lebesgue integration.

*Proof 2:* From the discussion above we know to expect there to be exactly two intersections $x_- < x_+$ between $\ell$ and $g$, and by convexity we know that $g \le \ell$ on $[x_-, x_+]$ and $g \ge \ell$ outside it. Now consider the cost function for two different lines $\ell$ and $\ell + \epsilon$ differing by a small vertical translation.

Taking the vertical difference $\epsilon$ to be infinitesimal, we see that the variation is precisely

$$ \delta C = (x_+ - x_-) - (1 - x_+) - (x_- + 1). $$

So for $\delta C$ to be zero, it is necessary that $x_+ - x_-$ is exactly 1; or that the function $g$ spend exactly half the time above $\ell$, and half the time below.

Using again that $g$ is convex, we see that having fixed $x_+ - x_- = 1$, to fix the “position” of this interval it remains to fix the slope. The analogous picture to the one above for slope variation is as follows.

Rotating a line about a center $x_0$ with a small angle incurs a variation of the cost

$$ \delta C = \int_{x_-}^{x_+} (x - x_0)\,dx - \int_{-1}^{x_-} (x - x_0)\,dx - \int_{x_+}^{1} (x - x_0)\,dx. $$

This needs to vanish for every $x_0$ at the critical parameter, and in particular must hold for $x_0 = (x_- + x_+)/2$. By symmetry we see that the contribution of the integral between $x_-$ and $x_+$ always vanishes for this point. But the integrals in the two external regions only cancel if $x_+ = -x_-$, which now implies the desired result $x_\pm = \pm 1/2$. Q.E.D.
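The two criticality conditions in this proof can be checked with exact arithmetic at the claimed minimizer (a small sketch using Python's `fractions`):

```python
from fractions import Fraction as F

x_minus, x_plus = F(-1, 2), F(1, 2)

# translation variation: (x_+ - x_-) - (1 - x_+) - (x_- + 1) must vanish
assert (x_plus - x_minus) - (1 - x_plus) - (x_minus + 1) == 0

def moment(lo, hi):
    """Exact first moment  ∫_lo^hi x dx = (hi^2 - lo^2) / 2  about x_0 = 0."""
    return (hi * hi - lo * lo) / 2

# rotation about x_0 = (x_- + x_+)/2 = 0: the middle region vanishes by
# symmetry, and the two external regions must cancel each other
assert moment(F(-1), x_minus) + moment(x_plus, F(1)) == 0
```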

---

#### Comments

You have written “Taking the vertical difference to be infinitesimal, we see that the variation $\delta C$ is precisely $\delta C = (x_+ - x_-) - (1 - x_+) - (x_- - 1)$.”

I think this contains a sign error in the final term: the variation should be $\delta C = (x_+ - x_-) - (1 - x_+) - (x_- + 1)$.

Will fix it. Thanks!