An optimization problem: theme
by Willie Wong
Let’s start simple:
Question 1: What is the linear function $\ell(x) = ax + b$ that minimizes the integral
$\int_{-1}^{1} |x^2 - ax - b| \,dx$?
In other words, what is the best linear approximation of $x^2$ in the $L^1$ sense?
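(Before doing anything by hand, one can already get a feel for the answer numerically. The brute-force sketch below is my own illustration, not part of the original post, and assumes numpy is available.)

```python
# Brute-force sanity check for Question 1 (illustration only, not the original argument).
import numpy as np

xs = np.linspace(-1.0, 1.0, 4001)          # sample points on [-1, 1]

def cost(a, b):
    # crude Riemann approximation of the integral of |x^2 - a x - b| over [-1, 1]
    return np.mean(np.abs(xs**2 - a*xs - b)) * 2.0

A = np.linspace(-1.0, 1.0, 201)            # grid of slopes
B = np.linspace(-1.0, 1.0, 201)            # grid of intercepts
best = min((cost(a, b), a, b) for a in A for b in B)
print(best)   # expect roughly (0.5, 0.0, 0.25): the constant line y = 1/4
```

On this grid the minimum should sit near the constant line $y = \tfrac14$ with cost about $\tfrac12$, consistent with the theorem proved later in the post.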
This is something that can in principle be solved by a high schooler with some calculus training. Here’s how one solution may go:
Solution: All linear functions take the form $\ell(x) = ax + b$. The integrand is equal to $x^2 - ax - b$ when $x^2 \geq ax + b$, and $ax + b - x^2$ otherwise. So we need to find the points of intersection. This requires solving $x^2 = ax + b$, which we can do by the quadratic formula. In the case where $x^2 - ax - b$ is signed on all of $[-1,1]$, we see that by changing $b$ we can make the integrand strictly smaller, and hence we cannot attain a minimizer. So we know that at the minimizer there must exist at least one root.
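(To spell out the claim about changing $b$ with one line of arithmetic, which the post leaves implicit: if, say, $x^2 - ax - b > 0$ on all of $[-1,1]$, then for small enough $\epsilon > 0$ the integrand remains positive after replacing $b$ by $b + \epsilon$, and
$\int_{-1}^{1} \bigl( x^2 - ax - (b + \epsilon) \bigr)\,dx = \int_{-1}^{1} (x^2 - ax - b)\,dx - 2\epsilon,$
so the cost strictly decreases. The case of a negative integrand is symmetric: decrease $b$ instead.)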
Consider the case where there is only one root in the interval (counted with multiplicity), and call the root $x_0$. We have that the integral to minimize is equal to
$\pm\left( \int_{-1}^{x_0} (x^2 - ax - b)\,dx - \int_{x_0}^{1} (x^2 - ax - b)\,dx \right),$
each part of which can be computed explicitly to obtain
$\pm\left( \frac{x_0^3 + 1}{3} - \frac{a(x_0^2 - 1)}{2} - b(x_0 + 1) \right) \mp \left( \frac{1 - x_0^3}{3} - \frac{a(1 - x_0^2)}{2} - b(1 - x_0) \right).$
Since we know that the two terms come from integrands with different signs, we can combine to get
$\pm\left( \frac{2x_0^3}{3} + a(1 - x_0^2) - 2bx_0 \right)$
as the quantity to minimize. Now, we cannot just take the partial derivatives of the above expression with respect to $a$ and $b$ and set them to zero and see what we get: the root $x_0$ depends also on the parameters. So what we would do then is to plug in $x_0$ using the expression derived from the quadratic formula, $x_0 = \frac{a \pm \sqrt{a^2 + 4b}}{2}$, and then take the partial derivatives. Before that, though, we can simplify a little bit: since $x_0^2 = ax_0 + b$ from the constraint, the quantity to minimize is now
$\pm\left( a - \frac{x_0^3}{3} - bx_0 \right).$
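(The algebra behind that last simplification is short; here is one way it can go, reconstructed by me since the intermediate steps are not shown. From $x_0^2 = ax_0 + b$ we get $a x_0^2 = x_0^3 - b x_0$, and therefore
$\frac{2x_0^3}{3} + a(1 - x_0^2) - 2bx_0 = \frac{2x_0^3}{3} + a - (x_0^3 - b x_0) - 2bx_0 = a - \frac{x_0^3}{3} - bx_0,$
up to the overall sign.)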
A long computation taking the partial derivatives in $a$ and $b$ (remembering that $x_0$ depends on them) now shows that necessarily $a^2 + 4b = 0$, which implies that $x_0 = \frac{a}{2}$ is in fact a double root. But for the range of parameters where there is only one root in the interval, the quantity does not achieve a minimum. (The formal minimizer happens at the tangent configuration $b = -\frac{a^2}{4}$, but we see that for this case the integrand of the original cost function is signed.)
So we are down to the case where there are two roots in the interval. Now we call the roots $x_-$ and $x_+$ (with $x_- < x_+$), and split the integral into
$\int_{-1}^{x_-} (x^2 - ax - b)\,dx - \int_{x_-}^{x_+} (x^2 - ax - b)\,dx + \int_{x_+}^{1} (x^2 - ax - b)\,dx,$
and proceed as before. (The remainder of the proof is omitted; the reader is encouraged to carry out this computation by hand to see how tedious it is.)
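(One way to avoid the tedium is to hand the two-root computation to a computer algebra system. The sketch below is my own check, not part of the post; it assumes sympy is installed, and it only verifies that $(a, b) = (0, \tfrac14)$ is a critical point of the two-root cost and evaluates the cost there.)

```python
# Symbolic check of the two-root case of Question 1 (illustrative sketch).
import sympy as sp

a, b, x = sp.symbols('a b x', real=True)
s = sp.sqrt(a**2 + 4*b)                  # assumes two distinct real roots, a^2 + 4b > 0
xm, xp = (a - s)/2, (a + s)/2            # the roots of x^2 = a x + b

p = x**2 - a*x - b
# |p| integrates with the sign pattern +, -, + on [-1, xm], [xm, xp], [xp, 1]:
C = (sp.integrate(p, (x, -1, xm))
     - sp.integrate(p, (x, xm, xp))
     + sp.integrate(p, (x, xp, 1)))

crit = {a: 0, b: sp.Rational(1, 4)}      # candidate minimizer: the line y = 1/4
print(sp.simplify(sp.diff(C, a).subs(crit)))   # 0
print(sp.simplify(sp.diff(C, b).subs(crit)))   # 0
print(sp.simplify(C.subs(crit)))               # 1/2
```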
Suppose now, after this long computation, you are given the question:
Question 2: What is the linear function $\ell(x) = ax + b$ that minimizes the integral
$\int_{-1}^{1} |x^4 - ax - b| \,dx$?
You say: can't we just do what we did before? And then you realize that there is a problem.
In the “solution” to Question 1 given above, crucially we were able to have closed-form solutions for the intersection points via the quadratic formula. To replicate the same proof we would need closed-form solutions to $x^4 = ax + b$ as a function of $a$ and $b$. This is not easy to do.
At this point, however, we realize something important:
The formula $\ell(x) = ax + b$, aka the slope-intercept form, is not the only representation of linear functions.
In particular, we saw in the “solution” to Question 1 above that it is important to consider the roots. Can we parametrize the functions in terms of the roots?
The answer, it turns out, is yes. And it depends on convexity.
First, observe that whenever $f$ is a convex function (in our context we take $f$ to be twice continuously differentiable with positive second derivative), there can be at most 2 roots to the equation $f(x) = \ell(x)$, where $\ell$ is linear. (Suppose there are three or more roots; then between two of the roots the function $f - \ell$ must have an interior local maximum. But we know that at local maxima the second derivative is non-positive, whereas the second derivative $(f - \ell)'' = f'' > 0$ by assumption.)
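(The same conclusion can also be reached without mentioning local maxima, by a standard Rolle's theorem argument which I add here for completeness: writing $g = f - \ell$, we have $g'' = f'' > 0$, and
$g(x_1) = g(x_2) = g(x_3) = 0 \;\Rightarrow\; g'(\xi_1) = g'(\xi_2) = 0 \;\Rightarrow\; g''(\eta) = 0$
for some intermediate points $\xi_1 < \xi_2$ and $\eta$, which is a contradiction.)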
I will rule out the cases of zero or one root by pictures.

An approximant with no crossings (solid green and blue) can always be improved (dashed green and blue) by changing the intercept.

A single crossing approximant (solid blue) can always be improved (dashed blue) by changing the slope.
Therefore whenever the function we want to approximate is convex (or concave, for that matter), our approximating linear function must intersect it exactly twice on the interval. Now, a linear function is precisely described by its values at two points. So instead of using the slope-intercept form, it is better to use a point-slope form
$\ell(x) = f(x_-) + \frac{f(x_+) - f(x_-)}{x_+ - x_-}\,(x - x_-),$
parametrizing the linear approximation by its two intersections $x_- < x_+$ with our function.
In other words, rather than computing the intersections from the parameters $a, b$, we use the explicit expression of the coefficients $a, b$ of the slope-intercept form as functions of the intersections $x_-, x_+$. And in this form we can prove a more general theorem.
Theorem. Let $f$ be a strictly convex function on $[-1, 1]$. Then the minimizer of the cost function
$C(\ell) = \int_{-1}^{1} |f(x) - \ell(x)| \,dx$
among linear functions $\ell$ is achieved by the secant line of $f$ through $x = -\frac12$ and $x = \frac12$; that is, by the unique linear function satisfying $\ell(\pm\frac12) = f(\pm\frac12)$.
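(Before the proofs, a quick numerical plausibility check, which is my own addition and not part of the post: take $f = \exp$, which is strictly convex, and compare the secant line through $x = \pm\tfrac12$ against a coarse grid of competing lines. The sketch assumes numpy is available.)

```python
# Numerical plausibility check of the theorem for f(x) = exp(x) (illustration only).
import numpy as np

xs = np.linspace(-1.0, 1.0, 4001)
f = np.exp(xs)

def cost(a, b):
    # approximate L^1 distance between f and the line a*x + b on [-1, 1]
    return np.mean(np.abs(f - (a*xs + b))) * 2.0

# Secant line of f through x = -1/2 and x = 1/2, as the theorem predicts:
a_star = np.exp(0.5) - np.exp(-0.5)
b_star = (np.exp(0.5) + np.exp(-0.5)) / 2.0

grid = np.linspace(-3.0, 3.0, 201)
best_on_grid = min(cost(a, b) for a in grid for b in grid)
# the first value should not exceed the second (up to discretization error)
print(cost(a_star, b_star), best_on_grid)
```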
I will give two proofs. The first is the computational one.
Proof 1: Since $f$ is convex, we parametrize the space of linear functions (having two intersections with $f$ in the interval $[-1,1]$) as described above, with $x_- < x_+$ the intersection points. Then the sign changes of the integrand imply that the cost function can be expressed as
$C = \int_{-1}^{x_-} (f - \ell)\,dx - \int_{x_-}^{x_+} (f - \ell)\,dx + \int_{x_+}^{1} (f - \ell)\,dx.$
Taking the derivative relative to $x_+$ and using the fundamental theorem of calculus we have
$\partial_{x_+} C = -\int_{-1}^{x_-} \partial_{x_+}\ell \,dx + \int_{x_-}^{x_+} \partial_{x_+}\ell \,dx - \int_{x_+}^{1} \partial_{x_+}\ell \,dx,$
and a similar one for $x_-$. Here we see why it is so nice to use the parametrization we chose: the “boundary terms” that appear under the fundamental theorem of calculus evaluate to exactly 0 by construction, since $f(x_\pm) = \ell(x_\pm)$.
Computing the derivative $\partial_{x_+}\ell$ gives us
$\partial_{x_+}\ell(x) = \left[ \frac{f'(x_+)}{x_+ - x_-} - \frac{f(x_+) - f(x_-)}{(x_+ - x_-)^2} \right] (x - x_-),$
which, plugged into the previous display, simplifies to
$\partial_{x_+} C = \left[ \frac{f(x_+) - f(x_-)}{(x_+ - x_-)^2} - \frac{f'(x_+)}{x_+ - x_-} \right] \left( \int_{-1}^{x_-} (x - x_-)\,dx - \int_{x_-}^{x_+} (x - x_-)\,dx + \int_{x_+}^{1} (x - x_-)\,dx \right),$
and similarly
$\partial_{x_-} C = \left[ \frac{f'(x_-)}{x_+ - x_-} - \frac{f(x_+) - f(x_-)}{(x_+ - x_-)^2} \right] \left( \int_{-1}^{x_-} (x - x_+)\,dx - \int_{x_-}^{x_+} (x - x_+)\,dx + \int_{x_+}^{1} (x - x_+)\,dx \right).$
By strict convexity, the terms in the brackets involving $f$ are non-zero constants. So we see that the problem reduces to choosing $x_\pm$ such that
$\int_{-1}^{x_-} (x - c)\,dx - \int_{x_-}^{x_+} (x - c)\,dx + \int_{x_+}^{1} (x - c)\,dx = 0 \quad \text{for both } c = x_- \text{ and } c = x_+.$
Directly evaluating the resulting integrals we see that this is achieved when $x_- = -\frac12$ and $x_+ = \frac12$. Q.E.D.
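(If you would rather not evaluate those last integrals by hand, here is a small sympy sketch, my addition, that solves the two stationarity conditions directly.)

```python
# Solve the two stationarity conditions from Proof 1 (illustrative sketch).
import sympy as sp

x, xm, xp = sp.symbols('x xm xp', real=True)   # xm, xp stand for x_-, x_+

def I(c):
    # the bracketed combination of integrals with weight (x - c), sign pattern +, -, +
    return (sp.integrate(x - c, (x, -1, xm))
            - sp.integrate(x - c, (x, xm, xp))
            + sp.integrate(x - c, (x, xp, 1)))

print(sp.solve([sp.Eq(I(xm), 0), sp.Eq(I(xp), 0)], [xm, xp]))
# Expected output: the pair (xm, xp) = (-1/2, 1/2), together with the degenerate
# solution (0, 0), in which the two intersections coincide and which must be discarded.
```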
A second proof, which involves much less computation, can be found using the intuition we derived from the two pictures above and some knowledge of the philosophy of Lebesgue integration.
Proof 2: From the discussion above we know to expect there to be exactly two intersections between $f$ and the minimizing line $\ell$, and by convexity we know that $f \leq \ell$ between the two intersections $x_- < x_+$ and $f \geq \ell$ outside them. Now consider the cost function for two different lines $\ell$ and $\ell'$ differing by a small vertical translation.

The difference in cost function between the blue and green lines is the signed area of the shaded region, where the sign is determined by whether the point is above or below the red curve.
Taking the vertical difference to be infinitesimal, we see that the variation $\delta C$ is precisely
$\delta C = (x_+ - x_-) - (1 - x_+) - (x_- + 1).$
So for $\delta C$ to be zero, it is necessary that $x_+ - x_-$ is exactly 1; or, in other words, that the function $f$ spends exactly half the time above $\ell$, and half the time below.
Using again that $f$ is convex, we see that having fixed the length $x_+ - x_-$ of this interval, to fix its “position” it remains to fix the slope. The analogous picture to the one above, for slope variation, is as follows.

The variation in the cost function for the blue and green lines, which differ by a change of slope (rotation around the intersection point), is the signed area of the shaded region, where the sign is determined by whether the red curve is above or below it.
Rotating a line about a center $x_0$ (that is, about the point $(x_0, \ell(x_0))$ on the line) through a small angle incurs a variation of the cost proportional to
$\int_{x_-}^{x_+} (x - x_0)\,dx - \int_{-1}^{x_-} (x - x_0)\,dx - \int_{x_+}^{1} (x - x_0)\,dx.$
This needs to vanish for every $x_0$ at the critical parameter, and in particular must hold for $x_0 = \frac{x_- + x_+}{2}$. By symmetry we see that the contribution of the integral between $x_-$ and $x_+$ always vanishes for this choice of $x_0$. But the integrals over the two external regions only cancel if $x_- + x_+ = 0$, which now implies the desired result. Q.E.D.
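(As a sanity check, my remark rather than the post's: the theorem recovers the answer to Question 1. Interpolating $f(x) = x^2$ at $x = \pm\tfrac12$ gives
$\ell(x) = \bigl( f(\tfrac12) - f(-\tfrac12) \bigr)\, x + \frac{f(\tfrac12) + f(-\tfrac12)}{2} = 0 \cdot x + \frac14,$
so the best $L^1$ linear approximation of $x^2$ on $[-1,1]$ is the constant $\tfrac14$, with cost $\int_{-1}^{1} |x^2 - \tfrac14|\,dx = \tfrac12$.)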
You have written “Taking the vertical difference to be infinitesimal, we see that the variation $\delta C$ is precisely $\delta C = (x_+ - x_-) - (1 - x_+) - (x_- - 1)$”.
I think this contains a sign error in the final term: the variance should be $\delta C = (x_+ - x_-) - (1 - x_+) - (x_- + 1)$.
I mean the *variation*, not the variance, of course.
Will fix it. Thanks!