A topic so cool it doesn't need a catchy title.

*This is the live blog of Lecture 11 of my graduate class “Convex Optimization.” A Table of Contents is here.*

The most underappreciated part of Boyd & Vandenberghe is the endless trove of examples. Three chapters of the book are dedicated to modeling examples, and we’re now moving into the part of the course where we apply the theory to modeling.

We start with a topic dear to my heart: *linear inverse problems*. I’m casting a wide net with this term here. Let’s skip the linear part for a second. What do I mean by an inverse problem? *Imaging* is perhaps the easiest problem to describe. We imagine some 2D or 3D representation of the world we’d like to capture through a measurement device and want to use what we know about our measurements to build the image.

In a CT scan, we send lines of X-rays through your body and then use a computer to decode an image of your insides. MRI is similar, except it uses complex measurements of the magnetic fields induced by molecules in your body. Neither of these modalities takes “images” in the way we think a camera does, yet we can use algorithms to decode what’s inside. These are canonical inverse problems. We measure the state of the world through some process, and we’d like to determine the state from our measurements.

Even cameras solve inverse problems these days. Your phone can take multiple blurry snapshots and use algorithms to combine them into a single deblurred image. Similarly, multiple exposures can be combined to yield a high-dynamic-range image when the lighting is unfavorable.

All of these modalities combine a *forward model*, which simulates the physics of what happens between the world and the imaging sensor, and an *image model*, which builds in what we think should be true about our final image. Algorithms then combine the two to produce a final reconstruction.

Surprisingly, all of the examples above can be computed using linear models. The mapping from the world to the measurement is a linear function: if you “add stuff,” the resulting measurement is the sum of the individual measurements. For example, the measured output in a CT scan is the fraction of X-rays that made it through the body. The logarithm of this fraction decreases linearly in the amount of absorbing material between the X-ray emitter and detector. An image blur is a linear operation that adds together multiple shifted views of a scene. A linear inverse problem is one where the mapping from reality to our measurement can be modeled as a linear transformation.
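To see the linearity concretely, here's a tiny numpy sketch of a one-dimensional blur. The kernel and signals are made up for illustration; the point is only that convolution obeys the “add stuff, add measurements” property.

```python
import numpy as np

# Toy illustration (my own made-up kernel): a 1D "blur" that averages
# each sample with its neighbors. Convolution is a linear map, which is
# exactly the property described above: blur the sum, sum the blurs.
def blur(x):
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(x, kernel, mode="same")

rng = np.random.default_rng(0)
a = rng.standard_normal(8)
b = rng.standard_normal(8)

# Linearity: adding scenes adds their blurred measurements.
assert np.allclose(blur(a + b), blur(a) + blur(b))
assert np.allclose(blur(3.0 * a), 3.0 * blur(a))
```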

There are surprisingly many problems that can be posed as linear inverse problems. A few examples include finding a model of a dynamical system that predicts time series data, estimating biomarkers relevant to particular disease outcomes, and finding the delays in GPS signals so that you can triangulate and find your location. It’s popular to use linear maps to model your content preferences based on your past consumption behavior and all of the consumption behavior of everyone else on the internet. I could do a whole blog series on inverse problems and their applications. For this course, we’ll settle with one post.

If we have a linear model, we are saying

`measurement = forward model ( image ) + noise`

where the “forward model” function is a linear mapping. Our goal is to figure out what the image is by removing the noise and inverting the forward model. The problem here is that there is an infinite set of images and noises that produce the same measurement. What do you attribute to noise? What do you attribute to signal? The answer is to come up with some cost function that balances attribution. The general linear inverse problem is then given by an optimization problem of the form

```
minimize (1-h) * plausibility(noise) + h * plausibility(image)
subject to measurement = forward model (image) + noise
```
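As a minimal sketch, here is the special case where both plausibility functions are sums of squares. Substituting the constraint into the objective eliminates the noise variable and leaves a ridge-regression problem with a closed-form solution. All of the data below is synthetic and just for illustration.

```python
import numpy as np

# Hedged sketch (synthetic data): take both plausibility functions to
# be sums of squares. Substituting noise = y - A @ x into the
# objective gives
#     minimize (1-h) * ||y - A x||^2 + h * ||x||^2,
# whose minimizer solves a single linear system.
def solve_quadratic_inverse(A, y, h):
    n = A.shape[1]
    return np.linalg.solve((1 - h) * A.T @ A + h * np.eye(n),
                           (1 - h) * A.T @ y)

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))        # toy forward model
x_true = rng.standard_normal(5)         # toy "image"
y = A @ x_true + 0.01 * rng.standard_normal(20)

x_hat = solve_quadratic_inverse(A, y, h=0.01)
```

With h near zero, the solution approaches plain least squares; pushing h toward one shrinks the reconstruction toward zero, attributing more of the measurement to noise.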

where h is a constant between 0 and 1. The plausibility functions are different for the noise and the image and are part of the modeling process. You want to pick functions that are easy to optimize (hint: convex) and large for implausible realizations of the signals you care about. If you were Bayesian, you might call these *priors* about the signals in your problem. But prior is a loaded word in Bayesian land, implying things about probabilities and whatnot. Today, I’m going to stick with calling them plausibility functions to avoid statistical mess.

The most common plausibility functions we use date back to Gauss. One example is the mean-sum-of-squares, which captures some notion of a signal’s variability. Another plausibility penalty would be the sum of absolute deviations. This plausibility function is useful when you expect the noise process to be bursty or sparse.
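A quick illustration of the difference, using the simplest possible forward model I can think of (estimating a single number from repeated measurements, with made-up values): the sum of squares is minimized by the mean, the sum of absolute deviations by the median.

```python
import numpy as np

# Toy illustration (made-up numbers): estimating one number x from
# measurements y_i = x + noise_i. The sum-of-squares penalty on the
# residuals is minimized by the mean; the sum of absolute
# deviations, by the median.
y = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 50.0])  # last entry is a burst

mean_estimate = np.mean(y)      # argmin of sum((y - x)**2)
median_estimate = np.median(y)  # argmin of sum(abs(y - x))
```

The burst drags the mean up to about 9, while the median stays near 1. This robustness to occasional large errors is why the absolute-deviation penalty suits bursty noise.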

Similarly, different convex plausibility functions encourage particular image structures. Sparse images are encouraged by penalizing the sum of absolute values. Piecewise constant images are encouraged by penalizing the sum of absolute values of the derivative. Smooth images are encouraged by penalizing the sum of the squares of the second derivative.
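In numpy, these three penalties are one-liners. Here's a toy 1D comparison, with signals I made up for illustration and finite differences standing in for derivatives.

```python
import numpy as np

# The three penalties on toy 1D "images" (my own made-up signals).
# np.diff(x, n=k) takes k-th order differences, a discrete stand-in
# for the derivatives in the text.
def sparsity_penalty(x):    # encourages sparse signals
    return np.sum(np.abs(x))

def tv_penalty(x):          # encourages piecewise constant signals
    return np.sum(np.abs(np.diff(x)))

def smoothness_penalty(x):  # encourages smooth signals
    return np.sum(np.diff(x, n=2) ** 2)

step = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0])
ramp = np.linspace(0.0, 1.0, 9)

# The step is cheap under the piecewise-constant penalty but pays for
# its jumps under the smoothness penalty; the ramp is the other way.
assert tv_penalty(ramp) < tv_penalty(step)
assert smoothness_penalty(ramp) < smoothness_penalty(step)
```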

The modeling choices here are intimidating, as it’s not clear which plausibility penalties you should pick. But there’s a rule of thumb that I’ve found helpful in linear inverse problems (and which I may have written a dozen or so papers about). The plausibility function is trying to encourage your algorithm to choose a simple signal. One notion of simple is “the signal can be written as a short sum of basic signals.” For example, if you assume your signal is a spike train, then it can be written as a sum of spikes. Sparse images are a sum of a few basic point sources. A low-rank matrix is a sum of a few rank-one matrices. In all of these examples, the assumption is that the signal is a sum of a few *atoms*. If you know the atoms, a reasonable plausibility function penalizes the coefficients needed to write a signal as a sum of atoms. In math, for those who like equations:

`plausibility(signal) = minimum of c1 + c2 + ... over all ways of writing signal = c1 * atom1 + c2 * atom2 + ... with each ci ≥ 0`

When the atoms are unit vectors, this is the Euclidean norm (root mean squared error). When the atoms are standard basis vectors, this is the sum of absolute values (l1-norm). There are tons of other options. No matter which atomic set you choose, this plausibility function is convex. The general linear inverse problem is thus posed as a convex optimization problem. This gives you a general cookbook for posing and solving inverse problems.
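For the basis-vector atomic set, you can even compute this plausibility function numerically with a small linear program and check that it matches the l1 norm. Here's a sketch using scipy; the atom set (signed standard basis vectors) and the test vector are my own toy choices.

```python
import numpy as np
from scipy.optimize import linprog

# Sanity check by linear programming (toy example): with the signed
# standard basis vectors +/- e_i as atoms, the minimal total
# coefficient mass needed to write x as a nonnegative combination of
# atoms should equal the l1 norm, as claimed above.
def atomic_norm_basis(x):
    n = len(x)
    atoms = np.hstack([np.eye(n), -np.eye(n)])   # columns are the atoms
    res = linprog(c=np.ones(2 * n),              # minimize sum of coefficients
                  A_eq=atoms, b_eq=x,            # write x as a sum of atoms
                  bounds=(0, None))              # coefficients must be >= 0
    return res.fun

x = np.array([1.0, -2.0, 0.0, 0.5])
assert np.isclose(atomic_norm_basis(x), np.sum(np.abs(x)))  # both equal 3.5
```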

Now what about that constant h? It is called a *hyperparameter* and is part of your convex model. Sometimes you know what that parameter should be, but other times you need to figure it out experimentally. But remember, this whole model is made up, so don’t fret if there are parts you can’t specify by pure reason alone.
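A minimal sketch of the experimental route, assuming sum-of-squares plausibilities and fully synthetic data: solve the inverse problem for each h on a grid, then keep the h that best predicts held-out measurements.

```python
import numpy as np

# Hedged sketch of picking h experimentally (all data synthetic).
# Closed-form solution when both plausibility functions are sums of
# squares, as a function of the hyperparameter h.
def solve(A, y, h):
    n = A.shape[1]
    return np.linalg.solve((1 - h) * A.T @ A + h * np.eye(n),
                           (1 - h) * A.T @ y)

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 5))
x_true = rng.standard_normal(5)
y = A @ x_true + 0.1 * rng.standard_normal(30)

A_fit, y_fit = A[:20], y[:20]   # used to solve the inverse problem
A_val, y_val = A[20:], y[20:]   # held out to score each candidate h

grid = [0.001, 0.01, 0.1, 0.5, 0.9]
best_h = min(grid, key=lambda h:
             np.linalg.norm(A_val @ solve(A_fit, y_fit, h) - y_val))
```

Grid search like this is crude but honest: since the model is made up anyway, picking h by how well it predicts data you didn't use is often the most defensible choice.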


By Ben Recht