## Tomography of the magnetic fields of the Milky Way?

I have just found this “new” (well, 150 years old actually) tomographic method for measuring the magnetic field of our own galaxy:

New all-sky map shows the magnetic fields of the Milky Way with the highest precision
by Niels Oppermann et al. (arxiv work available here)

Selected excerpt:

“… One way to measure cosmic magnetic fields, which has been known for over 150 years, makes use of an effect known as Faraday rotation. When polarized light passes through a magnetized medium, the plane of polarization rotates. The amount of rotation depends, among other things, on the strength and direction of the magnetic field. Therefore, observing such rotation allows one to investigate the properties of the intervening magnetic fields.”

Mmmm… very interesting, at least for my personal knowledge of the wonderful tomographic problem zoo (alongside gravitational lensing, interferometry, MRI, and deflectometry).

P.S. Wow… 16 months without any post here. I’m really bad.

Posted in General | 1 Comment

Recently, a friend of mine asked me a few questions about Noiselets for Compressed Sensing applications, i.e., in order to create efficient sensing matrices incoherent with signals that are sparse in the Haar/Daubechies wavelet basis. It seems that some of the answers are difficult to find on the web (though I’m sure they are well known to specialists), so I have decided to share the ones I got.

Context:
I wrote in 2008 a tiny Matlab toolbox (see here) to convince myself that the noiselet transform admits a Cooley-Tukey implementation, as already used for the Walsh-Hadamard transform. It should have been optimized in C, but I lacked the time to write this. Since this first code, I realized that Justin Romberg and Peter Stobbe had already written in 2006 a fast code (also $O(N \log N)$, but much faster than mine), available here:

People could be interested in using Justin’s code since, as will be clarified by my answers below, it is already adapted to real-valued signals, i.e., it produces real-valued noiselet coefficients.

Q1. Do we need to use both the real and imaginary parts of noiselets to design sensing matrices (i.e., building the $\Phi$ matrix)? Can we use just the real part or just the imaginary part? Any reason why you’d do one thing or another?

As for Random Fourier Ensemble sensing, what I personally do when I use noiselet sensing is to pick uniformly at random $M/2$ complex values in half the noiselet-frequency domain, and concatenate the real and imaginary parts into a real vector of length $M=2\times M/2$. The adjoint (transposed) operation, needed in most Compressed Sensing solvers, must of course recombine the previously split real and imaginary parts into $M/2$ complex values before padding the complementary measured domain with zeros and running the inverse noiselet transform.
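To make this concrete, here is a minimal NumPy sketch of this real/imaginary splitting and of its adjoint. Since the point is only the splitting mechanics, I use the unitary FFT as a stand-in for the noiselet transform (both satisfy the conjugate-symmetry property for real signals discussed below); the function names and normalizations are my own choices, not those of any released toolbox.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 64, 16                  # signal length, number of real measurements

# Stand-in for the (unitary) noiselet transform: the unitary FFT, which
# shares the conjugate-symmetry property for real inputs used here.
transform = lambda x: np.fft.fft(x) / np.sqrt(N)
inv_transform = lambda X: np.fft.ifft(X) * np.sqrt(N)

# Pick M/2 "frequencies" uniformly at random in half the domain
# (avoiding the self-conjugate bins 0 and N/2).
omega = rng.choice(np.arange(1, N // 2), size=M // 2, replace=False)

def forward(x):
    """Phi x: split M/2 complex coefficients into M real measurements."""
    c = transform(x)[omega]
    return np.sqrt(2) * np.concatenate([c.real, c.imag])

def adjoint(y):
    """Phi^T y: recombine real/imag parts into M/2 complex values, zero-pad
    the complementary domain, and apply the inverse transform."""
    c = (y[:M // 2] + 1j * y[M // 2:]) / np.sqrt(2)
    C = np.zeros(N, dtype=complex)
    C[omega] = c
    C[(-omega) % N] = np.conj(c)   # conjugate symmetry of a real signal
    return inv_transform(C).real
```

With this normalization $\Phi$ has orthonormal rows, so that forward(adjoint(y)) gives back y.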

To understand this special treatment of the real and imaginary parts (and why it is not simply identical to what is done for the Random Fourier Ensemble), let us go back to the origin, that is, the Noiselets paper of Coifman et al.

Recall that in this paper, two kinds of noiselets are defined. The first basis, the common Noiselets basis on the interval $[0, 1]$, is defined thanks to the recursive formulas, starting from $f_1 = \chi_{[0,1)}$, the indicator function of $[0,1)$:

$f_{2n}(x) = (1-i)\,f_n(2x) + (1+i)\,f_n(2x-1),$

$f_{2n+1}(x) = (1+i)\,f_n(2x) + (1-i)\,f_n(2x-1).$

The second basis, or Dragon Noiselets, is slightly different. Its elements are symmetric under the change of coordinates $x \to 1-x$. Their recursive definition is

To be more precise, the two sets

$\{f_j: 2^N\leq j < 2^{N+1}\}$,

$\{g_j: 2^N\leq j < 2^{N+1}\}$,

are orthonormal bases for piecewise constant functions at resolution $2^N$, that is, for functions of

$V_N = \{h(x): h\ \textrm{is constant over each}\big[2^{-N}k,2^{-N}(k+1)\big), 0\leq k < 2^N \}.$

In the Coifman et al. paper, the recursive definition of Eq. (2) (and also Eq. (4) for Dragon Noiselets), which connects the noiselet functions of index $n$ to those of indices $2n$ and $2n+1$, is simply a common butterfly diagram that underlies the Cooley-Tukey implementation of the Noiselet transform.

The coefficients involved in Eqs. (2) and (4) are simply $1 \pm i$, which are of course complex conjugates of each other.

Therefore, in the Noiselet transform $\hat X$ of a real vector $X$ of length $2^N$ (in one-to-one correspondence with the piecewise constant functions of $V_N$), involving the noiselets of indices $n \in \{2^N, ..., 2^{N+1}-1\}$, the resulting decomposition diagram is fully symmetric (up to complex conjugation) under a flip of indices $k \leftrightarrow k' = 2^N - 1 - k$, for $k = 0,\, \cdots, 2^N -1$.

This shows that

$\hat X_k = \hat X^*_{k'}$,

with ${(\cdot)}^*$ the complex conjugation, if $X$ is real. This allows us to define a “Real Random Noiselet Ensemble” by picking uniformly at random $M/2$ complex values in the half domain $k = 0,\,\cdots, 2^{N-1}-1$, that is, $M$ independent real values in total, obtained by concatenating the real and the imaginary parts (see above).
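This symmetry is easy to check numerically. Below is a small NumPy sketch of my own, using the recursion with the $1 \pm i$ coefficients recalled above and a raw, unnormalized scaling, that builds the $2^N$ noiselets sampled on the dyadic intervals and verifies the conjugate-symmetric flip for a real input.

```python
import numpy as np

def noiselet_matrix(N):
    """Rows are the noiselets f_j, 2^N <= j < 2^(N+1), sampled on the 2^N
    dyadic intervals of [0,1). A parent row v (for f_n) generates the two
    children [(1-i)v, (1+i)v] (f_{2n}) and [(1+i)v, (1-i)v] (f_{2n+1})."""
    F = np.ones((1, 1), dtype=complex)                # f_1 = indicator of [0,1)
    for _ in range(N):
        even = np.hstack([(1 - 1j) * F, (1 + 1j) * F])   # rows f_{2n}
        odd = np.hstack([(1 + 1j) * F, (1 - 1j) * F])    # rows f_{2n+1}
        F = np.empty((2 * F.shape[0], 2 * F.shape[1]), dtype=complex)
        F[0::2], F[1::2] = even, odd                  # keep the index order j
    return F

N = 3
F = noiselet_matrix(N)
x = np.random.default_rng(0).standard_normal(2**N)    # a real signal
Xhat = F.conj() @ x                                   # <x, f_j> for each row

# the flip k <-> 2^N - 1 - k gives the complex-conjugate coefficient
print(np.allclose(Xhat, np.conj(Xhat[::-1])))         # True
```

With this raw scaling the rows satisfy $F F^{\rm H} = 4^N\,{\rm Id}$ (orthogonality), and the entries at level $n$ have amplitude $2^{n/2}$, consistent with the values given in the update of Q2 below.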

Consequently, for real-valued signals, as for Fourier, the two halves of the noiselet spectrum are not independent, and only one half is necessary to perform useful CS measurements.

Justin’s code is close to this interpretation: it uses a real-valued version of the symmetric Dragon Noiselets described in the original Coifman et al. paper.

Q2. Are noiselets always binary? Or do they take the values +1, -1, 0 like Haar wavelets?

Actually, a noiselet of index $j$ takes the complex values $2^{j} (\pm 1 \pm i)$, never $0$. This can be easily seen from the recursive formula of Eq. (2).

Their support is also the whole interval $[0, 1]$.

Update — 26/8/2013:
I was obviously wrong above about the values that noiselets can take (thanks to Kamlesh Pawar for spotting this).

A noiselet’s amplitude can never be zero; however, either the real part or the imaginary part (but not both) can vanish for certain $n$.

So, to be correct, and after a few computations: a noiselet of index $2^n + j$ with $0 \leq j < 2^n$ takes, over the interval $[0,1]$, the complex values $2^{(n-1)/2} (\pm 1 \pm i)$ if $n$ is odd, and $2^{n/2} (\pm 1)$ or $2^{n/2} (\pm i)$ if $n$ is even.

In particular, we see that the amplitude of these noiselets is always $2^{n/2}$ for the considered indices.

Q3. Walsh functions have the property that they are binary and zero mean, so that one half of the values are 1 and the other half are -1.  Is it the same case with the real and/or imag parts of the noiselet transform?

To be precise, Walsh-Hadamard functions have mean equal to 1 if their index is a power of 2 and 0 otherwise, starting with the indicator function of $[0,1]$ at index 1.

Noiselets, on the other hand, all have unit average, which means that their imaginary part has the zero-average property. This can be proved easily (by induction) from their recursive definition in the Coifman et al. paper (Eqs. (2) and (4)). Interestingly, their unit average, that is, their projection on the constant function, shows directly that a constant function is not sparse at all in the noiselet basis, since its “noiselet spectrum” is just flat.

In fact, it is explained in Coifman’s paper that any Haar-Walsh wavelet packets, that is, elementary functions of the form

$W_n(2^q x - k)$

with $W_n$ the Walsh functions (including the Haar functions), have a flat noiselet spectrum (all coefficients of unit amplitude), leading to the well-known good incoherence results (that is, low coherence). As a reminder, the coherence with the Haar wavelet basis is $\sqrt{2}$, and it is slightly higher for the Daubechies wavelets D4 and D8 (about 2.2 and 2.9 respectively); see, e.g., E.J. Candès and M.B. Wakin, “An introduction to compressive sampling”, IEEE Sig. Proc. Mag., 25(2):21–30, 2008.

Q4. How come noiselets require $O(N \log N)$ computations rather than $O(N)$ like the Haar transform?

This is a very common confusion. The difference comes from the locality of the Haar basis elements.

For the Haar transform, you can use the well-known pyramidal algorithm running in $O(N)$ computations. You start from the approximation coefficients computed at the finest scale, then use the wavelet scaling relations to compute the detail and approximation coefficients at the second scale, and so on. Because of the sub-sampling occurring at each scale, the complexity is proportional to the number of coefficients, that is, it is $O(N)$.
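As an illustration, here is a minimal Python sketch of this $O(N)$ pyramidal algorithm (my own code, with the orthonormal normalization); each scale costs half of the previous one, so the total work is $N + N/2 + N/4 + \cdots = O(N)$:

```python
import numpy as np

def haar_transform(x):
    """Pyramidal orthonormal Haar transform of a length-2^J signal, O(N):
    returns [coarsest approximation, details from coarse to fine]."""
    approx = np.asarray(x, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # detail at this scale
        approx = (even + odd) / np.sqrt(2)          # half-length approximation
    return np.concatenate([approx] + details[::-1])

x = np.random.default_rng(0).standard_normal(16)
w = haar_transform(x)
print(np.isclose(np.linalg.norm(w), np.linalg.norm(x)))   # True: orthonormal
```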

For the three bases Hadamard-Walsh, Noiselets and Fourier, because of their non-locality (i.e., their support is the whole segment $[0, 1]$), you cannot run a similar algorithm. However, you can use the Cooley-Tukey algorithm arising from the butterfly diagrams linked to the corresponding recursive definitions (Eqs. (2) and (4) above).

This algorithm runs in $O(N \log N)$, since the final diagram has $\log_2 N$ levels, each involving $N$ multiplication-additions.
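For comparison, here is a recursive NumPy sketch of the corresponding noiselet butterfly (my own unnormalized code, built on the $1 \pm i$ recursion of Eq. (2)); each of the $\log_2 N$ levels performs $O(N)$ multiply-adds:

```python
import numpy as np

def fast_noiselet(x):
    """O(N log N) butterfly computing the correlations <x, f_j> of x with
    the 2^N noiselets f_j, 2^N <= j < 2^(N+1), for len(x) = 2^N
    (unnormalized scaling)."""
    x = np.asarray(x, dtype=complex)
    if len(x) == 1:
        return x
    h = len(x) // 2
    cL, cR = fast_noiselet(x[:h]), fast_noiselet(x[h:])  # two half-size calls
    out = np.empty(len(x), dtype=complex)
    out[0::2] = (1 + 1j) * cL + (1 - 1j) * cR   # children f_{2n}
    out[1::2] = (1 - 1j) * cL + (1 + 1j) * cR   # children f_{2n+1}
    return out
```

A quick sanity check: a constant signal has a perfectly flat spectrum (all coefficients equal), in line with the unit-average property discussed in Q3.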

Feel free to comment on this post and ask other questions. Perhaps it will eventually grow into a general Noiselet FAQ/HOWTO 😉

Posted in Compressed Sensing | 9 Comments

## New class of RIP matrices ?

Wow, almost a year and a half without any post here… Shame on me! I’ll try to be more productive with shorter posts now 😉

I just found on arXiv this interesting paper about the concentration properties of submodular functions (very common in “Graph Cut” methods, for instance):

### A note on concentration of submodular functions. (arXiv:1005.2791v1 [cs.DM])

Jan Vondrak, May 18, 2010

We survey a few concentration inequalities for submodular and fractionally subadditive functions of independent random variables, implied by the entropy method for self-bounding functions. The power of these concentration bounds is that they are dimension-free, in particular implying standard deviation $O(\sqrt{\mathbb{E}[f]})$ rather than $O(\sqrt{n})$ which can be obtained for any 1-Lipschitz function of $n$ variables.

In particular, the author shows some interesting concentration results in his corollary 3.2.

Without having worked out any details, I’m wondering if this result could serve to define a new class of matrices (or non-linear operators) satisfying either the Johnson-Lindenstrauss Lemma or the Restricted Isometry Property.

For instance, by starting from Bernoulli vectors $X$, i.e., the rows of a sensing matrix, and defining some specific submodular (or self-bounding) functions $f$ (e.g., $f(X_1,\cdots, X_n) = g(X^T a)$ for some sparse vector $a$ and a “kind” function $g$), I’m wondering if the concentration results above are better than those coming from the classical concentration inequalities (based on the Lipschitz properties of $f$ or $g$; see, e.g., the books of Ledoux and Talagrand)?

Ok, all this is perhaps just due to too early thoughts …. before my mug of black coffee 😉

Posted in General | 1 Comment

## 1000th visit and some Compressed Sensing “humour”

As detected by Igor Carron, this blog has reached its 1000th visit! Well, perhaps it’s the 1000th robot visit 😉

Yesterday I found some very funny (math) jokes on Bjørn’s maths blog about “How to catch a lion in the Sahara desert” with some … mathematical tools.

Bjørn collected there many ways to realize this task from many places on the web. There are really tons of examples. To give you an idea, here is Schrödinger’s method:

“At any given moment there is a positive probability that there is a lion in the cage. Sit down and wait.”

or this one :

“The method of inverse geometry: We place a spherical cage in the desert and enter it. We then perform an inverse operation with respect to the cage. The lion is then inside the cage and we are outside.”

So, let’s try something about Compressed Sensing. (Note: if you have something better than my infamous suggestion, I would be very happy to read it as a comment to this post.)

“How to catch a lion in the Sahara desert”

The compressed sensing way: first, you consider that a single lion in a big desert is definitely a very sparse situation, comparing the lion’s size to the desert area. No need for a cage: just project the whole desert randomly onto a dune of only 5 times the lion’s weight! Since the lion obviously died in this shrinking operation, you use the RIP (!)… and, relaxed, you eventually reconstruct its best tame approximation.

Image: Wikipedia

Posted in General | 13 Comments

## SPGL1 and TV: Answers from SPGL1 Authors

Following my previous post, which received various interesting comments (many thanks to Gabriel Peyré, Igor Carron and Pierre Vandergheynst), I sent a mail to Michael P. Friedlander and Ewout van den Berg to point them to that post and possibly obtain their points of view.

Nicely, they sent me interesting answers (many thanks to them). Here they are (using the notations of the previous post):

Michael’s answer is about the need for a TV Lasso solver:

“It’s an intriguing project that you describe.  I suppose in principle the theory behind spgl1 should readily extend to TV (though I haven’t thought how a semi-norm might change things).  But I’m not sure how easy it’ll be to solve the “TV-Lasso” subproblems.  Would be great if you can see a way to do it efficiently. “

Ewout, on his side, explained this:

“The idea you suggest may very well be feasible, as the approach taken in SPGL1 can be extended to other norms (i.e., not just the one-norm), as long as the dual norm is known and there is way to orthogonally project onto the ball induced by the primal norm. In fact, the newly released version of SPGL1 takes advantage of this and now supports two new formulations.

I heard (I haven’t had time to read the paper) that Chambolle has described the dual to the
TV-norm. Since the derivate of $\phi(\tau) =\|y-Ax_\tau\|_2$ on the appropriate interval is given by the dual norm on $A^Ty$, that part should be fine (for the one norm this gives the infinity norm).

In SPGL1 we solve the Lasso problem using a spectrally projected gradient method, which
means we need to have an orthogonal projector for the one-norm ball of radius $tau$. It is not immediately obvious how to (efficiently) solve the related problem (for a given $x$):

minimize $\|x - p\|_2$ subject to $\|p\|_{\rm TV} \leq \tau$.

However, the general approach taken in SPGL1 does not really care about how the Lasso
subproblem is solved, so if there is any efficient way to solve

minimize $\|Ax-b\|_2$ subject to $\|x\|_{\rm TV} \leq \tau$,

then that would be equally good. Unfortunately it seems the complexification trick (see the previous post) works only from the image to the differences; when working with the differences themselves, additional constraints would be needed to ensure consistency in the image; i.e., that
summing up the difference of going right first and then down, be equal to the sum of
going down first and then right.”

In a second mail, Ewout added an explanation on this last remark :

“I was thinking that perhaps, instead of minimizing over the signal it would be possible to minimize over the differences (expressed in complex numbers in the two-dimensional setting). The problem with that is that most complex vectors do not represent difference vectors (i.e., the differences would not add up properly). For such an approach to work, this consistency would have to be enforced by adding some constraints.”

Actually, I saw similar considerations in A. Chambolle’s paper. It is even clearer in the paper he wrote with J.-F. Aujol, “Dual Norms and Image Decomposition Models”. They develop there the notion of TV (semi-)norms for different exponents $p$ (i.e., in the $\ell_p$ norm used on the $\ell_2$ norm of the gradient components), and in particular they answer the problem of finding and computing the corresponding dual norms. For the usual TV norm, this leads to the G-norm:

$\|u\|_{\rm G} = {\rm inf}_g\big\{ \|g\|_\infty=\max_{kl} \|g_{kl}\|_2\ :\ {\rm div}\,g = u,\ g_{kl}=(g^1_{kl},g^2_{kl})\,\big\},$

where, as in the continuous setting, ${\rm div}$ is the discrete divergence operator defined as the adjoint of the finite-difference gradient operator $\nabla$ used to define the TV norm. In other words, $\langle -{\rm div}\,g, u\rangle_X = \langle g, \nabla u\rangle_Y$, where $X=\mathbb{R}^N$ and $Y=X\times X$.
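This adjoint relation can be checked numerically. Here is a NumPy sketch of the usual finite-difference discretization (forward differences for $\nabla$, and the matching divergence, as in Chambolle’s paper; the function names are mine):

```python
import numpy as np

def grad(u):
    """Forward differences with Neumann boundary (last row/column left at 0)."""
    g1, g2 = np.zeros_like(u), np.zeros_like(u)
    g1[:-1, :] = u[1:, :] - u[:-1, :]
    g2[:, :-1] = u[:, 1:] - u[:, :-1]
    return g1, g2

def div(g1, g2):
    """Discrete divergence, defined so that div = -grad^T."""
    d = np.zeros_like(g1)
    d[0, :] += g1[0, :]
    d[1:-1, :] += g1[1:-1, :] - g1[:-2, :]
    d[-1, :] -= g1[-2, :]
    d[:, 0] += g2[:, 0]
    d[:, 1:-1] += g2[:, 1:-1] - g2[:, :-2]
    d[:, -1] -= g2[:, -2]
    return d

# numerical check of <-div g, u> = <g, grad u>
rng = np.random.default_rng(0)
u = rng.standard_normal((8, 8))
g1, g2 = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
G1, G2 = grad(u)
print(np.isclose(np.sum(-div(g1, g2) * u), np.sum(g1 * G1 + g2 * G2)))  # True
```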

Unfortunately, the computation of the G-norm does not seem as straightforward as that of its dual counterpart, and an optimization method must be used. I don’t know if this could lead to an efficient implementation of a TV SPGL1.

## SPGL1 and TV minimization ?

Recently, I was using the SPGL1 toolbox to recover some “compressed sensed” images. As a reminder, SPGL1 implements the method described in “Probing the Pareto Frontier for basis pursuit solutions” by Michael P. Friedlander and Ewout van den Berg. It solves the Basis Pursuit DeNoise (or $BP_\sigma$) problem with a noise level $\sigma >0$:

$\min_{u\in\mathbb{R}^N} \|u\|_1\ \textrm{such that}\ \|Au - b\|_2 \leq \sigma,$

where $A\,\in\,\mathbb{R}^{m\times N}$ is the usual measurement matrix for a measurement vector $b\in \mathbb{R}^m$, and $\|u\|_{1}$ and $\|u\|_{2}$ are the $\ell_1$ and $\ell_2$ norms of the vector $u\in\mathbb{R}^N$ respectively. In short, as shown by E. Candès, J. Romberg and T. Tao, if $A$ is well behaved, i.e., if it satisfies the so-called Restricted Isometry Property for any $2K$-sparse signal, then the solution of $BP_\sigma$ approximates (with a controlled error) a $K$-sparse (or compressible) signal $v$ such that $b = Av + n$, where $n$ is an additive noise vector with power $\|n\|_2 \leq \sigma$.

The reason for this post is the following: I’m wondering if SPGL1 could be “easily” transformed into a solver of Basis Pursuit with the Total Variation (TV) norm, that is, the minimization problem $TV_{\sigma}$:

$\min_{u\in\mathbb{R}^N} \|u\|_{TV}\ \textrm{such that}\ \|Au - b\|_2 \leq \sigma,$

where $\|u\|_{TV} = \|D u\|_1$, with $(D u)_j = (D_1 u)_j + i\,(D_2 u)_j$ the $j$th component of the complex finite-difference operator applied to the vectorized image $u$ of $N$ pixels (in a set of coordinates $x_1$ and $x_2$). I have used here a “complexification” trick, putting the finite differences $D_1$ and $D_2$ along the directions $x_1$ and $x_2$ in the real part and the imaginary part of the complex operator $D$ respectively. The TV norm of $u$ is then really the $\ell_1$ norm of $Du$.
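Concretely, the trick can be sketched as follows in NumPy (my own toy code on a random image, with zero-padded differences at the border): the $\ell_1$ norm of the complex field $Du$ coincides with the usual isotropic TV norm.

```python
import numpy as np

u = np.random.default_rng(0).standard_normal((8, 8))   # a small test image

# finite differences along x1 and x2 (zero at the border)
D1u, D2u = np.zeros_like(u), np.zeros_like(u)
D1u[:-1, :] = u[1:, :] - u[:-1, :]
D2u[:, :-1] = u[:, 1:] - u[:, :-1]

Du = D1u + 1j * D2u                                # the "complexification" trick

tv_complex = np.sum(np.abs(Du))                    # l1 norm of the complex Du
tv_isotropic = np.sum(np.sqrt(D1u**2 + D2u**2))    # usual isotropic TV norm
print(np.isclose(tv_complex, tv_isotropic))        # True
```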

This problem is particularly well designed for the reconstruction of compressed sensed images, since most of them are very sparse in the “gradient basis” (see for instance some references about Compressed Sensing for MRI). Minimizing the TV norm, since it is performed in the spatial domain, is also sometimes more efficient than minimizing the $\ell_1$ norm in a particular sparsity basis (e.g., 2-D wavelets, curvelets, …).

Therefore, I would say that, as for the initial SPGL1 theoretical framework, it could be interesting to study the Pareto frontier related to $TV_\sigma$, even if the TV norm is now only a semi-norm, i.e., $\|u\|_{TV} =0$ does not imply $u =0$ but only $u_j = c$ for a certain constant $c \in \mathbb{R}$.

To explain that point better, let me first summarize the paper of Friedlander and van den Berg quoted above. They proposed to solve $BP_\sigma$ through a sequence of LASSO problems (or $LS_\tau$) parametrized by $\tau>0$:

$\min_{u\in\mathbb{R}^N} \|Au - b\|_2\ \textrm{such that}\ \|u\|_1 \leq \tau.$

If I’m right, the key idea is that there exists a $\tau_\sigma$ such that $LS_{\tau_\sigma}$ is equivalent to $BP_\sigma$. The whole problem is thus to find this $\tau_\sigma$. SPGL1 finds it iteratively, using the fact that the $LS_\tau$ problems define a smooth and decreasing function of $\tau$ (the Pareto curve) through the $\ell_2$ norm of the residual $r_\tau = b - Au_\tau$, where $u_\tau$ is the solution of $LS_{\tau}$. More precisely, the function

$\phi(\tau) \triangleq \|r_\tau\|_2$

is decreasing from $\phi(0) = \|b\|_2$ down to $\sigma$, reached at a value $\tau_{\sigma}>0$ such that

$\phi(\tau_\sigma) = \sigma.$

Interestingly, the derivative $\phi'(\tau)$ exists on $[0, \tau_\sigma]$ and is simply equal to $-\|A^Ty_\tau\|_{\infty}$ with $y_\tau = r_\tau / \|r_\tau\|_2$.

As explained, at the point $\tau=\tau_\sigma$, the problem $LS_\tau$ provides the solution to $BP_\sigma$. But since both $\phi$ and $\phi'$ are known, a Newton method on this Pareto curve can iteratively estimate $\tau_\sigma$ from the implicit equation $\phi(\tau_\sigma) = \sigma$. Practically, this is done by solving an approximate $LS_\tau$ at each $\tau$ (and the convergence of the Newton method remains linear).

In the end, the whole approach is very efficient for solving high-dimensional BPDN problems (such as BPDN for images), and the final computational cost is mainly due to the forward and adjoint applications of the matrix/operator $A$ to vectors.

So what happens now if the $\ell_1$ norm is replaced by the TV norm in this process? If we switch from $BP_\sigma$ to $TV_\sigma$? Is there a “SPGL1 way” to solve that?

The function $\phi$ resulting from such a context would now have the initial point $\phi^2(0) = \|b\|^2_2 - (b^TA{\bf 1})^2/\|A{\bf 1}\|^2_2$ (with ${\bf 1}$ the constant vector), since a zero TV norm means a constant signal $u = c\,{\bf 1}$ (the value of $\phi(0)$ arises simply from the minimization over $c$). Notice that if $A$ is for instance a Gaussian measurement matrix, $\phi(0)$ will be very close to $\|b\|_2$, since the expected average of any of its rows is zero.
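This value of $\phi(0)$ is just the least-squares fit of $b$ by the image of a constant signal; here is a quick NumPy check (the Gaussian $A$ and the sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 20, 64
A = rng.standard_normal((m, N)) / np.sqrt(m)   # Gaussian measurement matrix
b = rng.standard_normal(m)

A1 = A @ np.ones(N)                            # image of the constant vector
c = (b @ A1) / (A1 @ A1)                       # best constant amplitude
phi0 = np.linalg.norm(b - c * A1)              # residual for u = c*1
phi0_closed_form = np.sqrt(b @ b - (b @ A1)**2 / (A1 @ A1))
print(np.isclose(phi0, phi0_closed_form))      # True
```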

For the rest, I’m unfortunately not sufficiently familiar with convex optimization theory to deduce what $\phi'$ becomes in the TV framework (hum, I should definitely study that).

However, for the $\ell_1$ case, $\phi(\tau)$ (i.e., $LS_\tau$) is computed approximately for each $\tau$. This approximation, which is also iterative, uses a projection operator to guarantee that the current candidate solution remains feasible, i.e., remains in the $\ell_1$ ball $\{u : \|u\|_1 \leq \tau\}$. As usual, this projection is accomplished through a soft-thresholding procedure, i.e., as the solution of the problem

$\min_{u} \|w -u\|^2_2 + \lambda \|u\|_{1}$,

where $w$ is the point to project, and where $\lambda>0$ is set so that the projection lies inside the $\ell_1$ ball above.
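Such a projection can be computed exactly in $O(N \log N)$ by a sort-based search for $\lambda$ (the algorithm of Duchi et al.; the sketch below is mine):

```python
import numpy as np

def project_l1_ball(w, tau):
    """Euclidean projection of w onto {u : ||u||_1 <= tau} by soft
    thresholding, with the threshold lam found by sorting |w|."""
    if np.sum(np.abs(w)) <= tau:
        return w.copy()                      # already feasible
    a = np.sort(np.abs(w))[::-1]             # magnitudes, decreasing
    cssv = np.cumsum(a)
    # last index k (0-based) with a_k > (cssv_k - tau)/(k+1)
    k = np.nonzero(a * np.arange(1, len(w) + 1) > cssv - tau)[0][-1]
    lam = (cssv[k] - tau) / (k + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.random.default_rng(0).standard_normal(100)
p = project_l1_ball(w, tau=5.0)
print(np.isclose(np.sum(np.abs(p)), 5.0))    # True: lands on the ball boundary
```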

For the TV minimization case, the TV ball $\{u : \|u\|_{TV} \leq \tau\}$ defining the feasible set of the approximate LASSO procedure would similarly require a projection operator, equivalent to the one solving the problem

$\min_{u} \|w -u\|^2_2 + \lambda \|u\|_{TV}$.

This is somehow related to one of the lessons provided in the TwIST paper (“A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration”) of J. Bioucas-Dias and M. Figueiredo about the so-called Moreau function: there is a deep link between some iterative resolutions of a regularized BP problem using a given sparsity metric, e.g., the $\ell_1$ or the TV norm, and the canonical denoising method for this metric, i.e., when the measurement operator is the identity, giving soft thresholding or TV denoising respectively.

Thanks to the implementation of Antonin Chambolle (used also by TwIST), this last canonical TV minimization can be computed very quickly. Therefore, if needed, the required projection onto the TV ball above could also be inserted in a potential “SPGL1 for TV sparsity problems”.

OK… I agree that all this is just a very rough intuition. There are a lot of points to clarify and to develop. However, if you know something about all this (or if you detect that I’m totally wrong), or if you just want to comment on this idea, feel free to use the comment box below…

Posted in Compressed Sensing | 23 Comments

## Matching Pursuit Before Computer Science

Recently, I found some interesting references about techniques designed around 1938 that, in my opinion, could be qualified as (variants of) Matching Pursuit. Perhaps this is well known to researchers in the right scientific field, but here is what I recently discovered.

From what I knew until yesterday, when S. Mallat and Z. Zhang [1] defined their greedy, iterative algorithm named “Matching Pursuit” to decompose a signal $s\in \mathbb{R}^N$ into a linear combination of atoms taken from a dictionary $\mathcal{D}=\{g_k\in\mathbb{R}^N: 1\leq k\leq d, g_k^Tg_k = 1\}$ of $d$ elements, the previous work they were referring to was the “Projection Pursuit” of J. Friedman and W. Stuetzle [2] in the field of statistical regression methods.

In short, MP is very simple. It reads (using matrix notations)

$\quad R^0 = s, A^0 = 0,\hfill{\rm (initialization)}\\ \quad R^{n+1}\ =\ R^n\ -\ (g_{*}^T R^n)\, g_{*},\hfill{\rm (residual\ update)}\\ \quad A^{n+1}\ =\ A^n\ +\ (g_{*}^T R^n)\,g_{*} \hfill{\rm (reconstruction)}\\$

with at each step $n\geq 0$

$\quad g_{*} \ = \ {\rm arg\,max}_{g\in\mathcal{D}}\ |g^T R^n| \hfill{\rm (sensing)}.$

The quantities $R^n$ and $A^n$ are the residual and the approximation of $s$ respectively at the $n$th MP step ($A^n$ is thus an $n$-term approximation of $s$). A quite direct modification of MP is Orthogonal Matching Pursuit [8], where only the index (or parameters) of the best atom (i.e., the one maximizing its correlation with $R^n$) is recorded at each iteration, and the approximation $A^n$ is computed by a least-squares minimization on the set of the $n$ previously selected atoms.

It is proved in [1] that MP always converges to … something, since the energy of the residual $\|R^n\|$ decreases steadily towards 0 with $n$. Under certain extra assumptions on the dictionary $\mathcal{D}$ (e.g., a small coherence, or cumulative coherence, which roughly measures its closeness to an orthogonal basis), it is also proved that, if $s$ is described by a linear combination of a few elements of the dictionary (a sparse or compressible signal), i.e., $s = \mathcal{D}\alpha_*$ for some $\alpha_*\in\mathbb{R}^d$ having few non-zero (or large) components, then OMP recovers $\alpha_*$ in the set of coefficients $(g_{*}^T R^n)$ computed at each iteration [9]. For instance, in the trivial case of an orthonormal basis (i.e., with vanishing coherence), (O)MP iteratively finds $\alpha = \mathcal{D}^{-1} s = \alpha_*$.
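To make the iteration above concrete, here is a minimal NumPy implementation of plain MP (my own toy code), checked on the trivial orthonormal case, where $K$ iterations recover a $K$-sparse $\alpha_*$ exactly:

```python
import numpy as np

def matching_pursuit(s, D, n_iter):
    """Plain MP: D is an (N x d) dictionary matrix with unit-norm columns."""
    R = s.astype(float)                   # residual R^n
    alpha = np.zeros(D.shape[1])          # accumulated coefficients
    for _ in range(n_iter):
        c = D.T @ R                       # (sensing) correlations g^T R^n
        k = np.argmax(np.abs(c))          # best atom g_*
        alpha[k] += c[k]                  # (reconstruction)
        R = R - c[k] * D[:, k]            # (residual update)
    return alpha, R

rng = np.random.default_rng(0)
N, K = 32, 4
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))   # orthonormal dictionary
alpha_true = np.zeros(N)
alpha_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
alpha_rec, R = matching_pursuit(Q @ alpha_true, Q, K)
print(np.allclose(alpha_rec, alpha_true))          # True: exact recovery
```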

Dozens (or hundreds?) of variations of these initial greedy methods have been introduced since their first formulation in the signal processing community. These variations have improved, for instance, the rate of convergence of the initial MP through the iterations, or its ability to solve the sparse approximation problem (i.e., finding the $\alpha_*$ expressed above), or have adapted MP techniques to specific problems like the emerging Compressed Sensing. Let’s quote for instance gradient pursuit, stagewise OMP, CoSaMP, regularized MP, subspace pursuit, … (see here and here for more information on these).

Another variation of (O)MP, explained by K. Schnass and P. Vandergheynst in [3], splits the sensing part from the reconstruction part of the initial MP algorithm above (adding also the possibility to select more than one atom per iteration). Indeed, the selection of the best atom $g_*$ is performed there by a sensing dictionary $\mathcal{S}$, while the reconstruction stage building the residuals and approximations is still assigned to $\mathcal{D}$. In short, this variation is also proved to solve the sparse problem if the two dictionaries satisfy a small cross (cumulative) coherence criterion, which is easier to fulfill than requiring a small (cumulative) coherence of a single dictionary as in the initial (O)MP.

I introduced this last (O)MP variation in more detail because it is under this form that I discovered it in the separate works of R.V. Southwell [4] and G. Temple [5] (the latter being more readable in my opinion), published in 1935 and in 1938 (!!), i.e., before the building of the first (electronic) computers in the 1940s.

The context of the method in Southwell’s initial work was the determination of stresses in structures. Let’s summarize his problem: the structure is modeled by $N$ connected springs. If $\alpha_*\in\mathbb{R}^N$ represents any motion vector of the spring extremities then, at the new equilibrium state reached when some external forces $F_{\rm external}\in\mathbb{R}^N$ are applied to the structure, the internal forces provided by the springs follow of course Hooke’s law, i.e., $F_{\rm internal}=D \alpha_*$ for a certain symmetric matrix $D\in\mathbb{R}^{N\times N}$ containing the spring constants, and finally Newton’s first law implies:

$F_{\rm internal} + F_{\rm external} = D \alpha_* + F_{\rm external} = 0$.

Southwell’s global problem was thus: given a linear system of equations $D\alpha_* = s$, with $s= -F_{\rm external}$ and $D$ positive definite, how can you recover $\alpha_*$ in practice from $s$ and $D$? As explained by Temple, a solution to this problem is of course also “applicable to any problem which is reducible to the solution of a system of non-homogeneous, linear, simultaneous algebraic equations in a finite number of unknown variables”.

Nowadays, the numerical solution seems trivial: take the inverse of $D$ and apply it to $s$; and if $D$ is really big (or even small, since I’m lazy), compute it for instance with Matlab and run “>> inv(D)*s” (or do something cleverer with the “\” Matlab operator).

However, imagine the same problem in the 1930s! And assume you have to invert a ridiculously small matrix of size 13×13. It can take really long to solve analytically, and needlessly so, since you are interested only in finding $\alpha_*$. That’s why some people were interested at that time in computing $D^{-1}s$, or an approximation of it, without this painful inverse computation.

The technique found by R.V. Southwell and generalized later by Temple [4,5] was dubbed “Successive Relaxation”, inside a more general context named “Successive Approximations”. Mathematically, rewriting that work in notations similar to those of modern Matching Pursuit methods, the Successive Relaxation algorithm reads:

$R^0 = s,\ \alpha^0 = 0,\hfill {\rm (initialization)}\\ R^{n+1}\ =\ R^n\ -\ \beta\,(R^n)_{k^*} D_{k^*},\hfill{\rm (residual\ update)}\\ \alpha^{n+1}\ =\ \alpha^n\ +\ \beta\,(R^n)_{k^*}\, e_{k^*}\hfill{\rm (reconstruction)},$

where $\beta\in (0,2)$, $D_j$ is the $j$th column of $D$, $(R^n)_{m}$ is the $m$th component of $R^n$, $e_j$ is the vector such that $(e_j)_k=\delta_{jk}$ (canonical basis vector), and with, at each step $n\geq 0$, the selection (sensing)

$k^{*} = {\rm arg\,max}_{k} |(R^n)_k|, \hfill{\rm (sensing)}.$

i.e. the component of the $n$th residual with the highest amplitude.

In this framework, since $D$ is positive definite and thus non-singular, it is proved in [5] that the vectors $\alpha^n$ tend to the true answer $\alpha_*=D^{-1}s$. The parameter $\beta$ controls how much you remove from the residual and add to the approximation at each step. You can easily prove that the decrease of the residual energy is of course maximal when $\beta=1$.
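To see Successive Relaxation at work, here is a small NumPy reproduction (my own toy setup: a random positive-definite $D$ rescaled to unit diagonal, the natural normalization for the $\beta \in (0,2)$ condition) showing the convergence of $\alpha^n$ to $D^{-1}s$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 13                                     # a Southwell-sized problem
B = rng.standard_normal((N, N))
D = B @ B.T + N * np.eye(N)                # symmetric positive definite
S = np.diag(1.0 / np.sqrt(np.diag(D)))
D = S @ D @ S                              # rescaled to unit diagonal
s = D @ rng.standard_normal(N)             # right-hand side

beta = 1.0
R, alpha = s.copy(), np.zeros(N)
for _ in range(4000):
    k = np.argmax(np.abs(R))               # (sensing) largest residual entry
    step = beta * R[k]
    alpha[k] += step                       # (reconstruction)
    R -= step * D[:, k]                    # (residual update)

alpha_star = np.linalg.solve(D, s)         # ground truth for comparison
```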

In other words, in the concepts introduced in [3], they designed a Matching Pursuit where the reconstruction dictionary is made of the columns of $D$ and the sensing dictionary is the identity (the canonical basis) of $\mathbb{R}^N$. Amazing, no?

An interesting lesson from the algorithm above is the presence of the factor $\beta$. In the more general context of (modern) MP with a non-orthonormal dictionary, such a factor could be useful to minimize the “decoherence” effect observed experimentally in the decomposition of a signal when this one is not exactly fitted by elements of the dictionary (e.g., in image processing, arbitrarily oriented edges to be described by horizontal and vertical atoms).

G. Temple [5] also extended the method to infinite-dimensional Hilbert spaces with a different updating step for $\alpha^n$. This is nothing else but the foundation of the (continuous) MP studied by authors like R. DeVore and V. Temlyakov some years ago [7] (on that topic you can also read [6], a paper I wrote with C. De Vleeschouwer giving a geometric description of this continuous formalism).

By googling a bit on “matching pursuit” and Southwell, I found this presentation by Peter Bühlmann, who makes a more general connection between Southwell’s work, Matching Pursuit, and greedy algorithms (around slide 15) in the context of “Iterated Regularization for High-Dimensional Data”.

In conclusion to all that: who was the person who said that we do nothing but reinvent the wheel? 😉

If you want to extend this kind of “archeology of Matching Pursuit”, please feel free to add some comments below. I’ll be happy to read them and to improve my general knowledge of the topic.

Laurent

### References :

[1] : S. G. Mallat and Z. Zhang, Matching Pursuits with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, December 1993, pp. 3397-3415.

[2] : J. H. Friedman and J. W. Tukey (Sept. 1974). “A Projection Pursuit Algorithm for Exploratory Data Analysis“. IEEE Transactions on Computers C-23 (9): 881–890. doi:10.1109/T-C.1974.224051. ISSN 0018-9340.

[3] : K. Schnass and P. Vandergheynst, Dictionary preconditioning for greedy algorithms, IEEE Transactions on Signal Processing, Vol. 56, Nr. 5, pp. 1994-2002, 2008.

[4] : R. V. Southwell, “Stress-Calculation in Frameworks by the Method of Systematic Relaxation of Constraints”, I and II, Proc. Roy. Soc. Series A, Mathematical and Physical Sciences, Vol. 151, No. 872 (Aug. 1, 1935), pp. 56-95.

[5] : G. Temple, The General Theory of Relaxation Methods Applied to Linear Systems, Proc. Roy. Soc. Series A, Mathematical and Physical Sciences, Vol. 169, No. 939 (Mar. 7, 1939), pp. 476-500.

[6] : L. Jacques and C. De Vleeschouwer, “A Geometrical Study of Matching Pursuit Parametrization“, Signal Processing, IEEE Transactions on [see also Acoustics, Speech, and Signal Processing, IEEE Transactions on], 2008, Vol. 56(7), pp. 2835-2848

[7] : R.A. DeVore and V.N. Temlyakov. “Some remarks on greedy algorithms.” Adv. Comput. Math., 5:173–187, 1996.

[8] : Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in Proceedings of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 1, Pacific Grove, CA, pp. 40-44, Nov. 1-3, 1993.

[9] : Tropp, J, “Greed is good: algorithmic results for sparse approximation“, IEEE T. Inform. Theory., 2004, 50, 2231-2242

#### Image credit :

EDSAC was one of the first computers to implement the stored program (von Neumann) architecture. Wikipedia.

Posted in Greedy | Tagged | 6 Comments