## Matching Pursuit Before Computer Science

Recently, I have found some interesting references about techniques designed around 1938 and that, in my opinion, could be qualified of (variant of) Matching Pursuit. Perhaps this is something known by a lot of researchers in the right scientific field, but here is however what I recently discovered.

From what I knew until yesterday, when S. Mallat and Z. Zhang  defined their greedy or iterative algorithm named “Matching Pursuit” to decompose a signal $s\in \mathbb{R}^N$ into a linear combination of atoms taken in a dictionary $\mathcal{D}=\{g_k\in\mathbb{R}^N: 1\leq k\leq d, g_k^Tg_k = 1\}$ of $d$ elements, the previous work to which they were referring to was the “Progression Pursuit” of J. Friedman and W. Stuetzle  in the field of statistical regression methods.

In short, MP is very simple. It reads (using matrix notations) $\quad R^0 = s, A^0 = 0,\hfill{\rm (initialization)}\\ \quad R^{n+1}\ =\ R^n\ -\ (g_{*}^T R^n)\, g_{*},\hfill{\rm (reconstruction)}\\ \quad A^{n+1}\ =\ A^n\ +\ (g_{*}^T R^n)\,g_{*} \hfill{\rm (reconstruction)}\\$

with at each step $n\geq 0$ $\quad g_{*} \ = \ {\rm arg\,max}_{g\in\mathcal{D}}\ |g^T R^n| \hfill{\rm (sensing)}.$

The quantities $R^n$ and $A^n$ are the residual and the approximation of $s$ respectively at the $n$th MP step (so also an approximation in $n$ terms of $s$). A quite direct modification of MP is the Orthogonal Matching Pursuit  where only the index (or parameters) of the best atom (i.e. maximizing its correlation with $R^n$) at each iteration is recorded, and the approximation $A^n$ computed by a least square minimization on the set of the $n$ previously selected atoms.

It is proved in  that MP converges always to … something, since the energy of the residual $\|R^n\|$ decreases steadily towards 0 with $n$.  Under certain extra assumptions on the dictionary $\mathcal{D}$ (e.g. with small coherence, or cumulative coherence, that roughly measure its closeness to an orthogonal basis) it is also proved that, if $s$ is described by a linear combination of few elements of the dictionary (for sparse or compressible signal), i.e. with $s = \mathcal{D}\alpha_*$ for $\alpha_*\in\mathbb{R}^d$ having few non-zero (or large) components, then OMP recovers $\alpha_*$ in the set of coefficients $(g_{*}^T R^n)$ computed at each iteration . For instance, in the trivial case of an orthonormal basis (i.e. with vanishing coherence) (O)MP finds iteratively $\alpha = \mathcal{D}^{-1} s = \alpha_*$.

Dozens (or hundreds ?) of variations of these initial greedy methods have been introduced since their first formulations in the signal processing community. These variations have improved for instance the initial MP rate of convergence through the iterations, or the ability to solve the sparse approximation problem (i.e. finding $\alpha_*$ expressed above), or are MP techniques adapted to some specific problem like the emerging Compressed Sensing. Let’s quote for instance the gradient Pursuit, stagewise OMP, CoSaMP, regularized MP,  subspace pursuit, … (see here and here for more informations on these).

Another variation of (O)MP explained by K. Schnass and P. Vandergheynst in , splits the sensing part from the reconstruction part in the initial MP algorithm above (adding also the possibility to select more than only one atom per iteration). Indeed, the selection of the best atom $g_*$ is performed there by a Sensing dictionary $\mathcal{S}$ while the reconstruction stage building the residuals and approximations is still assigned to $\mathcal{D}$. In short, this variation is also proved to solve the sparse problem if the two dictionaries satisfy a small cross (cumulative) coherence criterion, which is easier to fulfill than asking for a small (cumulative) coherence of only one dictionary in the initial (O)MP.

I introduced more precisely this last (O)MP variation above since it is under this form that I discovered it in the separated works of R.V. Southwell  and G. Temple  (the last being more readable in my opinion) in 1935 and in 1938 (!!), i.e. before the building of the first (electronic) computers in the 40’s.

The context of the method in the initial Southwell’s work was the determination of stresses in structures. Let’s summarize his problem : the structure was modelized by $N$ connected springs. If $\alpha_*\in\mathbb{R}^N$ represents any motion vector of the springs extremities, then, at the new equilibrium state reached when some external forces $F_{\rm external}\in\mathbb{R}^N$ are applied to the structure, the internal forces provided by the springs follows of course the Hooke law, i.e $F_{\rm internal}=D \alpha_*$ for a certain symmetric matrix $D\in\mathbb{R}^{N\times N}$ containing the spring constants, and finally Newton’s first law implies : $F_{\rm internal} + F_{\rm external} = D \alpha_* + F_{\rm external} = 0$.

The global problem of Southwell was thus : given a linear system of equations $D\alpha_* = s$, with $s= -F_{\rm external}$ and $D$ positive definite, how can you recover practically $\alpha_*$ from $s$ and $D$  ? As explained by Temple, a solution to this problem is of course also “applicable to any problem which is reducible to the solution of a system of non-homogeneous, linear, simultaneous algebraic equations in a finite number of unknown variables”.

Nowadays, the numerical solution seems trivial : take the inverse of $D$ and apply it to $s$, and if $D$ is really big (or even small since I’m lazy) compute $D^{-1}$ for instance with Matlab and run “>> inv(D)*s” (or do something clever with the “/” Matlab operator).

However, imagine the same problem in the 30’s ! And assume you have to inverse a ridiculously small matrix of size 13×13. It can be really long to solve it analytically and worthless since you are interested in finding $\alpha_*$. That’s why some persons were interested at that time in computing $A^{-1}s$, or an approximation to it, without to have this painful inverse computation.

The technique found by R.V. Southwell and generalized later by Temple [4,5], was dubbed of “Successive Relaxation” inside a more general context named “Successive Approximations”. Mathematically, rewriting that work under notations similar to these of modern Matching Pursuit methods, Successive Relaxation algorithm reads : $R^0 = s,\ \alpha^0 = 0,\hfill {\rm (initialization)}\\ R^{n+1}\ =\ R^n\ -\ \beta\,(R^n)_{k^*} D_{k^*}\\ \alpha^{n+1}\ =\ \alpha^n\ +\ \beta\,(R^n)_{k^*}\, e_{k^*}\hfill{\rm (reconstruction)},$

where $\beta\in (0,2)$, $D_j$ is the $j$th column of $D$, $(R^n)_{m}$ is the $m$th component of $R^n$, $e_j$ is the vector such that $(e_j)_k=\delta_{jk}$ (canonical basis vector), and with, at each step $n\geq 0$, the selection (sensing) $k^{*} = {\rm arg\,max}_{k} |(R^n)_k|, \hfill{\rm (sensing)}.$

i.e. the component of the $n$th residual with the highest amplitude.

In this framework, since $D$ is positive definite and thus non-singular, it is proved in  that the vectors $\alpha^n$ tend to the true answer $\alpha_*=D^{-1}s$. The parameter $\beta$ controls the importance of what you removed or add in the residual and in the approximation respectively. You can prove easily that the decreasing of the residual energy is of course maximum when $\beta=1$.

In other words, in the concepts introduced in , they designed a Matching Pursuit where they selected for the reconstruction dictionary the orthonormal basis $D$ and for the sensing dictionary the identity (the canonical basis) of $\mathbb{R}^N$. Amazing, No ?

An interesting learning of the algorithm above is the presence of the factor $\beta$. In the more general context of (modern) MP with non-orthonormal dictionary, such a factor could be useful to minimize the “decoherence” effect observed experimentally in the decomposition of a signal when this one is not exactly fitted by elements of the dictionary (e.g., in image processing, arbitrarily oriented edges to be described by horizontal and vertical atoms).

G. Temple in  extended also the method to infinite dimensional Hilbert spaces for a different updating step of $\alpha^n$. This is nothing else but the foundation of the (continuous) MP studied by authors like R. DeVore and Temlyakov some years ago (on that topic you can read also , i.e. a paper I wrote with C. De Vleeschouwer for a geometric description of this continuous formalism).

By googling a bit on “matching pursuit” and Southwell, I found this presentation of Peter Buhlmann who makes a more general connection between Southwell’s work, Matching Pursuit, and greedy algorithms (around slide 15) in the context of “Iterated Regularization for High-Dimensional Data“.

In conclusion of all that, who is this person who explained that we do nothing but always reinventing the wheel ? 😉

If you want to complete this kind of “archeology of Matching Pursuit” please feel free to add some comments below. I’ll be happy to read them and improve my general knowledge of the topic.

Laurent

### References :

 : S. G. Mallat and Z. Zhang, Matching Pursuits with Time-Frequency Dictionaries, IEEE Transactions on Signal Processing, December 1993, pp. 3397-3415.

 : J. H. Friedman and J. W. Tukey (Sept. 1974). “A Projection Pursuit Algorithm for Exploratory Data Analysis“. IEEE Transactions on Computers C-23 (9): 881–890. doi:10.1109/T-C.1974.224051. ISSN 0018-9340.

 : K. Schnass and P. Vandergheynst, Dictionary preconditioning for greedy algorithms, IEEE Transactions on Signal Processing, Vol. 56, Nr. 5, pp. 1994-2002, 2008.

 : R. V. Southwell, “Stress-Calculation in Frameworks by the Method of “Systematic Relaxation of Constraints“. I and II. Proc Roy. Soc. Series A, Mathematical and Physical Sciences, Vol. 151, No. 872 (Aug. 1, 1935), pp. 56-95

 : G. Temple, The General Theory of Relaxation Methods Applied to Linear Systems, Proc. Roy. Soc. Series A, Mathematical and Physical Sciences, Vol. 169, No. 939 (Mar. 7, 1939), pp. 476-500.

 : L. Jacques and C. De Vleeschouwer, “A Geometrical Study of Matching Pursuit Parametrization“, Signal Processing, IEEE Transactions on [see also Acoustics, Speech, and Signal Processing, IEEE Transactions on], 2008, Vol. 56(7), pp. 2835-2848

 : R.A. DeVore and V.N. Temlyakov. “Some remarks on greedy algorithms.” Adv. Comput. Math., 5:173–187, 1996.

 : Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal projection pursuit: Recursive function approximation with applications to wavelet decomposition,” in Proceedings of Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 1, (Pacific Grove, CA), pp. 40-
44, NOV. 1-3, 1993.

 : Tropp, J, “Greed is good: algorithmic results for sparse approximation“, IEEE T. Inform. Theory., 2004, 50, 2231-2242

#### Image credit :

EDSAC was one of the first computers to implement the stored program (von Neumann) architecture. Wikipedia.

This entry was posted in Greedy and tagged . Bookmark the permalink.

### 6 Responses to Matching Pursuit Before Computer Science

1. Igor Carron says:

Laurent,

When you “was the determination of stresses in frameworks”, did you mean “was the determination of stresses in structures” or “was the determination of structural stress” ?

Igor.

2. jackdurden says:

Actually, I was using the same terminology than the one of G. Temple in the beginning of his paper. I think that the real meaning is “the determination of stresses in structures” in the sense that they had to determine the motion of the equilibrium state of a network of spring (used as a model of the structure) under the application of external forces on the structure.

In that context, if $x$ represents the vector of motion of the springs extremities, then, at the new equilibrium state reached when there exist external force $F=-b$, $Ax$ represents the Hooke’s law expressing the spring force w.r.t. the deviations $x$ (i.e. internal forces), so that $Ax + F = Ax - b = 0$ is the Newton’s first law that allows to find x.

Laurent

3. jackdurden says:

I update the text in function of what I explained. Thank you Igor. That point was indeed unclear and it deserved a better explanation.

Laurent

4. Igor Carron says:

Thanks for the explanation, it definitely provided some context that one could relate to. By the way, when reading the paper by Southwell, would there be a case where the physics allows for D to be a rectangular matrix leading to an undetermined system. Could some boundary conditions make the system undetermined for instance ?

Cheers,

Igor.

5. jackdurden says:

That’s a very interesting aspect. I cannot really answer but here are some (random) thoughts.

Perhaps one possibility would be to stay with this structure model and continue to see the knowledge of the external forces as the measurements. So, you would not be required to know all the external forces applied to each spring, but only a few of them if you can assume than the spring motions (i.e. $\alpha_*$) are sparse. In that case, you could work with a random sub-sampling of the rows of $D$ (as a random subsampling of any orthogonal basis) and still be able to deduce $\alpha_*$.

Interestingly, the non-zero entries of $D$ corresponds to the non-zero entries of the connectivity matrix of the springs, and the value of the non-zero entries are the spring constants of the Hooke’s model. In 2-D, this link with the connectivity could lead also to some interesting interpretations related to graph theory and measurement matrix (e.g. recent paper of Goldberg, Indyk and al.). But I’m not at all an expert on that topic and I’m perhaps wrong.

Finally, in Temple’s paper, the system $D\alpha_*=s$, since inherited from Newton’s first law, has a connection with an “energy” description. Indeed, this system is valid when the energy $W=V_1+V_2$ formed with the usual spring potential energy $V_1 = \frac{1}{2}\alpha_*^T D \alpha_*$ and the (assumed constant) external forces potential $V_2= -s^T\alpha_*$ is minimum (equilibrium state). Temple created another Matching Pursuit like method (not explained in the post) to minimize as fast as possible this energy in the iterations.
I’m very curious to see if such a (quadratic) energy concept can survive outside of this physical model as another theoretical element to minimize to find $\alpha_*$ in a CS formulation of the problem. But perhaps it is worthless since possibly just equivalent to a least square like minization problem. I don’t know. It is just some thoughts 😉

Laurent

6. jackdurden says:

Correction: $D$ is not the connectivity matrix but the Laplacian made with the connectivity matrix, i.e. the degree of the connectivity on the diagonal, and minus the connectivity on the off-diagonal elements.