Murua Alejandro


The goal of every clustering (or unsupervised classification) method is to estimate the "true" clustering structure or partition underlying the feature space. This estimate is usually unveiled by optimizing a score function that gives a figure-of-merit on admissible partitions of the data. In fields such as bioinformatics (e.g. microarray gene expression data, proteomic sequencing) and text mining (document clustering), clustering high-dimensional data sets is a common problem. Note that one of the main di culties with genome data is the usually small sample size together with the high-dimensionality of the data. Sound machine learning and statistical methods for variable selection and/or data reduction based on the intrinsic structure of the data are needed. In the absence of such techniques, efficient and reliable clustering methods whose score functions are based on similarity or distance matrices are advantageous. Kernel methods and their extensions such as kernel K-means, multiway normalized cut (MNCut), and, recently, adaptive Potts model clustering use the similarity matrix information in an effective way, often yielding good results.

The goal of my current research is to develop sound machine learning and probabilistic methods to find (1) intrinsic structure (clustering), and (2) patterns (variable selection or functionals that intrinsically describe and interpret the data in few dimensions), in mainly complex and/or high-dimensional data and large datasets.

This research is partly based on the belief that the key to the success of modeling complex data structures lies in the embedding of kernel-based methods within parametric and Bayesian models. For example, the Bayesian parametric paradigm gives a framework for introducing structural constraints, whereas the Kernel based methods introduces flexibility (e.g. local, spatial and/or temporal adaptation). Our penalized adaptive Potts model clustering implicitly models the clusters densities via kernel density estimators (a by-product of using a Mercer kernel within a Potts model), and uses Bayesian computational techniques (Markov Chain Monte Carlo) to estimate the clusters and their number, and other parameters of the model.

Some examples of key applications guiding this research are finding groups of genes associated to certain diseases (e.g. leukemia or lung cancer) in order to devise effective individualized treatments; discovering the di erent developmental stages of sensory organs (e.g. vision); finding the role of proteins through association with proteins whose functions are known; finding relevant association between drugs and potential adverse effects; and discovering neuronal firing patterns in the brain using fMRI images.


. Murua, A., Stanberry, L. and Stuetzle, W., (2008) "On Potts model clustering, kernel K-means and density estimation," Journal of Computational & Graphical Statistics, 17 (2008), 629-658.

Stanberry, L., Murua, A. and Cordes, D., "Functional connectivity mapping using the ferromagnetic Potts spin model." Human Brain mapping, 29 (2008) 422-440.

Murua, A., and Wicker, N., (2010) "The Conditional-Potts Clustering Model," Submitted for publication.

This work uncovers the link between several seemingly unconnected kernel-based methods (e.g. kernel K-means, Multiway Normalized Cut). Within the same framework it reveals (and re-introduces) the Potts model as one of the simplest to conceive kernel-based clustering method. This work also connects kernel-based methods to non-parametric kernel density estimation. It uses these two findings to present a powerful kernel-based clustering methodology based on Potts model, kernel density estimation, and consensus clustering. The second paper presents an important application to the exploration of brain connectivity via our adaptive Potts model clustering methodology described in the first paper. Our work has been presented at several conferences all over the world. The third paper presents a Bayesian kernel-based clustering method. The model arises as an embedding of the Potts density for label membership probabilities into an extended Bayesian model for joint data and label membership probabilities. In contrast to previous work, our Conditional-Potts clustering model introduces a complete principled way to do clustering. This is accomplished through the elucidation of an informative prior for the model parameters, and also by the development of a Wang-Landau flat-histogram-like stochastic algorithm to estimate the posterior of the parameters. In particular, the link between the Potts and the random cluster model models allows us to elucidate, using random graph theory, a very informative prior for the temperature parameter.

. Murua, A., and Wicker, N., "Kernel-based Mixture Models for Classification," Submitted for publication.

Kernels are now everywhere present in statistics as far as a dot product is at hand. However to the best of our knowledge kernels have not been used in mixture models. In the present work we show that they can be useful for classification purposes. They o er a flexibility in the modeling process through the kernel trick which enables capturing interesting features in some cases more easily than just using the standard methods. Our method is based on mixtures of Gamma distributions. These model the point distances to cluster centroids in the transformed space. The distances are readily computed using the kernel trick. Nested within our model are the two special cases of a mixture of exponentials and the kernel K-means method. We suggest using the log-likelihood ratio or the Bayesian Information criterion to select an appropriate parsimonious model for the data at hand. A comparison with other popular classification methods such as support vector machines, shows that our method is very competitive and computationally efficient.

. Murua, A., Stuetzle, W., Tantrum, J. and Sieberts, S., "Model based document classification and clustering." International Journal of Statistics and Tomography, 8 (2008) 1-25.

Tantrum, J., Murua, A. and Stuetzle, W. (2004), "Hierarchical model-based clustering of large datasets through fractionation and refractionation," Information Systems, 29, 315-326.

Tantrum, J., Murua, A. and Stuetzle, W. (2003), "Assessment and pruning of hierarchical model-based clustering," Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining (KDD03), 197-205.

In this work we develop a complete methodology for document classification and clustering based on model-based clustering. This explicitly models the data as being drawn from a Gaussian mixture. One main advantage of our approach is the ability to automatically select the number of clusters present in the document collection via Bayes factors. We also reviewed fractionation, an idea originally conceived by Cutting, Karger, Pedersen and Tukey (1992) for non-parametric hierarchical clustering of large data sets, and described an adaptation of it that extends model-based clustering to large data sets. We called this procedure model-based fractionation. The main idea is to reduce the data size by working with meta-data form by clusters of data points. Each meta-data is formed recursively and carries the information contained in the empirical moments of the associated cluster. We also conceived a further extension of fractionation, called model-based refractionation. This latter procedure proved successful even when dealing with di cult situations comprising sets with large numbers of small clusters. A key advantage of model-based refractionation over other competing clustering of large data sets approaches is that it does not require the number of clusters be known a priori. Part of this work was first presented at the 2002 International Conference On Data Mining and Knowledge Discovery (KDD’02) in Edmonton, Canada, where it was selected as one of the best papers. We were later on invited to submit our extended work to the Information Systems Journal.

. Yeung, K.Y., Fraley, C., Murua, A., Raftery, A. and Ruzzo, L. (2001), "Model-Based clustering and data transformations for gene expression data," Bioinformatics, 17, 977-987.

This is a pioneering work on exploratory analysis of gene-expression microarray data. In it we benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external validation were available. We showed that on real gene expression data model-based clustering performed as well as leading heuristic clustering methods. But our approach has the advantage of suggesting the number of clusters (through the Bayesian information criterion BIC) as well as an appropriate model. We also assessed the degree to which real gene expression data fits the multivariate Gaussian mixture assumptions on several common data transformations. We showed that suitable chosen transformations can greatly enhance normality in gene expression data, and that models have varying performance on data sets that are transformed di erently. This paper has been identified by Thomson-ISI as one of the most cited papers in the research area of Gene Expression Data, and it continues to be cited in most journals that publish work on bioinformatics.

. Murua, A. Martin, D. (2005) "S+Bayes" library.

Doug Martin and I have conceived, developed and implemented several Bayesian hierarchical generalized linear mixed models. Among them are linear, Poisson and Binomial models. The fruit of this work has been compiled in the so-called S+Bayes software library (written in C++ with Graphical User Interface in S-PLUS). The goal is to facilitate Bayesian inference techniques to a larger audience than just statisticians in Academia. The library is freely available from Insightful Corporation.

. Murua, A. (2002), "Upper bounds for error rates associated to linear combination of classifiers, "IEEE Transactions on Pattern Analysis and Machine Intelligence, 9, 591-602.

In this work I introduced a useful notion of weak dependence between many classifiers built using the same training data. I showed that if both this dependence is low and the expected margins are large, then decision rules based on linear combinations of these classifiers can achieve error rates that decrease exponentially fast. Several experiments showed that the three common methods used to construct multiple classifiers from the same training data -bagging, boosting and randomized trees (random forests)- produce mutually weakly dependent trees. Furthermore, these results also suggested that there is a trade-off between weak dependence and expected margins: to compensate for low expected margins, there should be low mutual dependence among the classifiers making up the combination. This result has been cited in the engineering and pattern recognition literature as M "Murua's theory."

. Amit, Y. and Murua, A. (2001), "Speech recognition using randomized relational decision trees," IEEE Transactions on Speech and Audio Processing, 9, 333-341..

In this work we recognized speech signals using a large collection of coarse acoustic events which describe temporal relations between a few local cues in the spectrogram. The invariance to changes in duration of speech signal events was addressed by defining temporal relations in a rather coarse manner, allowing a large degree of slack. The problem is of combinatorial nature. However our acoustic events were defined in terms of labeled graphs and hence inherited a partial ordering. This allowed us to use multiple randomized trees to access the large pool of acoustic events in a systematic manner. The trees were aggregated to produce a classifier. We showed that the learning stage of this procedure is much more efficient than that for hidden Markov models (HMM). It also performed much better than HMM in our experiments. This work was presented as an invited talk at the World Congress of the Bernoulli Society.