Variable selection for model-based clustering. Application for transcriptome data analysis

Cathy MAUGIS

2008, November

Abstract

We are interested in variable selection for clustering with Gaussian mixture models. This research is motivated in particular by the clustering of genes described by transcriptome datasets. In both parts of the thesis, the problem is treated as a model selection problem within a model-based clustering framework.
In the first part, the proposed model, which generalizes that of Raftery and Dean (2006), specifies the role of each variable in the clustering process. The variables that are irrelevant for clustering may depend on a subset of the relevant variables. Models are compared with a BIC-like criterion. Model identifiability is established, and the consistency of the criterion is proved under regularity conditions. In practice, the variable roles are determined by an algorithm that embeds two backward stepwise algorithms, one for variable selection in clustering and one for variable selection in linear regression. The value of this procedure is illustrated in particular by an application to a transcriptome dataset. To avoid overpenalizing some models, a refinement of the variable role modelling is suggested, which partitions the irrelevant variables according to whether they depend on some relevant clustering variables or are independent of them. Finally, since DNA microarray technology generates many missing values, an extension of the variable selection procedure that takes missing entries into account is proposed. It avoids the imputation of missing entries usually carried out in preprocessing.
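As a rough illustration of the first part, the sketch below scores a candidate set of clustering variables by adding the BIC of a Gaussian mixture fitted on those variables to the BIC of a multivariate linear regression of the remaining variables on them, then removes variables by backward stepwise search. It is a simplified sketch and not the procedure of the thesis: the number of components is fixed, the mixture form is fixed to full covariances, the irrelevant variables are regressed on all relevant variables rather than on a selected subset, and BIC follows the scikit-learn convention (lower is better). The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def clustering_bic(X_s, n_components, seed=0):
    """BIC of a full-covariance Gaussian mixture on the candidate variables."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          n_init=3, random_state=seed).fit(X_s)
    return gmm.bic(X_s)  # -2 log-likelihood + n_params * log(n)


def regression_bic(Y, X_s):
    """BIC of a Gaussian linear regression of Y on X_s (with intercept)."""
    n, q = Y.shape
    design = np.column_stack([np.ones(n), X_s])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    resid = Y - design @ coef
    sigma = resid.T @ resid / n  # maximum likelihood residual covariance
    _, logdet = np.linalg.slogdet(sigma)
    loglik = -0.5 * n * (q * np.log(2 * np.pi) + logdet + q)
    n_params = design.shape[1] * q + q * (q + 1) // 2
    return -2.0 * loglik + n_params * np.log(n)


def criterion(X, relevant, n_components):
    """Clustering BIC on `relevant` plus regression BIC of the other columns."""
    relevant = sorted(relevant)
    others = [j for j in range(X.shape[1]) if j not in relevant]
    total = clustering_bic(X[:, relevant], n_components)
    if others:
        total += regression_bic(X[:, others], X[:, relevant])
    return total


def backward_selection(X, n_components):
    """Drop one variable at a time while the criterion strictly decreases."""
    relevant = list(range(X.shape[1]))
    best = criterion(X, relevant, n_components)
    while len(relevant) > 1:
        trials = [(criterion(X, [j for j in relevant if j != k], n_components), k)
                  for k in relevant]
        trial_best, drop = min(trials)
        if trial_best >= best:
            break
        relevant.remove(drop)
        best = trial_best
    return relevant, best


# Hypothetical synthetic data: two clusters visible in columns 0 and 1 only.
rng = np.random.default_rng(0)
n = 300
labels = rng.integers(0, 2, size=n)
x0 = rng.normal(6.0 * labels, 1.0)
x1 = rng.normal(-6.0 * labels, 1.0)
x2 = 0.5 * x0 + rng.normal(0.0, 1.0, n)  # no cluster information beyond x0
x3 = rng.normal(0.0, 1.0, n)             # independent noise
X = np.column_stack([x0, x1, x2, x3])

selected, score = backward_selection(X, n_components=2)
print("selected clustering variables:", selected)
```

In this sketch the search stops as soon as no single removal lowers the criterion, so the returned score is never worse than that of the full variable set.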
In the second part, specific Gaussian mixtures are considered, and a non-asymptotic penalized criterion is proposed to select the number of mixture components and the subset of relevant clustering variables. A general model selection theorem for maximum likelihood estimation, due to Massart (2007), is used to obtain the form of the penalty function. This theorem requires controlling the bracketing entropy of the Gaussian mixture families under study. Since the criterion depends on unknown constants, the "slope heuristics" method is used to calibrate it in practice.
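The calibration idea behind the slope heuristics can be sketched as follows, under the assumption that the penalty is proportional to the model dimension: for the most complex models, the maximized log-likelihood is expected to grow roughly linearly with the dimension, the slope of that line estimates the unknown constant, and the final penalty uses twice that slope. The function name, the fraction of models used for the fit, and the synthetic values below are hypothetical and are not taken from the thesis.

```python
import numpy as np


def slope_heuristic_select(dims, neg_loglik, frac_largest=0.5):
    """Pick a model by minimizing neg_loglik + 2 * kappa_hat * dims.

    kappa_hat is the slope of -neg_loglik against the dimension,
    fitted by least squares on the most complex models only.
    """
    dims = np.asarray(dims, dtype=float)
    neg_loglik = np.asarray(neg_loglik, dtype=float)
    order = np.argsort(dims)
    n_fit = max(2, int(np.ceil(frac_largest * len(dims))))
    largest = order[-n_fit:]
    kappa_hat, _ = np.polyfit(dims[largest], -neg_loglik[largest], deg=1)
    crit = neg_loglik + 2.0 * kappa_hat * dims
    return int(np.argmin(crit)), float(kappa_hat)


# Hypothetical contrast values: a bias term that vanishes as the dimension
# grows, minus a term that is linear in the dimension (slope 2).
dims = np.arange(1, 21)
neg_loglik = 500.0 + 400.0 * 0.5 ** dims - 2.0 * dims

best, kappa_hat = slope_heuristic_select(dims, neg_loglik)
print("selected dimension:", dims[best], "estimated slope:", round(kappa_hat, 3))
```

The choice of which models count as "most complex" is a tuning decision in this sketch; in practice it is usually checked visually or with a stability rule.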

Type

Thesis

Publication

PhD Thesis, University Paris-Sud 11

Advisors: Gilles Celeux (INRIA Saclay Ile-de-France) and Marie-Laure Martin-Magniette (AgroParisTech and URGV, CR1 INRA)

Referees: Yannick Baraud (University of Nice Sophia-Antipolis, France) and Mark van der Laan (University of California at Berkeley)

Committee:

Christophe Ambroise
Sébastien Aubourg
Yannick Baraud
Gilles Celeux
Marie-Laure Martin-Magniette
Pascal Massart