2022

Masters Projects

@Mathematics and Statistics

+Computer Science

+Biology

+Medicine

 

#Multiple imputation

#complex data

#missing values

#biology

#heterogeneity

#sequential imputation

#clusterwise regression

 

Project Summary

Identifying clusters of individuals in heterogeneous data is a classical task in data mining. However, many cluster analysis methods do not handle missing data. Missing data has been widely studied in the literature, but much less so in the context of clustering.

Multiple imputation is an efficient way to handle missing data, notably in the regression framework, and more recently in clustering. Its principle is to replace each missing value with several plausible values. Denoting this number by M, the procedure generates M imputed (i.e. completed) data sets. In the linear regression framework, regression coefficients are then estimated from each imputed data set, yielding M sets of regression coefficients. Finally, these estimates are pooled using Rubin’s rules, which provide a single point estimate together with a single estimate of its standard error.
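The pooling step can be illustrated with a short Python sketch. Under Rubin's rules the pooled estimate is the average of the M estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. The function name `pool_rubin` and its arguments are hypothetical and chosen for illustration only; they are not taken from the project.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M estimates and their squared standard errors with Rubin's rules.

    estimates: array of shape (M,) or (M, p), one row per imputed data set.
    variances: same shape, the squared standard errors of those estimates.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.shape[0]

    pooled = estimates.mean(axis=0)          # pooled point estimate
    within = variances.mean(axis=0)          # average within-imputation variance
    between = estimates.var(axis=0, ddof=1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between

    return pooled, np.sqrt(total)            # point estimate and its standard error
```

For example, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` pools three estimates of a single coefficient, each with squared standard error 0.5.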

A classical way to generate imputed data is to impute variables one at a time using univariate regression models. This technique, known as fully conditional specification, is well established in the medical field and popular well beyond it. However, such methods are not tailored to heterogeneous data. Some recent work has addressed this issue, but it remains limited.
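A minimal Python sketch of standard fully conditional specification with linear regression may help fix ideas. It is a deliberately simplified illustration, not the project's method: the function name `fcs_impute` is hypothetical, all variables are assumed continuous, every column is assumed to have at least some observed values, and the regression parameters are not redrawn from their posterior distribution, which a proper multiple imputation procedure would do.

```python
import numpy as np

def fcs_impute(X, n_iter=10, rng=None):
    """Return one completed copy of X (NaN marks a missing value)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    n, p = X.shape

    # Start from a crude fill: each missing entry gets its column mean.
    filled = X.copy()
    col_means = np.nanmean(X, axis=0)
    filled[miss] = np.take(col_means, np.where(miss)[1])

    for _ in range(n_iter):
        for j in range(p):                      # impute variables one at a time
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            # Regress column j on all other (currently completed) columns.
            design = np.column_stack([np.ones(n), np.delete(filled, j, axis=1)])
            beta, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
            resid = filled[obs, j] - design[obs] @ beta
            dof = max(obs.sum() - design.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Draw imputations: prediction plus Gaussian residual noise.
            noise = rng.normal(0.0, sigma, size=(~obs).sum())
            filled[~obs, j] = design[~obs] @ beta + noise
    return filled
```

Calling the function M times with different seeds, for instance `[fcs_impute(X, rng=m) for m in range(M)]`, gives M imputed data sets that differ only in the imputed entries.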

Thus, in this project we propose a novel fully conditional specification method based on clusterwise linear regression instead of linear regression. Clusterwise regression methods boil down to building a collection of local regression models, so that each group of individuals is associated with a specific regression model. In this work, the novel multiple imputation method is presented, and its properties are studied theoretically and assessed through an extensive simulation study.
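The idea of clusterwise linear regression can be sketched in Python with a simple alternating scheme: fit one least-squares model per group, then reassign each individual to the group whose model predicts it best. This is only one possible fitting strategy, assumed here for illustration; the summary does not say which algorithm the project uses, and the function name `clusterwise_regression` is hypothetical. The result depends on the random initial partition and may be a local optimum.

```python
import numpy as np

def clusterwise_regression(X, y, k=2, n_iter=20, rng=None):
    """Alternate between fitting k local regressions and reassigning points."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])   # add an intercept column
    labels = rng.integers(0, k, size=n)         # random initial partition
    betas = np.zeros((k, design.shape[1]))

    for _ in range(n_iter):
        # Step 1: fit one linear model per group (skip groups that are too small).
        for g in range(k):
            members = labels == g
            if members.sum() >= design.shape[1]:
                betas[g], *_ = np.linalg.lstsq(design[members], y[members], rcond=None)
        # Step 2: assign each individual to the model with the smallest squared residual.
        sq_resid = (y[:, None] - design @ betas.T) ** 2
        new_labels = sq_resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                               # partition is stable
        labels = new_labels
    return labels, betas
```

The function returns the group label of each individual and one row of regression coefficients (intercept first) per group.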

From a practical point of view, this method offers practitioners an efficient way to handle complex data in their medical studies. More generally, this multiple imputation method can be valuable for any cluster analysis with missing values (since cluster analysis implicitly assumes heterogeneity between individuals) and thus covers a wide range of applications in data science.

 

Matthieu Resche-Rigon

 
