2022
Masters Projects
@Mathematics and Statistics
+Computer Science
+Biology
+Medicine
#Multiple imputation
#complex data
#missing values
#biology
#heterogeneity
#sequential imputation
#clusterwise regression
Project Summary
Identifying clusters of individuals in heterogeneous data is a classical task in data mining. However, many cluster analysis methods do not handle missing data. Missing data have been studied extensively in the literature, but rarely in the context of clustering.
Multiple imputation is an efficient way to handle missing data, notably in the regression framework and, more recently, in clustering. Its principle is to replace each missing value with several plausible values. Denoting this number by M, the procedure generates M imputed (i.e. completed) data sets. In the linear regression framework, regression coefficients are then estimated from each imputed data set, yielding M sets of regression coefficients. Finally, these estimates are pooled using Rubin’s rules, which provide a single point estimate together with a single estimate of its standard error.
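The pooling step can be illustrated with a minimal Python sketch of Rubin’s rules. The function name `pool_rubin` and the numerical values are hypothetical and chosen only for illustration. The pooled estimate is the mean of the M estimates, and the total variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M estimates and their squared standard errors with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.shape[0]
    pooled = estimates.mean(axis=0)            # pooled point estimate
    within = variances.mean(axis=0)            # average within-imputation variance
    between = estimates.var(axis=0, ddof=1)    # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, np.sqrt(total)

# Hypothetical coefficient estimates and squared standard errors from M = 3 imputed data sets
est = [1.0, 1.2, 0.8]
var = [0.04, 0.05, 0.03]
pooled, se = pool_rubin(est, var)
print(pooled, se)
```

The same function pools a whole coefficient vector if `estimates` and `variances` are given as arrays with one row per imputed data set.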
One classical way of generating imputed data is to impute the variables one at a time using univariate regression models. This technique, known as fully conditional specification, is well established in the medical field and popular well beyond it. However, such methods are not tailored to heterogeneous data. Some recent work addresses this issue, but it remains limited.
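As a rough illustration of imputing variables one by one, the following Python sketch cycles through the incomplete columns, regresses each on the others, and replaces its missing entries with a prediction plus Gaussian noise. It is a simplified sketch under stated assumptions, not the project's method: the function name `fcs_impute` and the simulated data are hypothetical, and the regression parameters are not drawn from their posterior distribution, which a proper multiple imputation procedure would do.

```python
import numpy as np

def fcs_impute(X, n_iter=10, seed=0):
    """Simplified chained-equations imputation with linear regressions."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means of the observed values
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    n, p = X.shape
    for _ in range(n_iter):
        for j in range(p):
            mis_j = missing[:, j]
            if not mis_j.any():
                continue
            # Regress column j on all other columns, using rows where j is observed
            A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A[~mis_j], X[~mis_j, j], rcond=None)
            resid = X[~mis_j, j] - A[~mis_j] @ beta
            dof = max((~mis_j).sum() - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Simplification: noise is added, but beta and sigma are not redrawn
            X[mis_j, j] = A[mis_j] @ beta + rng.normal(0.0, sigma, mis_j.sum())
    return X

# Hypothetical data with values removed at random in two columns
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 3))
data[:, 2] = data[:, 0] + 0.5 * data[:, 1] + rng.normal(scale=0.1, size=50)
data_missing = data.copy()
data_missing[rng.random(50) < 0.2, 2] = np.nan
data_missing[rng.random(50) < 0.2, 1] = np.nan

# M = 5 imputed data sets, one per seed
imputed_sets = [fcs_impute(data_missing, seed=m) for m in range(5)]
```

Each element of `imputed_sets` is a completed copy of the data in which the observed entries are left unchanged.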
Thus, in this project we propose a novel fully conditional specification method based on clusterwise linear regression instead of linear regression. Clusterwise regression methods boil down to building a collection of local regression models so that each group of individuals is associated with a specific regression model. In this work, the novel multiple imputation method is presented, its properties are studied theoretically, and its performance is assessed in an extensive simulation study.
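To make the idea of local regression models concrete, here is a minimal Python sketch of one common hard-assignment variant of clusterwise linear regression. It alternates between fitting one least-squares model per group and reassigning each individual to the model with the smallest squared residual. The summary does not specify which clusterwise algorithm the project uses, so the function `clusterwise_regression`, the number of groups and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def clusterwise_regression(x, y, k=2, n_iter=50, seed=0):
    """Alternate per-group least-squares fits and residual-based reassignment."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), x])   # design matrix with intercept
    labels = rng.integers(k, size=len(y))       # random initial partition
    betas = np.zeros((k, A.shape[1]))
    for _ in range(n_iter):
        for g in range(k):
            idx = labels == g
            # Refit only when the group has enough points to identify the model
            if idx.sum() >= A.shape[1]:
                betas[g], *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        # Assign each individual to the regression model that fits it best
        sq_resid = (y[:, None] - A @ betas.T) ** 2
        new_labels = sq_resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, betas

# Hypothetical data generated from two different regression lines
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
group = rng.integers(2, size=200)
y = np.where(group == 0, 2.0 * x + 1.0, -2.0 * x - 1.0) + rng.normal(scale=0.1, size=200)
labels, betas = clusterwise_regression(x, y, k=2)
print(betas)
```

Like k-means, this alternating scheme can stop at a local optimum, so in practice it is usually run from several random starting partitions.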
From a practical point of view, this method offers practitioners an efficient way to handle complex data in their medical studies. More generally, it can be valuable for any cluster analysis with missing values, since clustering implicitly assumes heterogeneity between individuals, and thus covers a wide range of applications in data science.
Matthieu Resche-Rigon