2022

Master's Projects

@Computer Science

+Mathematics/Statistics

+Biology

 

#Big data

#RNA-seq

#computational optimization

#GPU parallelization

#differential analysis

Project Summary

Next-generation sequencing technologies such as RNA-seq aim to quantify the transcriptome of biological samples and to compare gene expression between experimental conditions. The quantified genome alignments produced by such technologies are relative measurements, which cannot be compared directly between conditions without adequate data normalization. No consensus has yet been reached on the optimal approach to normalizing such data (Abrams et al. 2019). Existing methods suffer from practical limitations and may be compromised by the presence of genes with high expression levels or strong variability. In such cases, a single normalization procedure can lead to erroneous results and false conclusions.

A novel statistical framework for differential analysis in transcriptomics has therefore been proposed (Desaulle et al. 2021). It is based on intensive iterative random data normalizations and provides good control of the statistical errors. It is implemented in the R package DArand (Desaulle and Rozenholc 2021), which is publicly available from the Comprehensive R Archive Network. The current package is written in R and uses only CPU parallelization. Because of the large data size and the reliance on intensive iterative randomizations, further development of the project requires more advanced programming. More precisely, the iterative procedure is computationally intensive and can rapidly become time-consuming as the size of the transcriptomic experiment and the number of samples grow. The main mission of the internship will therefore be to adapt the code for efficient parallel processing on a graphics processing unit (GPU) using CUDA.

This computational optimization will play an important role in further methodological development. The subsequent contribution will aim to extend the methodology from two to more biological conditions. It will be directed towards statistical analyses involving more than two conditions, such as differential analysis, principal component analysis (PCA) and, more generally, unsupervised learning tools. Here the difficulty will be to preserve the iterative structure of the procedure with data normalization while combining results from different approaches to data analysis. The methodological work, implementation and validation will be followed by a real-data application involving miRNA data.
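To give a feel for why a procedure built on iterated random normalizations lends itself to GPU parallelization, here is a toy sketch in Python. It is not the DArand algorithm: the function name `random_normalization_votes`, the choice of a random reference gene set as the scaling factor, the pseudo-count of 1, the log2 fold-change threshold, and the example counts are all assumptions made for illustration. The point is only that each iteration draws its own normalization and is independent of the others, so iterations could in principle be distributed across GPU threads or blocks.

```python
import math
import random


def random_normalization_votes(counts, is_treated, n_iter=500, n_ref=2,
                               threshold=1.0, seed=0):
    """Toy illustration (hypothetical, not DArand).

    counts[g][s] is the raw count of gene g in sample s.
    is_treated[s] is True for samples of the second condition.
    Returns, for each gene, the fraction of iterations in which its
    absolute log2 fold change exceeded `threshold` after normalization
    by a randomly drawn reference gene set.
    """
    n_genes = len(counts)
    n_samples = len(counts[0])
    rng = random.Random(seed)
    votes = [0] * n_genes

    # Each pass of this loop is independent of the others, which is what
    # makes the scheme a natural candidate for parallel execution.
    for _ in range(n_iter):
        ref = rng.sample(range(n_genes), n_ref)
        # Per-sample scaling factor: total count over the reference genes.
        factors = [sum(counts[g][s] for g in ref) for s in range(n_samples)]

        for g in range(n_genes):
            # Pseudo-count of 1 avoids division by zero and log of zero.
            norm = [(counts[g][s] + 1.0) / (factors[s] + 1.0)
                    for s in range(n_samples)]
            treated = [v for v, t in zip(norm, is_treated) if t]
            control = [v for v, t in zip(norm, is_treated) if not t]
            mean_t = sum(treated) / len(treated)
            mean_c = sum(control) / len(control)
            if abs(math.log2(mean_t / mean_c)) > threshold:
                votes[g] += 1

    return [v / n_iter for v in votes]


# Made-up data: treated samples are sequenced at twice the depth, and only
# gene 0 changes between conditions once depth is accounted for.
counts = [
    [10, 12, 160, 200],
    [100, 110, 200, 220],
    [50, 55, 100, 110],
    [200, 190, 400, 380],
    [30, 33, 60, 66],
]
is_treated = [False, False, True, True]
scores = random_normalization_votes(counts, is_treated)
```

In this toy setting, a reference set that happens to contain the strongly changing gene distorts the scaling factors for that iteration, which is the kind of failure a single normalization cannot detect. Aggregating over many random draws makes such iterations a minority, at the price of a computation that grows with the number of iterations, genes and samples.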

 

Dorota Desaulle

 
