2022 internship opportunities for Master’s students

The internship opportunities currently offered by diiP are listed below. These offers are open to second-year Master’s students.

Optimizing a physical RNA force-field via Machine Learning

Supervised by: Samuela Pasquali (Université Paris Cité)

Description: The main goal of this project is to optimize our RNA model through machine learning (ML) and obtain a cutting-edge RNA force field that facilitates building functional three-dimensional structures for RNA molecules. We will employ ML to optimize the model, making extensive use of the structural data available in databanks and the sparse thermodynamic and dynamic data available from experiments. This approach will allow our model to give much more accurate and reliable structural predictions and to be deployed on systems with more complex architectures than is currently possible. Our aim is to anchor our force field model firmly in the underlying physics by adapting recent and promising Symbolic Regression algorithms to our data format and by selecting, on the basis of sound physical principles, the possible improvements to the functional form of the force field uncovered by this technique.
The M2 internship will be the first step of a larger project in which we propose first to use the existing functional form of the force field and train its 100+ coefficients, and then to build upon the ML pipeline developed in the first step to learn additional terms of the force field. The first step will serve two purposes: i) improving the existing, physics-based force field and ii) establishing an accuracy baseline for further improvement.

The work will be divided into four phases:
1. Set up the global optimization scheme coupling PyTorch to the coarse-grained force-field code.
2. Generate an appropriate training set of RNA structures.
3. Run the optimization on the training set.
4. Run the optimized force-fields on a set of benchmark systems.
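
As a rough illustration of what training the coefficients of a fixed functional form means, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the project's optimization scheme: the two energy terms, the sample coordinates, the reference energies and the names `energy_terms` and `fit_coefficients` are all hypothetical, and plain gradient descent stands in for the PyTorch-based pipeline mentioned above.

```python
# Toy illustration only: fit the coefficients of a fixed functional form
#   E(x) = c[0] * t0(x) + c[1] * t1(x)
# to reference energies by gradient descent on the mean squared error.
# The terms and data are hypothetical, not the actual RNA force field.

def energy_terms(x):
    """Hypothetical energy terms evaluated at a 1-D coordinate x."""
    return [x * x, 1.0 / (1.0 + x * x)]


def fit_coefficients(samples, targets, steps=2000, lr=0.05):
    """Return coefficients minimizing the mean squared energy error."""
    coeffs = [0.0, 0.0]
    n = len(samples)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, e_ref in zip(samples, targets):
            terms = energy_terms(x)
            residual = sum(c * t for c, t in zip(coeffs, terms)) - e_ref
            # d(residual^2)/dc_i = 2 * residual * t_i, averaged over samples
            for i, t in enumerate(terms):
                grad[i] += 2.0 * residual * t / n
        coeffs = [c - lr * g for c, g in zip(coeffs, grad)]
    return coeffs


# Synthetic reference energies generated from known coefficients,
# so that the fit can be checked against them.
samples = [0.0, 0.5, 1.0, 1.5, 2.0]
true_coeffs = [0.8, -1.5]
targets = [
    sum(c * t for c, t in zip(true_coeffs, energy_terms(x))) for x in samples
]
fitted = fit_coefficients(samples, targets)
```

In this toy the functional form stays fixed and only the coefficients move, which mirrors the first step described above; learning additional terms would change `energy_terms` itself.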

How to apply: please send a motivation letter and a CV to samuela.pasquali[at]u-paris.fr

Further information can be found here

The diffusion of technology during the last five Millennia

Supervised by: Thomas Chaney (Sciences Po/USC), Danial Lashkari (Boston College), Johannes Boehm (Sciences Po)

Description: Thomas Chaney (Sciences Po/USC), Danial Lashkari (Boston College), and Johannes Boehm (Sciences Po) are looking for a talented and enthusiastic Master’s student in Computer Science, Data Science, or related fields to work as an intern on an exciting project at the intersection of Economics, Data Science, and History. The objective of the project is to study technology diffusion through the lens of data on artefacts from museum records.

Requirements: The ideal candidate has experience doing NLP in Python, in particular with transformer-based language models. Fluency in English is required. Internships would last between 3 and 6 months during 2022.

How to apply: please send your CV and a short letter explaining qualifications and interest before 31/12/2021 to johannes.boehm[at]sciencespo.fr

Automatic detection and location of hydro-acoustic signals linked to Mayotte submarine eruption

Supervised by: Jean-Marie Saurel (IPGP), Lise Retailleau (IPGP/OVPF), Valérie Ferrazzini (IPGP/OVPF), Clément Hibert (ITES), Themis Palpanas (LIPADE, Université Paris Cité)

Description: The goal of this internship is to automatically detect and locate the hydro-acoustic signals in the continuous OBS recordings, from the first deployment in February 2019 until 2021. These events were first hand-picked on a small number of deployments starting from the end of 2019, when they were first identified. At each OBS station, the hydro-acoustic signals are recorded both on the three-component geophone and on the hydrophone (which usually shows clearer arrivals). Depending on the deployment, the sampling rate of the continuous time series varies from 62.5 Hz to 250 Hz, with 4 to 16 stations deployed at a time. Previous work on a 10-day deployment has shown that several types of waveforms related to these signals are recorded (Figure 2), although all of them last less than 0.1 s. Because of the high level of seismicity in Mayotte and because of instrumental noise and glitches, it is challenging to identify these small signals and discriminate them from other sources in the continuous OBS recordings. Another challenge is the amount and variety of data and instruments to be processed: in some cases more than two years of data were recorded on a few stations with broadband hydrophones, while shorter deployments involved more than ten stations equipped with short-period hydrophones.

How to apply: please contact Jean-Marie Saurel (saurel[at]ipgp.fr).

Further information can be found here.

[POSITION FILLED] Modeling and characterizing genetic variant pleiotropy using machine learning to understand the human genetic architecture

Supervised by: Marie Verbanck (Université Paris Cité)

Description: One concept in particular is resurging in human genetics today: pleiotropy. Pleiotropy occurs when one genetic element (e.g. variant, gene) has independent effects on several traits. Although pleiotropy is extremely common and thought to play a central role in the genetic architecture of human complex traits and diseases, it is one of the least understood phenomena. We have shown that several biological mechanisms exist and induce different pleiotropy states at the level of the variants. Specifically, we have conceptualized 5 biological mechanisms: 1) linkage disequilibrium; 2) causality between traits; 3) genetic correlation between traits; 4) high polygenicity of traits; 5) horizontal pleiotropy (true independent effects of a variant on two traits). This internship will be dedicated to building a comprehensive framework to disentangle all 5 states of pleiotropy and provide a genome-wide map of pleiotropy using machine learning. Specifically, we propose 1) to improve on a method that we have published in a proof-of-concept paper using unsupervised approaches based on penalized methods, random forests or deep learning; 2) to explore semi-supervised learning using a creative strategy to label data that we have developed. Human genetic variant databases are increasingly useful, from the interpretation of genetic analyses to clinical interpretation. We strongly believe that a database describing the pleiotropic nature of variants will complement existing databases and serve the community.

Requirements: The successful candidate:

will have a master's degree in data science linked to statistics or artificial intelligence; candidates with a more theoretical background who show a strong interest in life science applications are also welcome;
will be enthusiastic about transdisciplinary research and open science at the interface between data science and genetics;
will show a clear interest in using applied science methodology to benefit biological understanding;
will have good programming skills, preferably in R and/or Python;
can have a background in biology or genetics;
should be open-minded and willing to work as a team with other lab members;
is expected to pursue PhD training as funding has already been secured through the PleioMap project (Funding: Agence Nationale de la Recherche, PI: Dr Marie Verbanck);
will speak decent English since we are closely collaborating with Mount Sinai Hospital in New York City, US.

How to apply: please send a concise email describing your research interests and experience as well as an up-to-date CV to Marie Verbanck (marie.verbanck[at]u-paris.fr). Name and contact for references will be appreciated.

Further information can be found here.

Search for features in astrophysical objects close to cosmic neutrinos: An indirect approach using deep learning and statistical inference

Supervised by: Yvonne Becherini (Université Paris Cité, Astroparticule et Cosmologie, diiP) and Themis Palpanas (Université Paris Cité, LIPADE, diiP)

Description: The proposed work is in the field of Astroparticle Physics, focusing on the search for a connection between high-energy neutrinos and gamma rays in the extragalactic sky. Two large observatories have been designed to detect high-energy neutrinos from astrophysical environments: IceCube and KM3NeT. IceCube has already collected 10 years of data, which resulted in a catalogue of neutrinos with a high probability of being of cosmic origin, while KM3NeT is an observatory under construction. The significance of the IceCube cosmic neutrino signal shows that no firm conclusion can yet be drawn on the association of these neutrinos with astrophysical objects.

Requirements: Excellent Python programming skills, very good knowledge of deep learning frameworks (PyTorch/GPU, etc.) and of libraries used in data analysis workflows (NumPy, Matplotlib, etc.). Research or project experience and publications on deep learning or data analysis are a plus.

How to apply: Please send your resume and transcripts to Prof. Yvonne Becherini (yvonne.becherini[at]apc.in2p3.fr)

Further information can be found here.

[POSITION FILLED] Deep Learning-based EEG Epilepsy Detection and Analysis

Supervised by: Themis Palpanas (Université Paris Cité, LIPADE, diiP) and Qitong Wang (Université Paris Cité)

Description: The electroencephalogram (EEG) is one of the most common and essential medical signals collected by neural scientists for the analysis of nerve diseases. With the rapid development of medical instruments and data collection techniques, EEG analysis has also witnessed dramatic progress. One important problem in EEG analysis is epilepsy pattern detection and analysis. Epilepsy is a brain disease generally associated with seizures, which degrades the quality of life of many patients. This internship aims to design effective deployment schemes for modern deep learning techniques in EEG epilepsy detection, with a focus on real-world applications for neural scientists.

How to apply: Please send your resume and transcripts to Prof. Themis Palpanas (themis[at]mi.parisdescartes.fr)

Further information can be found here.

Explanation Methods for Multivariate Time Series Classification

Supervised by: Themis Palpanas (Université Paris Cité, LIPADE, diiP) and Paul Boniol (Université Paris Cité, EDF R&D)

Description: Various data series classification algorithms have been proposed in the past few years, including several deep learning methods. The advantages of a deep learning model are that it can benefit from GPU accelerations, and use Class Activation Maps (CAMs) to explain the classification results. Nevertheless, the CAM only provides a univariate data series regardless of the dimensionality of the input. Thus, for the case of multivariate data series, CAM will only be able to highlight significant temporal events without any information on which dimensions are relevant. Per our objectives, we will extend the CAM method for multivariate data series classification and anomaly detection. We will continue our ongoing work on novel methods that can provide a multivariate CAM. We will also study extensions, such that the explanation provided is rich enough to consider time dependencies between discriminant features.
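
The limitation described above can be seen in the usual CAM formula, CAM(t) = sum over k of w_k * A_k(t), where A_k are the feature maps of the last convolutional layer and w_k the weights linking them to one class. The sketch below, in plain Python, is only an illustration under assumptions: the activations and weights are made-up values, the function name `class_activation_map` is hypothetical, and it is not the internship's multivariate method.

```python
def class_activation_map(feature_maps, class_weights):
    """Compute CAM(t) = sum_k w_k * A_k(t) for one class.

    feature_maps: K lists of length T (activations of the last
    convolutional layer); class_weights: K floats for the chosen class.
    """
    length = len(feature_maps[0])
    return [
        sum(w * fmap[t] for w, fmap in zip(class_weights, feature_maps))
        for t in range(length)
    ]


# Hypothetical activations: K = 2 filters over T = 4 time steps.
feature_maps = [
    [0.0, 1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
]
class_weights = [0.5, 2.0]  # hypothetical weights for one class

cam = class_activation_map(feature_maps, class_weights)
print(cam)  # one score per time step: [2.0, 0.5, 3.0, 0.0]
```

The output has one value per time step and nothing per input dimension: the number of input dimensions never appears in the computation, which is why a standard CAM cannot say which dimensions mattered.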

How to apply: Please send your resume and transcripts to Prof. Themis Palpanas (themis[at]mi.parisdescartes.fr)

Further information can be found here.

Implementing intelligent algorithms for the multi-task detection and analysis of eruptions at Piton de la Fournaise

Supervised by: Charles Le Losq, Lise Retailleau and Aline Peltier

Description: The hazards and risks arising from the activity of volcanic edifices are numerous and often difficult to predict. The consequences of a poorly anticipated eruption can therefore be dramatic, as shown by various recent historical cases (e.g., El Chichon, Mexico, in 1982: 1,900 victims; Mt Pelée, France, in 1902: 30,000 victims). Active French volcanic edifices are closely monitored by the volcanological and seismological observatories of the Institut de Physique du Globe de Paris (IPGP). Various geophysical methods (seismology, deformation, GPS…) and geochemical methods (monitoring of gas emissions, fumarole temperatures, water geochemistry…) provide indicators of volcanic activity and make it possible to anticipate eruptions. The interpretation of these data thus guides the authorities in the decisions to be taken on risk mitigation. However, human analysis of the data is becoming potentially difficult because of their constantly increasing quantity and diversity. Moreover, relating observations of very different kinds, such as seismology and gas fluxes, is complex. In this M2 internship, we propose to test a machine-learning-based approach to build, from a dataset combining geochemical and geophysical observations, a multi-task algorithm for the automatic detection of magma ascent and eruptions. Our case study will be Piton de la Fournaise, on Réunion Island. This volcano has erupted very frequently over the past 20 years. Geophysical and geochemical data series with very good spatio-temporal coverage are available for this period. These data are marked by several major eruptions. In addition, they include periods with few eruptions over several years, and others with several eruptions per year.

How to apply: Please contact Charles Le Losq (lelosq[at]ipgp.fr)

Internship duration: from 01/02/2022 to 17/06/2022

Click here for further information.

Deep Learning-based Prediction of Query Answering Times for Data Series Similarity Search

Supervised by: Themis Palpanas (Université Paris Cité, LIPADE, diiP) and Qitong Wang (Université Paris Cité)

Description: Similarity search is a key operation in the analysis of (increasingly large) data series collections. Recent studies demonstrate that SAX-based indexes offer state-of-the-art performance for similarity search tasks. To facilitate the deployment of real-world data series similarity search components, query answering time estimation is essential for systematic throughput optimization and latency analysis. Deep learning techniques have recently been applied to database tuning and optimization. However, existing methods suffer from the problem that training deep neural network models requires large amounts of accurately labeled data, which are usually unaffordable in real-world applications. In this internship, we will exploit and develop state-of-the-art deep neural network models to estimate query answering times for data series similarity search with the iSAX family of indexes, with a focus on novel techniques for training data efficiency.
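
For readers unfamiliar with SAX (Symbolic Aggregate approXimation), the summarization on which SAX-based indexes build, here is a minimal sketch in plain Python of the textbook procedure: z-normalize the series, average it over equal-length segments (PAA), then map each average to a symbol using equiprobable breakpoints of the standard normal distribution. The function name `sax_word` and the example series are hypothetical, and the sketch assumes a non-constant series whose length is a multiple of the number of segments. It does not cover the iSAX index structure itself.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(series, segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word.

    Assumes len(series) is a multiple of `segments` and that the
    series is not constant (non-zero standard deviation).
    """
    # 1. Z-normalize the series.
    mu, sigma = mean(series), pstdev(series)
    normalized = [(value - mu) / sigma for value in series]

    # 2. Piecewise Aggregate Approximation: mean of each segment.
    size = len(normalized) // segments
    paa = [mean(normalized[i * size:(i + 1) * size]) for i in range(segments)]

    # 3. Breakpoints that cut the standard normal into equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # 4. Each segment gets the symbol of the region its mean falls into.
    return "".join(alphabet[sum(p > b for b in breakpoints)] for p in paa)


word = sax_word([1, 2, 3, 4, 5, 6, 7, 8], segments=4)
print(word)  # "abcd" for this steadily increasing series
```

A steadily increasing series maps to increasing symbols, so the eight values are summarized by a four-letter word.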

How to apply: Please send your resume and transcripts to Prof. Themis Palpanas (themis[at]mi.parisdescartes.fr)

Further information can be found here.

[POSITION FILLED] Transcriptomic Analysis using Intensive Randomization

Supervised by: Dorota Desaulle (MCF, UR 7537 BioSTM – Biostatistique, Traitement et Modélisation des données biologiques, Faculté de Pharmacie, Université Paris Cité)

Description: Next-generation sequencing such as RNA-seq aims to quantify the transcriptome of biological samples and compare gene expression between different experimental conditions. The quantification of the genome alignments stemming from such technologies yields relative measurements, which cannot be directly compared between conditions without adequate data normalization. No consensus has been reached to date on the optimal approach to normalizing such data (Abrams et al. 2019). Unfortunately, existing methods suffer from practical limitations and may be compromised by the presence of genes showing high expression levels or strong variability. In this case a single normalization procedure can lead to erroneous results and false conclusions. Therefore, a novel statistical framework for differential analysis in transcriptomics has been proposed (Desaulle et al. 2021), which is based on intensive iterative random data normalizations and provides good control of the statistical errors. At present, it is implemented in the R package DArand (Desaulle and Rozenholc 2021), which is publicly available from the Comprehensive R Archive Network. The current package is written in R and uses only CPU parallelization. Because of the large data size and the framework's reliance on intensive iterative randomizations, further project development requires more advanced programming. More precisely, the iterative procedure is computationally intensive and may rapidly become time-consuming as the size of the transcriptomic experiment and the number of samples grow. Therefore, the main mission of the internship will be to adapt the code for efficient parallel processing on a graphics processing unit (GPU) using CUDA. The computational optimization will play an important role in further methodological development. Indeed, the subsequent contribution will aim at extending the methodology from two to more biological conditions.
It will be directed towards statistical analysis with more than two conditions such as differential analysis, principal component analysis (PCA) and more generally unsupervised learning tools. Here the difficulty will be to preserve the iterative structure of the procedure with data normalization while combining results from different approaches in data analysis. The methodological aspects, the implementation and the validation will be followed by a real-data application involving miRNA data.

Requirements: The successful candidate should hold a master's degree in data science or computer science with knowledge related to statistics, machine learning or AI, and is also expected to interact with the researchers of the interdisciplinary teams throughout the internship. In addition, any of the following skills will be considered an advantage: good programming skills including GPU computing; strong interest in biology; advanced level in English.

How to apply: Please contact Dorota Desaulle (dorota.desaulle[at]u-paris.fr)

Further information can be found here.

[POSITION FILLED] Veracity assessment framework for discovering social activities in urban big datasets

Supervised by: Philipp Brandt (SciencesPo) and Soror Sahri (Université Paris Cité, LIPADE, diNo)

Description: Digital technologies provide access to datasets that have been unfamiliar to social scientists, including behavioral traces (e.g., point of sales, geolocation data, social media scrapings, CCTV recordings), machine-readable texts, and code and data repositories. These secondary data sources, produced without research goals in mind, require new technical skills and computing capacities to manage their scale and content. A particular recent trend for social scientists is to understand the potential of big data in complementing traditional research methods and their value in making decisions. Several major issues have to be closely investigated around big data in social sciences, including political polarization, viral information diffusion, and economic performance. The veracity and value characteristics of big data are the main concerns for social scientists. This master's internship will focus on urban data, particularly the NYC taxi dataset, to develop technical procedures that help social scientists deal with this and similar urban datasets. Social scientists have used the NYC dataset in the past and yet left many dimensions unexplored. Most problematically, they have not yet provided a technology that allows for fast, flexible data access and a strategy for ensuring the quality of the data. Once such an infrastructure is in place, the NYC taxi dataset can lead to a better understanding of core questions in the social sciences, such as economic decision-making and labor mobility, as well as a strategy for how social scientists can work with novel datasets.

Requirements: We are looking for a student in Master 2 or engineering school in computer science. The ideal candidate would have excellent programming skills, good knowledge of data management, and an interest in handling large amounts of data.

How to apply: Apply by emailing your detailed CV (including transcripts) to Soror Sahri (soror.sahri[at]parisdescartes.fr)

Further information can be found here.

Monitoring the seismic activity of Mayotte through image processing of fiber optic signals

Supervised by: Lise Retailleau (Institut de physique du globe de Paris, Observatoire Volcanologique du Piton de la Fournaise, Université Paris Cité), Laurent Wendling (Université Paris Cité, LIPADE), Arnaud Lemarchand (Institut de physique du globe de Paris, Université Paris Cité), and Camille Kurz (Université Paris Cité, LIPADE)

Description: The seismicity in Mayotte is monitored daily using the seismic stations installed on the island. Retailleau et al. (under review) developed a process to automatically detect and locate seismic events in real time. Using a neural-network-based method, we identified the main waves (P and S) that are generated by an earthquake, propagate through the earth and are recorded by the different seismic stations. The arrivals are then associated as an earthquake and used to locate the event. This process greatly helped to increase the number of earthquakes detected and located. However, the quality of the data recorded by the land stations suffers considerably from anthropogenic noise. In the last few years seismologists have increasingly used Distributed Acoustic Sensing (DAS) measurements. Using a DAS interrogator, a fiber optic cable can be used as a densely sampled seismic network line, equivalent to a seismic sensor every 10 m. Multiple studies have shown that these data can be used in various contexts, to analyze seismic signals and to study the structure of the Earth (Zhan, 2020). On land, DAS recordings have been used to analyze volcano-tectonic seismicity (Jousset et al., 2018) as well as anthropogenic signals (Lindsey et al., 2020). Measurements have also been made on fiber optic cables deployed on the seafloor, making it possible to observe its subsurface as well as the signals generated by ocean waves (Lindsey et al., 2019). The proposed work will be based on several incremental steps: as a first step, we plan to consider active contour models (Niu et al., 2017) to locate the waves from a ground truth (approximate localization of P and S). Such models are widely used to extract regions in noisy images (e.g. in biomedical images, Lidar). They often require key points to initialize the fitting process near the region of interest (in this case the wavefront). We will choose these points from a polygonal approximation calculated on the given curves S and P.
Secondly, we want to consider an automatic localization of the starting points by considering a progressive search area (vertical patch) from the left to the right using homogeneous area criteria. Due to the specificity of the wave – high density – we also propose to study two effective strategies to approximate waves (Haar wavelet and multi-scale edge detectors). The underlying idea is to provide a set of possible curves before running the active model process. Such methods could also be extended to process a series of consecutive wavefronts (fast events). Finally, we also plan to explore deep learning approaches (Khan et al., 2020) by considering ground truths (for instance from an expert evaluation of results achieved at the previous step) or an augmentation process by defining a set of possible wavefronts.

How to apply: Please contact Lise Retailleau (retailleau[at]ipgp.fr)

Further information can be found here.
