Funded projects in 2022

In November 2021, the Data Intelligence Institute of Paris (diiP) selected 17 interdisciplinary projects using data science and machine learning. They will run from January to December 2022, and consist of 14 master’s internships and 3 strategic projects. Learn more below.

Strategic Projects

Physics & Astronomy

Learning from deep sea light with KM3NeT

Other relevant disciplines: Biology, Earth Sciences/Geosciences

Neutrinos are fundamental particles that are produced in a multitude of nuclear processes and permeate our universe. Despite their abundance, neutrinos are extremely difficult to observe due to their very weak interaction with matter. Neutrinos produced in the atmosphere will typically traverse the whole Earth as if it wasn’t even there. In order to detect such elusive particles, the KM3NeT experiment is building gigantic arrays of photosensors submerged in the deepest regions of the Mediterranean, where few other particles can reach and the clear seawater provides a huge natural target for neutrino interactions.
On the rare occasions when these neutrinos interact inside or near the KM3NeT detectors, multiple charged particles are created, which in turn produce light as they travel through the seawater. By observing the pattern that these light signals leave in the detector, KM3NeT is able to reconstruct basic properties of the neutrino interactions such as energy, momentum, and flavour. Currently, these tasks are performed mostly by hand-crafted algorithms based on fundamental physics knowledge. The goal of this project is to enhance the capabilities of KM3NeT by exploring cutting-edge deep learning techniques to replace traditional reconstruction methods, pushing the boundaries of what is possible and enabling new areas of research with the KM3NeT infrastructure.

Key words: Neutrinos, Machine Learning, Bioluminescence, Graph Neural Networks

Project coordinator: Joao Coelho (Université Paris Cité)

Link to poster

Computer Science

Language-Independent Massive Network Attitudinal Embedding

Other relevant disciplines: Political Science

Public opinion on different issues of public debate is traditionally studied through polls and surveys. Recent advancements in network ideological scaling methods, however, show that digital behavioral traces in social platforms can be used to infer opinions at a massive scale. Current approaches make it possible to position social network users on ideological scales ranging from extreme left-wing to extreme right-wing stances. This is suited to two-party systems and binary social divides, as in the case of the US. However, no similar methodology exists for European (and other) settings, where public political debate is structured along several different issues and cleavages, and party systems impose a multi-polar landscape. To overcome this gap, we propose a novel approach that combines ideological scaling of social graphs with external attitudinal data. The former uses social network structures to embed users in spaces where dimensions are informative of ideological traits. The latter provides stances for a few users on attitudinal scales (from most opposed to most favorable) for a range of well-identified issues of public debate, e.g., taxation, immigration, European integration, environmental protection, and civil liberties. Combining the two, we aim to embed a massive number of users in attitudinal spaces where positions along several dimensions indicate opinions towards identifiable issues structuring public debate. Attitudinal embeddings have numerous applications explored within the project, from recommendation algorithm audit and design to the study of party systems and political competition online, polarization, and informational and media dynamics.

Key words: network embedding, sparse embedding, ideological scaling, attitudinal data, ideologies, polarization, party systems

Project coordinator: Pedro Ramaciotti-Morales (SciencesPo)

Link to poster


Studying the ability of teenagers to spot fake news over their usage-time on social networks

Other relevant disciplines: Computer Science

The spread of online fake news is emerging as a major threat to human society and democracy (Lazer et al., 2018). Now that anyone with access to a phone or computer can publish information online, it is getting harder to tell what is real and what is fake. The impact of manipulative online content on society should therefore not be underestimated. While considerable efforts have been invested to study the cognitive processes involved in media truth discernment, no study to date has investigated 1) the development of media truth discernment during adolescence and 2) how this key process is influenced by the time spent by adolescents on their smartphones and the nature of the applications they use. This is critical given that (a) 84% of 13- to 18-year-olds own a smartphone, (b) 81% spend more than 2 hours per day on screen media (18% between 2h and 4h, 33% between 4h and 8h and 29% more than 8h) and an hour a day watching online videos on platforms such as YouTube, (c) 39% of their time online is spent on social media, with Snapchat and Instagram being the most used platforms, (d) 70% log on to social media several times a day, and (e) the main source of news for adolescents is YouTube (Rideout et al., 2019). Furthermore, most teens get their news from their feeds, so they need to learn how to view the stories critically. Adolescence is typically defined as a transition period between childhood and adulthood (approximately ages 12-22 years) during which parental influence decreases and peers become more important. The period of adolescence begins with the physical, cognitive, and social changes occurring with the onset of puberty around 12 years of age (Crone et al., 2012). Cognitive and socio-affective development in adolescence is accompanied by extensive changes in the brain. Consequently, because of these cognitive and socio-affective specificities, adolescents might be at greater risk of believing fake news, in particular fake news shared on social media.
Thus, there is an urgent need to determine how media truth discernment develops from early adolescence (10 y.o.) to late adolescence (18-21 y.o.) and how both the usage times and the nature of the applications present on their smartphones affect media truth discernment. The project aims to bring together an interdisciplinary team from the fields of psychology and computer science.
The main objective of the project is to analyze the ability of teenagers to distinguish what’s real and what’s fake by detecting the fallacious arguments used by unscrupulous content writers on social networks.

Key words: fake news detection, symbolic AI

Project coordinator: Salima Benbernou (Université Paris Cité)

Link to poster

Master’s Projects


Optimization of a physical forcefield for simulations of noncoding RNA molecules

Other relevant disciplines: Computer Science, Physics/Astronomy

The importance of the study of RNA molecules has been highlighted by the recent pandemic, with the SARS-CoV-2 virus featuring an RNA-based genome and a replication mechanism controlled by non-coding RNA. The function of these molecules strictly depends on the 3D structure they adopt, but these structures are hard to obtain experimentally, and it is even harder to study how a structure changes in response to the environment. Computational modeling using dedicated force fields can provide a coherent view of the molecule, which can follow this dynamical behavior and include the effect of the environment.
Inspired also by the recent success of AlphaFold for protein folding, we plan to use ML to improve our description of the molecule in order to predict structures. The main goal of this project is the optimization of our RNA model, HiRE-RNA, through ML to obtain a cutting-edge RNA force field to facilitate building functional three-dimensional structures for RNA molecules. We will employ machine learning to optimize the model, making extensive use of the structural data available in the Nucleic Acids Database (NDB) and the sparse thermodynamic and dynamic data available from experiments. This approach will allow our model to give much more accurate and reliable structural predictions and to be deployed on systems of more complex architectures than currently possible. Our aim here is to anchor our force field model deep into the corresponding physics by adapting recent and promising Symbolic Regression algorithms to our data format and selecting, based on sound physical principles, the possible improvements in the functional form of the force field uncovered by this technique.
The M2 internship will be the first step of a larger project in which we propose first to use the existing functional form of the force field and train its 100+ coefficients, and then to build upon the ML pipeline developed in the first step to learn additional terms of the force field.

Key words: RNA modeling, force-field optimization, biomolecular simulations

Project coordinator: Samuela Pasquali (Université Paris Cité)

Link to poster

Computer Science

Monitoring the seismic activity of Mayotte through image processing of fiber optic signals

Other relevant disciplines: Computer Science, Mathematics/Statistics, Earth Sciences/Geosciences

The seismicity in Mayotte is monitored daily using the seismic stations installed on the island. Retailleau et al. (under review) developed a process to automatically detect and locate seismic events in real time. Using a neural-network-based method we identified the main waves (P and S) that are generated by an earthquake and then propagate through the earth and are recorded by the different seismic stations. The arrivals are then associated with an earthquake and are used to locate the event. This process greatly helped to increase the number of earthquakes detected and located. However, the quality of data recorded by the land stations suffers a lot from anthropogenic noise. In the last few years seismologists have increasingly used Distributed Acoustic Sensing (DAS) measurements. Using a DAS interrogator, a fiber optic cable can be used as a densely sampled seismic network line, equivalent to a seismic sensor every 10 m. Multiple studies showed that this data can be used in various contexts, to analyze seismic signals and to study the structure of the Earth (Zhan, 2020). On land, DAS recordings have been used to analyze volcano-tectonic seismicity (Jousset et al., 2018) as well as anthropogenic signals (Lindsey et al., 2020). Measurements have also been made on fiber optic cables deployed on the seafloor, making it possible to observe its subsurface as well as the signals generated by ocean waves (Lindsey et al., 2019). The proposed work will be based on several incremental steps: as a first step, we plan to consider active contour models (Niu et al., 2017) to locate the waves from a ground truth (approximate localization of P and S). Such models are widely used to extract regions in noisy images (e.g. in biomedical images, Lidar). They often require key points to initialize the fitting process near the region of interest (in this case the wavefront). We will choose these points from a polygonal approximation calculated on the given curves S and P.
Secondly, we want to consider an automatic localization of the starting points by considering a progressive search area (vertical patch) from the left to the right using homogeneous area criteria. Due to the specificity of the wave – high density – we also propose to study two effective strategies to approximate waves (Haar wavelet and multi-scale edge detectors). The underlying idea is to provide a set of possible curves before running the active model process. Such methods could also be extended to process a series of consecutive wavefronts (fast events). Finally, we also plan to explore deep learning approaches (Khan et al., 2020) by considering ground truths (for instance from an expert evaluation of results achieved at the previous step) or an augmentation process by defining a set of possible wavefronts.

Key words: Earthquake, fiber optics, image analysis, event detection

Project coordinator: Lise Retailleau (Université Paris Cité)

Link to poster

Transcriptomic Analysis using Intensive Randomization

Other relevant disciplines: Mathematics/Statistics, Biology

Next-generation sequencing such as RNA-seq aims to quantify the transcriptome of biological samples and compare gene expression between different experimental conditions. The quantification of the genome alignments stemming from such technologies yields relative measurements, which cannot be directly compared between conditions without adequate data normalization. No consensus has yet been reached on the optimal approach to normalizing such data (Abrams et al. 2019). Unfortunately, existing methods suffer from practical limitations and may be compromised by the presence of genes showing a high expression level or strong variability. In this case a single normalization procedure can lead to erroneous results and false conclusions. Therefore, a novel statistical framework for differential analysis in transcriptomics has been proposed (Desaulle et al. 2021), which is based on intensive iterative random data normalizations and provides good control of the statistical errors. At present, it has been implemented in the R package DArand (Desaulle and Rozenholc 2021) and is publicly available from the Comprehensive R Archive Network. The current package is written in the R language and uses only CPU parallelization. Due to the large data size and the framework based on intensive iterative randomizations, further project development requires more advanced programming. More precisely, the iterative procedure uses intensive computations and may rapidly become time-consuming as both the size of the transcriptomic experiment and the number of samples grow. Therefore, the main mission during the internship will consist in adapting the code for efficient parallel processing on a graphics processing unit (GPU) using CUDA. The computational optimization will play an important role in further methodological development. Indeed, the subsequent contribution will aim at extending the methodology from two to more biological conditions.
It will be directed towards statistical analysis with more than two conditions, such as differential analysis, principal component analysis (PCA) and, more generally, unsupervised learning tools. Here the difficulty will be to preserve the iterative structure of the procedure with data normalization while combining results from different approaches in data analysis. The methodological aspects, the implementation and the validation will be followed by a real-data application involving miRNA data.

Key words: Big data, RNA-seq, computational optimization, GPU parallelization, differential analysis

Project coordinator: Dorota Desaulle (Université Paris Cité)

Veracity assessment framework for discovering social activities in urban big datasets

Other relevant disciplines: Political Science, Economics, Sociology

A recent trend among social scientists is to examine the potential of big data to complement traditional research methods and its value in making decisions. Several major issues have to be closely investigated around big data in social sciences, including political polarization, viral information diffusion, and economic performance. The veracity and value characteristics of big data are the main concerns for social scientists. This master’s internship will focus on urban data, particularly the NYC taxi dataset, to develop technical procedures that help social scientists deal with this and similar urban datasets. Social scientists have used the NYC dataset in the past and yet left many dimensions unexplored. Most problematically, they have not yet provided a technology that allows for fast, flexible data access, nor a strategy for ensuring the quality of the data. Once such an infrastructure is in place, the NYC taxi dataset can lead to a better understanding of core questions in the social sciences, such as economic decision-making and labor mobility, as well as a strategy for how social scientists can work with novel datasets. In this work, we will study data quality issues in the NYC taxi dataset by considering data inconsistencies, inaccuracies, and incompleteness. We will propose a veracity assessment model with a veracity score calculus and veracity assessment approaches that correlate the NYC taxi data veracity to their various business queries without repairing data.

Key words: veracity, value, big data, quality, social scientists

Project coordinator: Soror Sahri (Université Paris Cité)

Link to poster

Combining visual and textual information for enhancing image retrieval systems in radiological practices

Other relevant disciplines: Medicine

The field of diagnostic imaging in Radiology has experienced tremendous growth both in terms of technological development (with new modalities such as MRI, PET-CT, etc.) and market expansion. This leads to an exponential increase in the production of imaging data, turning the diagnostic imaging task into a big data challenge. However, the production of a large amount of data does not automatically allow the real exploitation of its intrinsic value for healthcare. In modern hospitals, all imaging data acquired during clinical routines are stored in a picture archiving and communication system (PACS). A PACS is a medical imaging technology providing economical storage and convenient access to images from multiple modalities. Digital images linked to patient examinations are often accompanied by a medical report in text format, summarizing the radiologist’s report and the clinical data associated with the patient (age, sex, medical history, report of previous examinations, etc.). The problem with PACS is that they were primarily designed for archival purposes and not for image retrieval. They therefore only allow a search by keywords (name of the patient, date of the examination, type of examination, etc.) and not by pathology or by image content, and they cannot serve as a diagnostic aid when the doctor is confronted with an image that is difficult to interpret or shows a rare pathology. The objective of this project is to combine current research in computer vision and AI to implement a method that makes it possible to query a PACS with example images in order to search for images containing similar pathological cases, and thereby to benefit radiologists as a potential decision-making aid during hospital routines.

Key words: Case Retrieval, Machine Learning, Contrastive Learning, Variational AutoEncoders

Project coordinator: Florence Cloppet (Université Paris Cité)

Link to poster

Investigating regulatory B cell differentiation and their therapeutic effect in neuroinflammatory disease through single cell analyses and computational biology

Other relevant disciplines: Mathematics/Statistics, Biology, Neuroscience

Abstract will be updated soon.

Key words: Systems Biology, Artificial Intelligence, Computational Sciences, Cell therapy, cell differentiation, neuroinflammation, high content single cell analyses

Project coordinator: Simon Fillatreau (Université Paris Cité)

Earth Sciences and Geosciences

Automatic detection and location of hydro-acoustic signals linked to Mayotte submarine eruption

Other relevant disciplines: Computer Science, Mathematics/Statistics, Earth Sciences/Geosciences

Submarine volcanic eruptions generate numerous seismic and hydro-acoustic signals (West Mata, Axial Seamount) and Mayotte is no exception. Ocean Bottom Seismometers (OBS) have been deployed since 2019 to continuously monitor the eruption. They record earth motion on a three-component geophone and sound propagating in the water column through a hydrophone. Manual analysis of continuous OBS data has highlighted the recording of very short hydro-acoustic signals. Locating a few tens of them has shown that they can be associated with active lava flows, as in previous studies at Axial Seamount.

The goal of this internship is to automatically detect and locate the hydro-acoustic signals on the continuous OBS recordings since the first deployment in February 2019. The proposed work is to first use template-matching algorithms to detect those weak signals in the continuous data record. This will first make it possible to assess how the level of hydro-acoustic activity evolved over the course of the eruption. Next, those signals will be located and the location accuracies will be assessed using the numerous water sound velocity measurements performed during marine surveys. Finally, the locations will be compared and associated with the lava flows identified during bathymetry surveys.
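
As an illustration of the template-matching step, here is a minimal sketch based on normalized cross-correlation. It is not the project's detection pipeline: the synthetic trace, the decaying-sinusoid template, the onsets at samples 300 and 700, and the 0.8 threshold are all assumptions chosen for the example.

```python
import numpy as np

def normalized_xcorr(data, template):
    """Pearson correlation between the template and every window of the data."""
    n = len(template)
    # Pre-scale the template so that the dot product below is a correlation.
    t = (template - template.mean()) / (template.std() * n)
    windows = np.lib.stride_tricks.sliding_window_view(data, n)
    w_mean = windows.mean(axis=1, keepdims=True)
    w_std = windows.std(axis=1)
    w_std[w_std == 0] = np.inf  # a flat window yields a correlation of 0
    return ((windows - w_mean) @ t) / w_std

# Synthetic stand-in for a continuous hydrophone trace (hypothetical data).
rng = np.random.default_rng(0)
time = np.linspace(0.0, 1.0, 50)
template = np.sin(2 * np.pi * 5 * time) * np.exp(-3 * time)
data = 0.05 * rng.standard_normal(1000)
for onset in (300, 700):  # bury two copies of the template in the noise
    data[onset:onset + 50] += template

cc = normalized_xcorr(data, template)
hits = np.flatnonzero(cc >= 0.8)  # window start indices above the threshold
```

Because the correlation is normalized by the local standard deviation, the threshold does not depend on the absolute amplitude of the signal, which is one reason this kind of detector is often used for weak, repetitive events.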

Key words: Mayotte, hydro-acoustic event, template matching

Project coordinator: Jean-Marie Saurel (Université Paris Cité)

Link to poster


The diffusion of technology during the last five Millennia

Other relevant disciplines: Computer Science, History, Digital Humanities

Technology diffusion is often considered to be one of the most important drivers of economic growth, but its drivers and quantitative importance are still not well understood. In this project we study technology diffusion over the very long run, through the lens of a vast newly created database of records of artefacts in museum collections. Each record contains information on a date, a place (findspot or production place), as well as information about technology embodied in the object through description of the item or its materials. We harmonize and code dates and locations using gazetteers assembled by scholars in the digital humanities. To reduce the dimensionality of object descriptions, we use NLP techniques to link the materials, techniques, and objects to entities from controlled vocabularies. Preliminary results show that the approach scales well and is able to match the emergence of several key technologies in human history. Ultimately, our objective is to study the change in the spatial distributions of technologies in relation to trade, migration, and political changes.

Key words: Technology Diffusion, Growth, Trade, Cultural Heritage Metadata

Project coordinator: Johannes Boehm (SciencesPo)

Mathematics and Statistics

Multiple imputation for heterogeneous biological data

Other relevant disciplines: Computer Science, Biology, Medicine

Identifying clusters of individuals in heterogeneous data is a classical task in data mining. However, many cluster analysis methods do not address missing data. The topic of missing data is widely studied in the literature, but work on it remains limited in the context of clustering.

Multiple imputation is an efficient method to tackle the missing data issue, notably in the regression framework, but also recently in clustering. Its principle consists in replacing each missing value by several plausible values. Denoting this number by M, it consists in generating M imputed (i.e. completed) data sets. In the linear regression framework, regression coefficients are then estimated from each imputed data set, leading to M sets of regression coefficients. Finally, these estimates are pooled using the so-called Rubin’s rules, providing a unique point estimate as well as a unique estimate of its associated standard error.
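
The pooling step can be sketched for a single coefficient as follows. This is a minimal illustration of Rubin's rules, not code from the project; the function name `pool_rubin` and the three example estimates and variances are made up for the example.

```python
import math

def pool_rubin(estimates, variances):
    """Pool M estimates of one coefficient and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = sum(estimates) / m                               # pooled point estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b                      # total variance
    return q_bar, math.sqrt(total_var)

# Hypothetical results from M = 3 imputed data sets.
estimates = [1.0, 1.2, 0.8]
variances = [0.04, 0.05, 0.03]
pooled, se = pool_rubin(estimates, variances)
```

The between-imputation term is what distinguishes the pooled standard error from a simple average: it inflates the uncertainty to reflect that the missing values were never observed.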

One classical way for generating imputed data consists in imputing variables one by one according to univariate regression models. This technique is well-known from the medical field under the name fully conditional specification and very popular beyond this field. However, such methods are not tailored to deal with heterogeneous data. Some works have been recently done in this line, but they remain limited.

Thus, in this project we propose a novel fully conditional specification method based on clusterwise linear regression instead of linear regression. Clusterwise regression methods boil down to building a collection of local regression models so that each group of individuals is associated with a specific regression model. In this work, the novel multiple imputation method is presented. Its properties are theoretically studied and assessed by an extensive simulation study.

From a practical point of view, this method offers practitioners an efficient way to tackle complex data in their medical studies. More generally, this multiple imputation method can be valuable for any clustering analysis with missing values (since clustering implicitly assumes heterogeneity between individuals) and thus covers a large spectrum of applications in data science.

Key words: Multiple imputation, complex data, missing values, biology, heterogeneity, sequential imputation, clusterwise regression

Project coordinator: Matthieu Resche-Rigon (Université Paris Cité)

Modeling genetic pleiotropy using machine learning to understand the human genetic architecture

Other relevant disciplines: Computer Science, Biology, Medicine

One concept is currently resurging in human genetics: pleiotropy. Pleiotropy occurs when one genetic element (e.g. variant, gene) has independent effects on several traits. Although pleiotropy is extremely common and thought to play a central role in the genetic architecture of human complex traits and diseases, it is one of the least understood phenomena. We have shown that several biological mechanisms exist and induce different pleiotropy states at the level of the variants. Specifically, we have conceptualized 5 biological mechanisms: 1) linkage disequilibrium; 2) causality between traits; 3) genetic correlation between traits; 4) high polygenicity of traits; 5) horizontal pleiotropy (true independent effects of a variant on two traits). This internship will be dedicated to building a comprehensive framework to disentangle all 5 states of pleiotropy and provide a genome-wide map of pleiotropy using machine learning. Specifically, we propose 1) to improve on a method that we have published in a proof-of-concept paper using unsupervised approaches based on penalized methods, random forests or deep learning; 2) to explore semi-supervised learning using a creative strategy to label data that we have developed.

Key words: statistical genetics; pleiotropy; complex traits and diseases; post-genomic data; machine learning; GPU programming

Project coordinator: Marie Verbanck (Université Paris Cité)


Learning-based EEG Epilepsy Detection and Analysis

Other relevant disciplines: Computer Science

The electroencephalogram (EEG) is one of the most common and essential medical signals collected by neuroscientists for the analysis of neurological diseases. With the rapid development of medical instruments and data collection techniques, EEG analysis has also witnessed dramatic progress. One important problem in EEG analysis is epilepsy pattern detection and analysis. Epilepsy is a brain disease generally associated with seizures, which degrades the quality of life of many patients. This internship aims to design effective deployment schemes for modern deep learning techniques in EEG epilepsy detection, with a focus on real-world applications for neuroscientists.

Key words: EEG, epilepsy, epileptic pattern detection and analysis

Project coordinator: Jérôme Cartailler (Université Paris Cité)

Link to poster


Tracking auto-immune diseases in electronic health records

Other relevant disciplines: Mathematics/Statistics, Biology

The collection of biomedical data requires a multitude of software tools for purposes ranging from websites for patients’ follow-ups to alarm algorithms for medical monitoring by physicians or large-scale collection of data for epidemiological analyses. Yet many more opportunities remain to leverage technology to improve patient care. To date, a vast amount of human and technical effort has already been spent on genetic disorders, and the emergence of deep-sequencing analyses of patients’ samples has led to significant progress. However, rare auto-immune diseases, which can affect multiple organs simultaneously, have yet to be thoroughly investigated and effectively reported in electronic health records.

This project aims to create and describe new tools to obtain a comprehensive view of all autoimmune diseases in the scientific literature and of how they can be represented in electronic health records. The project will focus on curation and extraction of data using several databases (ORDO, PubMed and medical records). This work, already initiated, will be done in collaboration with the Assistance Publique – Hôpitaux de Paris (AP-HP), the rare disease nomenclature expert Orphanet, and the Human Phenotype Ontology (HPO).

Key words: Rare diseases, Information retrieval, Knowledge graph, NLP, data mining, healthcare databases

Project coordinator: Maud de Dieuleveult (Université Paris Cité)

Link to poster

Physics & Astronomy

Search for features in astrophysical objects close to cosmic neutrinos: An indirect approach to cosmic neutrino association with astrophysical objects

Other relevant disciplines: Computer Science

The work proposed here is in the field of Astroparticle Physics, a sub-branch of Physics dealing with the understanding of the Universe through the detection of gamma rays, neutrinos, gravitational waves and cosmic rays. In particular, here, we focus on a search for a connection between high-energy neutrinos and gamma rays in the extragalactic sky. Two large observatories have been designed to be able to detect high-energy neutrinos from astrophysical environments: IceCube and KM3NeT. IceCube has already collected 10 years of data, which resulted in a catalogue of neutrinos having a high probability of being of cosmic origin, while KM3NeT is an observatory under construction. The significance of the IceCube cosmic neutrino signal is such that no firm conclusion can yet be drawn on the association of these neutrinos with astrophysical objects. This Master’s project concerns an indirect search for neutrino associations with astrophysical objects using a statistical inference approach, taking advantage of the published neutrino lists, catalogues of astrophysical objects, and open data from the Fermi observatory. Technically, the project requires the development of a full Python analysis chain using Deep Learning.

Key words: Neutrino and Gamma-Ray Astronomy, Deep Learning, Real Data Analysis, Data Augmentation

Project coordinator: Yvonne Becherini (Université Paris Cité)

Link to poster

Random projections for the reduction of gravitational wave template banks

Other relevant disciplines: Computer Science

The direct observation of gravitational waves by the LIGO and Virgo detectors is one of the breakthrough discoveries of the beginning of the 21st century. Matched filtering techniques have proven to be very effective for searching for gravitational signals in the LIGO and Virgo data (dominated by instrumental noise). Matched filtering consists in correlating the data with a set of waveform templates (called a bank) that provides a discrete sampling covering the entire relevant astrophysical parameter space. Template banks capable of covering the accessible parameter space typically include 500,000 waveforms. Correlating with this many templates thus represents a large computing cost. The goal of this work is to investigate ways to reduce the gravitational-wave template banks through random projections and to perform this reduction efficiently using Optical Processing Units.
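
To make the idea concrete, here is a toy sketch of how a Gaussian random projection approximately preserves the correlation between data and templates while shortening every template. It is a heavy simplification under stated assumptions: the "templates" are random unit vectors rather than physical waveforms, the noise is white, correlations are taken at zero time lag only, and the sizes (200 templates, 4096 samples, 512 projected dimensions) are arbitrary. The projection is computed here with NumPy, not on an optical device.

```python
import numpy as np

rng = np.random.default_rng(42)
n_templates, n_samples, k = 200, 4096, 512  # hypothetical sizes

# Toy bank: random unit-norm vectors standing in for waveform templates.
bank = rng.standard_normal((n_templates, n_samples))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)

# Gaussian random projection from n_samples down to k dimensions.
projection = rng.standard_normal((n_samples, k)) / np.sqrt(k)
bank_small = bank @ projection  # reduced bank, computed once

# Simulated data: template 17 plus white noise.
data = bank[17] + 0.01 * rng.standard_normal(n_samples)

scores_full = bank @ data                         # correlation in the full space
scores_small = bank_small @ (data @ projection)   # correlation in the reduced space

best_full = int(np.argmax(scores_full))
best_small = int(np.argmax(scores_small))
```

In this toy setting both searches pick out the injected template, while each correlation in the reduced space costs k instead of n_samples multiplications. The projection of the data itself remains a dense matrix product, which is the kind of operation the project proposes to offload to Optical Processing Units.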

Key words: gravitational-wave astronomy, data reduction, random projections, optical processing units

Project coordinator: Eric Chassande-Mottin (Université Paris Cité)
