Ph.D
Group : Bioinformatics
Passage à l'échelle, propriétés et qualité des algorithmes de classements consensuels pour les données biologiques massives
Starts on 01/10/2017
Advisor : COHEN-BOULAKIA, Sarah
Funding : Contrat doctoral uniquement recherche
Affiliation : Université Paris-Saclay
Laboratory : LRI - BioInfo
Defended on 14/06/2021, committee :
Directrice de thèse :
- Sarah Cohen-Boulakia Professeure, LISN, Université Paris-Saclay
Rapporteurs et examinateurs :
- Guillaume Fertin, Professeur, LS2N (Laboratoire des Sciences du Numérique de Nantes), Université de Nantes
- Sylvie Hamel, Professeure, Département d’informatique et de recherche opérationnelle, Université de Montréal, Canada
Examinateurs :
- Mokrane Bouzeghoub, Professeur, DAVID, UVSQ, Université Paris-Saclay
- Miguel Couceiro, Professeur, LORIA (Laboratoire lorrain de Recherche en Informatique et ses Applications), Université de Lorraine
- Gaëlle Lelandais, Professeure, I2BC (Institut de biologie intégrative de la cellule), Université Paris-Saclay
- Stéphane Vialette, Directeur de Recherche CNRS, LIGM (Laboratoire d’Informatique Gaspard-Monge), Université Gustave Eiffel
Co-encadrant, jury invité :
- Alain Denise, Professeur, LISN, Université Paris-Saclay
Jury invité :
- Adeline Pierrot, Maître de Conférences, LISN, Université Paris-Saclay
Research activities :
Abstract :
Biologists and physicians regularly query public biological databases, for example when they are looking for the most associated genes towards a given disease. The chosen keyword are particularly important: synonymous reformulations of the same disease (for example "breast cancer" and "breast carcinoma") may lead to very different rankings of (thousands of) genes. The genes, sorted by relevance, can be tied (equal importance towards the disease). Additionally, some genes returned when using a first synonym may be absent when using another synonym. The rankings are then called "incomplete rankings with ties". The challenge is to combine the information provided by these different rankings of genes. The problem of taking as input a list of rankings and returning as output a so-called consensus ranking, as close as possible to the input rankings, is called the "rank aggregation problem". This problem is known to be NP-hard. Whereas most works focus on complete rankings without ties, we considered incomplete rankings with ties. Our contributions are divided into three parts. First, we have designed a graph-based heuristic able to divide the initial problem into independent sub-problems in the context of incomplete rankings with ties. Second, we have designed an algorithm able to identify common points between all the optimal consensus rankings, allowing to provide information about the robustness of the provided consensus ranking. An experimental study on a huge number of massive biological datasets has highlighted the biological relevance of these approaches. Our last contribution the following one : we have designed a parameterized model able to consider various interpretations of missing data. We also designed several algorithms for this model and did an axiomatic study of this model, based on social choice theory.