Data Science Seminar
The LTCI Data Science Seminar is a joint research seminar between the DIG and S2A teams. It focuses on machine learning and data science topics.
June 6, 2019
The seminar took place from 2PM to 4PM (room C49), and featured two talks:
You can download the slides of this talk.
Abstract: We consider the representation of objects of the real world in a Euclidean space using their quantitative properties as coordinates, which is referred to as multivariate data. In the presence of several coordinates (or variables), no natural way to compare the objects (or observations) exists. In 1975, John W. Tukey suggested a novel way of ordering multivariate data according to their centrality. To achieve this, he defined what is nowadays known as the Tukey depth (also called halfspace or location depth), which measures how deep an observation lies in the data cloud, thus ranks the data with respect to their degree of centrality, and yields a family of trimmed convex regions in Euclidean space. As they are visual, affine invariant/equivariant, and robust, the Tukey depth and its regions are useful tools in nonparametric multivariate analysis. However, their practical use in applications has so far been impeded by the lack of efficient computational procedures when the number of variables exceeds two.
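To make the definition concrete: the Tukey depth of a point is the minimum, over all closed halfspaces whose boundary passes through that point, of the number of data points the halfspace contains. A minimal sketch (a hypothetical helper, not one of the speaker's algorithms) estimates it by scanning random directions, which yields an upper bound in any dimension:

```python
import numpy as np

def tukey_depth_approx(p, X, n_dirs=10_000, seed=0):
    """Monte-Carlo upper bound on the Tukey (halfspace) depth of p in X.

    Scanning only finitely many random directions can miss the
    minimizing halfspace, so the result is an upper bound on the true
    depth that tightens as n_dirs grows.
    """
    rng = np.random.default_rng(seed)
    V = np.asarray(X, float) - np.asarray(p, float)    # data recentred at p
    U = rng.standard_normal((n_dirs, V.shape[1]))      # random directions
    # counts[k] = number of points in the halfspace {x : u_k . (x - p) >= 0}
    counts = (V @ U.T >= 0).sum(axis=0)
    return int(counts.min())
```

For instance, with the four corners of the unit square as data, the centre of the square has depth 2 (every halfplane through it contains two corners), while a point far outside the square has depth 0.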
First, we suggest a theoretical framework for computing the exact value of the halfspace depth of a point w.r.t. a data cloud in a Euclidean space of arbitrary dimension. Based on this framework, a whole class of algorithms can be derived, three variants of which are studied in more detail. All of these algorithms can deal with data that are not in general position, and even with data that contain ties. As our simulations show, all proposed algorithms prove to be very efficient.
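In two dimensions, an exact value can be obtained combinatorially: a minimizing halfplane can always be rotated until its boundary passes through a data point, so only the directions perpendicular to the vectors from the query point to each data point need to be checked. The sketch below (an illustration of this classical idea, not one of the three algorithms presented in the talk) simulates the infinitesimal rotation off each candidate direction with a tie-break term, so it also handles ties and points coinciding with the query point:

```python
import numpy as np

def tukey_depth_2d(p, X):
    """Exact Tukey (halfspace) depth of a point p w.r.t. the rows of X in 2D."""
    V = np.asarray(X, float) - np.asarray(p, float)
    at_p = np.all(V == 0, axis=1)          # data points coinciding with p
    n_at_p = int(at_p.sum())               # they lie in every halfplane
    V = V[~at_p]
    if len(V) == 0:
        return n_at_p
    best = len(V)
    for v in V:
        # candidate boundary directions: both perpendiculars of v
        for u in (np.array([-v[1], v[0]]), np.array([v[1], -v[0]])):
            s = V @ u
            # w picks which side of the boundary the infinitesimal
            # rotation moves the points that lie exactly on it
            for w in (v, -v):
                t = V @ w
                count = int(np.sum((s > 0) | ((s == 0) & (t >= 0))))
                best = min(best, count)
    return best + n_at_p
```

On the four corners of the unit square, the centre has depth 2, a corner (which is itself a data point) has depth 1, and an outside point has depth 0. The double loop makes this O(n^2); the talk's algorithms target the much harder higher-dimensional case.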
Second, using similar ideas, we construct an algorithm to compute a Tukey trimmed region that is much faster than the known ones. Also, a strict bound on the number of facets of a Tukey region is derived. We explore both the speed and the precision of the algorithm in a large simulation study. Finally, the approach is extended to an algorithm that calculates the innermost Tukey region and its barycenter, the Tukey median.
Bio: Pavlo Mozharovskyi joined Télécom Paris as an Assistant Professor in 2018. After finishing his studies at the Kyiv Polytechnic Institute in automation control and informatics, he obtained a PhD at the University of Cologne in 2014, where he conducted research in nonparametric and computational statistics and classification. He then spent a year as a postdoc at Agrocampus Ouest in Rennes, with the Centre Henri Lebesgue, working on the imputation of missing values, before joining the CREST laboratory at the National School of Statistics and Information Analysis. His main research interests lie in the areas of statistical data depth functions, classification, computational statistics, robust statistics, missing values, and data envelopment analysis.
You can download the slides of this talk.
Abstract: Treewidth is a parameter that measures how tree-like a data instance is, and hence whether it can reasonably be decomposed into a data structure resembling a tree. Many computational tasks are known to be tractable on data of small treewidth, but computing the treewidth of a given instance is itself intractable. This talk presents the first large-scale experimental study of treewidth and tree decompositions of real-world data, with a focus on graph data. We aim to find out which data, if any, can benefit from the wealth of algorithms for data of small treewidth. For each dataset, we obtain upper and lower bounds on its treewidth and study the properties of its tree decompositions. We show in particular that, even when the treewidth is high, partial tree decompositions can yield data structures that assist algorithms.
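As a small illustration of treewidth upper bounds in practice (using NetworkX's approximation module, not the tooling of the study itself), the min-degree elimination heuristic returns an upper bound on the treewidth together with the tree decomposition it built:

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# A 6-cycle has treewidth exactly 2, and the heuristic recovers it here.
G = nx.cycle_graph(6)
width, decomposition = treewidth_min_degree(G)

# `width` is an upper bound on the true treewidth; the decomposition's
# nodes are "bags" (frozensets of vertices of G) arranged in a tree.
print(width)
largest_bag = max(decomposition.nodes(), key=len)
print(len(largest_bag))   # at most width + 1, by definition of width
```

The returned decomposition is itself a NetworkX graph, so the properties studied in the talk (bag sizes, shape of the tree) can be inspected directly.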
Bio: Silviu Maniu has been an associate professor of computer science in the LaHDAK team of LRI at Université Paris-Sud in Orsay, France, since September 2015. Before that, he was a researcher at Noah’s Ark Lab of Huawei in Hong Kong and, between 2012 and 2014, a postdoctoral fellow in the Department of Computer Science of the University of Hong Kong. He received his Ph.D. in computer science from Télécom Paris in 2012 and his Dipl.Ing. degree from the Politehnica University of Timisoara, Romania, in 2005. His research interests are centered on data mining on the Web, with a focus on uncertain and social data.