DPM Distributed Clustering

This code implements massively distributed clustering for multivariate and functional data (code available on GitHub).

High Dimensional Data Clustering by means of Distributed Dirichlet Process Mixture Models: Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia, IEEE International Conference on Big Data (IEEE BigData), Dec 2019, Los Angeles, United States.

Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution: Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia, SAC 2019 – 34th Symposium On Applied Computing, Apr 2019, Limassol, Cyprus, pp. 502-509.

Dirichlet Process Mixture (DPM) clustering can be illustrated with the Chinese restaurant process.

Correspondence between the restaurant metaphor and statistical terms:
a table = a cluster
a client = an observation linked to a cluster label
a dish = parameters of a cluster
menu = space of all possible clusters
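
To make the metaphor concrete, here is a minimal sketch of the Chinese restaurant process as a prior over partitions. It is an illustration only, not the code of this project: the function name, the concentration parameter `alpha` and the seed are assumptions introduced for the example. Each client joins an existing table with probability proportional to the number of clients already seated there, or opens a new table with probability proportional to `alpha`.

```python
import random

def chinese_restaurant_process(n_clients, alpha, seed=0):
    """Seat n_clients one by one; return each client's table and the table sizes."""
    rng = random.Random(seed)
    table_sizes = []  # table_sizes[k] = number of clients at table k (cluster size)
    seating = []      # seating[i] = table index (cluster label) of client i
    for _ in range(n_clients):
        # Existing table k has weight n_k; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # the client opens a new table
        else:
            table_sizes[k] += 1     # the client joins an existing table
        seating.append(k)
    return seating, table_sizes

seating, table_sizes = chinese_restaurant_process(n_clients=20, alpha=1.0)
print(seating)
print(table_sizes)
```

Larger values of `alpha` tend to open more tables, which is how the number of clusters is left free to grow with the data instead of being fixed in advance.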

	DC-DPM	HD4C
Dressed table	Likelihood	Likelihood GP
New table	Predictive	TD approximation of the predictive
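
The table above can be read as the two kinds of weights used when assigning a client. The sketch below illustrates this for the simplest case and is not the project's implementation: it assumes one-dimensional Gaussian clusters with known variance `sigma2` and a Gaussian prior `N(mu0, tau2)` on the cluster mean, and all names and default values are hypothetical. A dressed table is weighted by its size times the likelihood of the observation under its dish; a new table is weighted by `alpha` times the prior predictive, that is, the likelihood with the dish integrated out.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def assignment_probabilities(x, tables, alpha, sigma2=1.0, mu0=0.0, tau2=10.0):
    """tables is a list of (size, mean) pairs, one per dressed table.

    Returns one probability per dressed table, followed by the probability
    of opening a new table.
    """
    # Dressed table k: cluster size times likelihood of x under its parameters.
    weights = [n_k * normal_pdf(x, mu_k, sigma2) for n_k, mu_k in tables]
    # New table: alpha times the prior predictive N(mu0, sigma2 + tau2).
    weights.append(alpha * normal_pdf(x, mu0, sigma2 + tau2))
    total = sum(weights)
    return [w / total for w in weights]

probs = assignment_probabilities(x=4.8, tables=[(10, 0.0), (5, 5.0)], alpha=1.0)
print(probs)
```

In this example the observation 4.8 is most likely to join the second table (mean 5.0), even though the first table is larger, because the likelihood term dominates the size term.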

The workflow of our DC-DPM approach consists of four steps:

1. Identify local new clusters in the workers
2. Compute and send sufficient statistics and cluster sizes from each worker to the master
3. Synchronize and estimate cluster labels from sufficient statistics
4. Send updated cluster parameters and cluster sizes from master to workers
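
The four steps above can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the DC-DPM implementation: the data are one-dimensional, the workers are plain Python lists, and both the local clustering and the master synchronization use a nearest-mean rule with a hypothetical `radius` threshold as a stand-in for the DPM sampling steps. The names `SuffStats`, `worker_step` and `master_step` are invented for the example. The point it shows is that only sufficient statistics and cluster sizes travel between workers and master, never the raw observations.

```python
from dataclasses import dataclass

@dataclass
class SuffStats:
    n: int = 0       # cluster size
    s: float = 0.0   # sum of observations
    ss: float = 0.0  # sum of squared observations

    def add(self, x):
        self.n += 1
        self.s += x
        self.ss += x * x

    def merge(self, other):
        self.n += other.n
        self.s += other.s
        self.ss += other.ss

    @property
    def mean(self):
        return self.s / self.n

def worker_step(chunk, global_means, radius=1.0):
    """Steps 1-2: open local new clusters where needed, return sufficient statistics."""
    means = list(global_means)
    stats = [SuffStats() for _ in means]
    for x in chunk:
        # Stand-in for the DPM sampling step: nearest known mean within radius.
        k = min(range(len(means)), key=lambda j: abs(x - means[j]), default=None)
        if k is None or abs(x - means[k]) > radius:
            means.append(x)            # local new cluster
            stats.append(SuffStats())
            k = len(means) - 1
        stats[k].add(x)
    return [st for st in stats if st.n > 0]

def master_step(all_stats, radius=1.0):
    """Step 3: synchronize by merging local clusters whose means are close."""
    merged = []
    for st in all_stats:
        for g in merged:
            if abs(g.mean - st.mean) <= radius:
                g.merge(st)
                break
        else:
            merged.append(SuffStats(st.n, st.s, st.ss))
    return merged

chunks = [[0.1, -0.2, 5.1], [4.9, 0.0, 5.2], [10.0, 9.8]]  # one chunk per worker
global_means = []
for _ in range(2):
    local = [st for chunk in chunks for st in worker_step(chunk, global_means)]
    clusters = master_step(local)
    # Step 4: send updated cluster parameters and sizes back to the workers.
    global_means = [c.mean for c in clusters]
    sizes = [c.n for c in clusters]

print(global_means, sizes)
```

Because sums and counts are additive, the master can merge clusters found on different workers without seeing the data, which is what keeps the communication cost low.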

Contacts:

Khadidja Meguelati - khadidja.meguelati@inria.fr - INRA MISTEA - Inria, LIRMM, Univ Montpellier, CNRS - GitHub

Bénédicte Fontez - benedicte.fontez@supagro.fr - Montpellier SupAgro MISTEA, Univ Montpellier

Nadine Hilgert - nadine.hilgert@inra.fr - INRA MISTEA, Univ Montpellier

Guilhem Huau - guilhem.huau@supagro.fr - engineering student at Montpellier SupAgro

Florent Masseglia - florent.masseglia@inria.fr - Inria, LIRMM, Univ Montpellier, CNRS

Isabelle Sanchez - isabelle.sanchez@inra.fr - INRA MISTEA, Univ Montpellier

GPL-3 license - 2019
