Massively distributed clustering via DPM

DPM Distributed Clustering


This code implements massively distributed clustering for multivariate and functional data (code available on GitHub). The methods are described in the two papers below.

High Dimensional Data Clustering by means of Distributed Dirichlet Process Mixture Models: Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. IEEE International Conference on Big Data (IEEE BigData), Dec 2019, Los Angeles, United States.

Dirichlet Process Mixture Models made Scalable and Effective by means of Massive Distribution: Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. SAC 2019 – 34th Symposium On Applied Computing, Apr 2019, Limassol, Cyprus, pp. 502-509.

Dirichlet process mixture (DPM) clustering is commonly illustrated with the Chinese restaurant process.

Correspondence between restaurant terms and statistical terms:
a table = a cluster
a client = an observation linked to a cluster label
a dish = parameters of a cluster
menu = space of all possible clusters
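
To make the metaphor concrete, here is a minimal sketch of the Chinese restaurant process as a prior over partitions: each new client joins an existing table with probability proportional to its size, or opens a new table with probability proportional to a concentration parameter. The function name, the parameter name `alpha`, and the seed are illustrative assumptions, not identifiers from this project's code.

```python
import random

def chinese_restaurant_process(n_clients, alpha, seed=0):
    """Seat clients one by one; return each client's table and the table sizes.

    alpha (> 0) is the concentration parameter (hypothetical name).
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of clients at table k
    assignments = []   # assignments[i] = table chosen by client i
    for _ in range(n_clients):
        # Existing table k has weight n_k; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table (a new cluster)
        else:
            table_sizes[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, table_sizes

assignments, table_sizes = chinese_restaurant_process(100, alpha=1.0)
```

Larger values of `alpha` tend to open more tables. This sketch only samples the prior; it does not look at the data.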

Quantity used for each kind of table, by method:

Table type       DC-DPM        HD4C
Dressed table    Likelihood    Likelihood GP
New table        Predictive    TD approximation of the predictive
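
The table above pairs a likelihood with already dressed tables and a predictive with a new table. As a rough illustration of how these two quantities can enter a client's choice of table, the sketch below assumes a toy model that is not taken from the papers: 1-D Gaussian clusters with known variance `sigma2` and a Gaussian prior N(`mu0`, `tau2`) on the cluster mean, so that the prior predictive is N(`mu0`, `sigma2 + tau2`). All names are hypothetical, and the GP likelihood and TD approximation used by HD4C are not covered.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def table_probabilities(x, sizes, means, alpha, sigma2, mu0, tau2):
    """Probability that observation x joins each dressed table or a new one.

    Dressed table k: weight = size_k * likelihood of x under its mean.
    New table:       weight = alpha * prior predictive density of x.
    The last entry of the returned list is the new-table probability.
    """
    weights = [n * normal_pdf(x, m, sigma2) for n, m in zip(sizes, means)]
    weights.append(alpha * normal_pdf(x, mu0, sigma2 + tau2))
    total = sum(weights)
    return [w / total for w in weights]

probs = table_probabilities(
    x=0.1, sizes=[10, 5], means=[0.0, 5.0],
    alpha=1.0, sigma2=1.0, mu0=0.0, tau2=10.0,
)
```

In this example the observation lies close to the first table's mean, so most of the probability mass goes to that table.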



The workflow of our DC-DPM approach consists of four steps:

  1. Identify local new clusters in the workers
  2. Compute and send sufficient statistics and cluster sizes from each worker to the master
  3. Synchronize and estimate cluster labels from sufficient statistics
  4. Send updated cluster parameters and cluster sizes from master to workers
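
The exchange between workers and master in steps 2 to 4 can be sketched in a single process as follows. This is a simplified stand-in under stated assumptions, not the project's implementation: data are 1-D, the sufficient statistics are the pair (size, sum), local labels from step 1 are taken as given, and the master's synchronization is replaced by a plain greedy merge of local clusters with close means instead of a DPM. The function names and the `merge_distance` parameter are hypothetical.

```python
from collections import defaultdict

def worker_statistics(points, labels):
    """Step 2: per local cluster, compute (size, sum of points)."""
    stats = defaultdict(lambda: [0, 0.0])
    for x, k in zip(points, labels):
        stats[k][0] += 1
        stats[k][1] += x
    return [(n, s) for n, s in stats.values()]

def master_synchronize(all_stats, merge_distance):
    """Steps 3-4 (stand-in): merge close local clusters, return (size, mean)."""
    global_clusters = []  # each entry is [size, sum]
    # Visit local clusters in order of their mean so neighbours are adjacent.
    for n, s in sorted(all_stats, key=lambda t: t[1] / t[0]):
        if global_clusters:
            gn, gs = global_clusters[-1]
            if abs(s / n - gs / gn) <= merge_distance:
                global_clusters[-1] = [gn + n, gs + s]
                continue
        global_clusters.append([n, s])
    # What the master would send back to the workers.
    return [(n, s / n) for n, s in global_clusters]

# Two workers, each holding its own points and local cluster labels.
worker_a = worker_statistics([0.1, -0.2, 5.1], [0, 0, 1])
worker_b = worker_statistics([0.0, 4.9, 5.0], [0, 1, 1])
global_params = master_synchronize(worker_a + worker_b, merge_distance=1.0)
```

Only the per-cluster statistics travel to the master, never the raw points, which is what keeps the communication cost small.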

Contacts:

  • Khadidja Meguelati - khadidja.meguelati@inria.fr - INRA MISTEA - Inria, LIRMM, Univ Montpellier, CNRS

  • Bénédicte Fontez - benedicte.fontez@supagro.fr - Montpellier SupAgro MISTEA, Univ Montpellier
  • Nadine Hilgert - nadine.hilgert@inra.fr - INRA MISTEA, Univ Montpellier
  • Guilhem Huau - guilhem.huau@supagro.fr - student-engineer from Montpellier SupAgro
  • Florent Masseglia - florent.masseglia@inria.fr - Inria, LIRMM, Univ Montpellier, CNRS
  • Isabelle Sanchez - isabelle.sanchez@inra.fr - INRA MISTEA, Univ Montpellier

License:

  • GPL-3 license - 2019