Kandinsky: Using KMeans (and friends) to play with the colors of photograph(s)
Clustering is tricky, yet essential to many a Machine Learning initiative. The what, the how and the why confound us each time we look at the data, whether we are doing customer segmentation (or cohort) analysis, finding centers of influence, or breaking a population down into groups so that a different model can be built for each.
Studying clustering algorithms like KMeans on toy datasets is insufficient (and often tedious) because it does not let you experience real-world problems: centroids that don't settle, for example, or situations where we have too many or too few clusters. Which distance measure should we use, and when? How should we prepare (normalize? standardize?) the dataset for clustering?
Also, few real-world scenarios are "visual" unless we plot a graph or two, and plotting fails once we deal with higher dimensions.
What if we could use a non-trivial but visual data source? Like the colors and pixels of a photograph, where we could see the data that went in and the resultant output clusters?
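To make the idea concrete, here is a minimal sketch that treats every pixel as a point in RGB space and clusters the points with a plain NumPy version of KMeans. It is an illustration under stated assumptions, not the code used in the talk: the function name `quantize_colors`, its parameters, and the random-noise stand-in image are all hypothetical, and the naive random initialization is chosen only for brevity.

```python
import numpy as np


def quantize_colors(image, k, n_iter=20, seed=0):
    """Reduce an (H, W, 3) uint8 image to at most k colors using plain KMeans.

    Hypothetical helper for illustration; not taken from the talk.
    """
    rng = np.random.default_rng(seed)
    # Each pixel is one data point in 3-D RGB space, scaled to [0, 1].
    pixels = image.reshape(-1, 3).astype(np.float64) / 255.0
    # Naive initialization: pick k random pixels as the starting centroids.
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]

    def assign(points, centers):
        # Squared Euclidean distance from every pixel to every centroid.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(pixels, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # the centroids have settled
        centroids = new_centroids

    # Final assignment, then repaint every pixel with its centroid's color.
    labels = assign(pixels, centroids)
    quantized = (centroids[labels] * 255).round().astype(np.uint8)
    return quantized.reshape(image.shape), centroids


# Stand-in for a photograph: random noise of shape (height, width, RGB).
rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
quantized, palette = quantize_colors(image, k=4)
```

With a real photograph loaded as an array of the same shape, viewing `image` and `quantized` side by side shows what went in and which clusters came out, and varying `k` shows the effect of too many or too few clusters directly.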
The obvious **takeaways of this talk**, in my experience, are that *Data Science and Data Engineering practitioners* gain a deeper understanding of what is going on inside clustering algorithms in a fun, very "visual" and engaging manner, and build a better intuition about the best approach to take for solving a problem.
About Shaurya Agarwal
Deputy Head - Engineering, at Barnes and Noble (BNED LoudCloud).
With 20+ years of experience in Analytics & Machine Learning, Big Data and Cloud Computing, Shaurya leads the engineering teams at BNED that are building the next generation of data products for the company.