Unsupervised Learning Demystified

Unsupervised learning is a pattern-finding technique for mining inspiration from your data. Let's demystify!
By Cassie Kozyrkov, Google.

Unsupervised learning may sound like a fancy way to say “let the kids learn on their own not to touch the hot oven” but it’s actually a pattern-finding technique for mining inspiration from your data. It has nothing to do with machines running around without adult supervision, forming their own opinions about things. Let’s demystify!

If this feels familiar, unsupervised machine learning might be your new best friend.

This post is beginner-friendly, but assumes you’re familiar with the story so far:

  • Machine learning is all about labeling things using examples.
  • If you train your system by feeding it the answers you’re looking for, you’re doing supervised learning.
  • To get started with supervised learning you need to know what labels you want. (Not so with unsupervised.)
  • Standard jargon includes instance, feature, label, model, and algorithm.

What’s unsupervised learning?

 


Your mission? Put these six images into two groups however you like.

Check out the six instances above. What’s missing? These photographs are not accompanied by labels. No worries, your brain is pretty good at unsupervised learning. Let’s try it.

Think about how you’d like to split these images into two groups. There are no wrong answers. Ready?

Clustering the data

 
In a live class, Googlers shout out answers like “sitting versus standing,” “can see a wooden floor versus can’t,” “cat selfie vs not cat selfie,” and so on. Let’s examine the first answer.


One way to split the images into two clusters: sitting versus standing. Well, “sitting” versus standing.

Unsupervised learning’s secret labels

 
If you chose to define your clusters based on whether the cats are standing, what are the labels your system outputs? Machine learning is about labeling things, after all.

If you’re thinking “sitting vs standing” are the labels, think again! That’s the recipe (model) you’re using for creating your clusters. The labels in unsupervised learning are far more boring: something like “Group 1 and Group 2” or “A or B” or “0 or 1”. They simply indicate group membership, and they have no additional human-interpretable (or poetic) meaning.


Unsupervised learning’s labels simply indicate cluster membership. They have no higher human-interpretable meaning, as disappointingly boring as that may feel.

All that is happening here is that the algorithm groups things by similarity. The similarity measure is specified by the choice of algorithm, but why not try as many as possible? After all, you don’t know what you’re looking for and that’s okay. Think of unsupervised learning as a sort of mathematical version of making “birds of a feather flock together.”
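To make “labels are just group membership” concrete, here is a minimal sketch of one common similarity-based algorithm, k-means, in plain Python. The (x, y) points and the choice of k=2 are made up purely for illustration; in practice you’d reach for a library implementation.

```python
import random

def kmeans(points, k=2, iters=20, seed=0):
    """Toy k-means: group (x, y) points by Euclidean similarity."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start centers at k random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        for j, (x, y) in enumerate(points):
            labels[j] = min(
                range(k),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
        # Update step: move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centers[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels

# Two well-separated, made-up blobs of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(points))  # just 0s and 1s: membership, not meaning
```

Notice that the output is a list of bare group IDs. The algorithm never says *why* the points belong together; any “sitting versus standing” story is one you tell afterwards.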

Like a Rorschach card, the results are there to help you dream.

Don’t take whatever you see in them too seriously.

Look again!

 
As the proud mother of these two individual cats, I’m saddened that in the 50 or so times I’ve taught this lesson, only one audience noticed: “Cat 1 versus Cat 2.” Instead, I hear answers like “sitting vs standing” or “wooden floor absent/present” or sometimes even “ugly cats versus pretty cats.” (Awww.)


Turns out these were photos of my two individual cats! Maybe you spotted it, but most of my audiences don’t… unless I give them the labels (supervise their learning). If I’d presented the data with name labels in the first place and then asked you to classify the next photo, I bet you’d find the task easy.

Lessons learned

 
Imagine I’m a novice data scientist getting started with unsupervised learning and (naturally!) interested in my own two cats. I won’t be able to unsee my cats when I look at these data. Because my cuddlebugs are so meaningful to me, I expect my unsupervised machine learning system to be able to recover the only thing worth caring about here. Oops!

Before this decade, computers couldn’t even hope to compete with the best pattern finder in the world for this kind of task: the human brain. This is easy for people! So why did the thousands of Googlers who saw these unlabeled photos miss the “Cat 1 versus Cat 2” answer?

Think of unsupervised learning as a sort of mathematical version of making “birds of a feather flock together.”

Just because something’s interesting to me doesn’t mean my pattern finder will find it. Even if the pattern finder is awesome, I didn’t tell it what I’m looking for, so why would I expect my learning algorithm to deliver it? This isn’t magic! If I don’t tell it what the right answers are... I get what I get and I don’t get upset. All I can do is look at the clusters the system returns for me and see if I find them inspiring. If I don’t like 'em, I just run a different unsupervised algorithm (“Someone else in the audience, split them for me a different way”) over and over until something feels interesting.
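The “keep trying until something feels interesting” loop can be as simple as swapping the similarity recipe. A hypothetical sketch: the same six made-up feature pairs get split one way by the first feature and a different way by the second. The features, points, and median-threshold rule are all invented for illustration; neither split is more “correct” than the other.

```python
def split_by_feature(points, feature):
    """Cluster into two groups: below vs at-or-above the feature's median."""
    values = sorted(p[feature] for p in points)
    median = values[len(values) // 2]
    return [0 if p[feature] < median else 1 for p in points]

# Made-up (x, y) feature pairs for six photos.
points = [(0, 9), (1, 8), (2, 1), (8, 9), (9, 0), (10, 1)]

print(split_by_feature(points, 0))  # one split of the data
print(split_by_feature(points, 1))  # a different split of the same data
```

Two recipes, two partitions of identical data. Your job is only to look at each result and ask whether it inspires you.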

The results are a Rorschach card to help you dream.

There’s no guarantee that anything inspiring will come out of the process, but it doesn’t hurt to try. Exploring the unknown is supposed to be a bit of an adventure, after all. Have fun with it!

In future episodes, we’ll look at cautionary tales of what can go wrong if you forget that the labels are just an inspiration and shouldn’t be taken too seriously, let alone treated as human-interpretable. (Hint: there may be mention of finding Elvis in a slice of toast.) They’re just there to give you ideas about what you might like to dive into next.

Summary: Unsupervised learning helps you find inspiration in data by grouping similar things together for you. There are many different ways of defining similarity, so keep trying algorithms and settings until a cool pattern catches your eye.

 
Bio: Cassie Kozyrkov is Chief Decision Intelligence Engineer at Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are hers.

Original. Reposted with permission.

Related:

  • The 5 Clustering Algorithms Data Scientists Need to Know
  • K-Means in Real Life: Clustering Workout Sessions
  • Clustering Using K-means Algorithm