Rand Score

Introduction#

The Rand score is metric that shows us the similarity between 2 data clusterings or one clustering and the ground truth.

It is formally defined by this formula:

Rand \ Score = \frac{TP + TN}{TP + FP + TN + FN}

If you're unsure what TP, TN, FP and FN are, you can read more about them here.

Here's a meme if you just want a quick recap.

In the context of clustering, here's what they mean:

TP: Same class and Same cluster FP: Same class and Different cluster TN: Different class and Same cluster FN: Different class and Different cluster

The value of Rand score is between 0 and 1, 0 indicating that both clusters do not match at all and 1 indicating that the clustering is completely same as the other.

Another form of Rand score is:

Rand \ Score = \frac{a + b}{n \choose 2}

a is the number of pairs of points that are in the same cluster both clustering algorithims
b is the times the pair of points are in different clusters belong in different clustering algorithims.
ⁿC₂ is the total number of unordered pairs.

Let's understand this better with an example.

Example#

We have data of 5 flowers and we cluster them using 2 different clustering algorithms, here are the results.

Clustering 1: {1,1,1,2,2} Clustering 2: {1,1,2,2,3}

Now we will calculate the Rand score for both clustering algorithms, starting with counting the number of unordered pairs.

We have 5 flowers in the dataset, let's call them a, b, c, d, e.

Dataset: {A, B, C, D, E}

The 10 unordered pairs are: {A, B}, {A, C}, {A, D}, {A, E}, {B, C}, {B, D}, {B, E}, {C, D}, {C, E}, {D, E}, this is ⁿC₂ part of the equation.

The unordered pairs that are in the same in both clusters is {A, B} = 1, this is 'a'.

Now we find 'b' which includes {A, D}, {A, E}, {B, D}, {B, E} and {C, E}, these are unordered pairs that belong to different clusters across both clustering methods = 5

Plugging in all these values we get:

R = \frac{a + b}{n \choose 2} = \frac{1 + 5}{10} = 0.6