K-Means Clustering Using R

What is Cluster Analysis?

Cluster analysis is the process of dividing data into groups (clusters) whose members share common characteristics, i.e. objects in the same cluster are more similar to each other than to objects in other clusters.

R is a free software environment for statistical computing and graphics. It is open-source software, released under the GNU General Public License, and is widely used for statistical analysis and data visualization.

What is K-Means?

K-means clustering is one of the most widely used unsupervised machine learning algorithms for partitioning a given data set (here, iris) into k groups (i.e. k clusters), where k is the number of groups pre-specified by the analyst. It assigns objects to clusters such that objects within the same cluster are as similar as possible, while objects in different clusters are as dissimilar as possible.


The K-means algorithm can be described in four common steps:

Step 1. Initialization- The set of objects to be partitioned and the number of groups are specified, and an initial centroid is defined for each group.

Step 2. Classification- For each object, the distance to each centroid is calculated, the closest centroid is determined, and the object is assigned to the group of that centroid.

Step 3. Centroid calculation- The centroid of each group is recalculated.

Step 4. Convergence condition- Several convergence conditions have been used, of which the most common are: stopping after a given number of iterations, stopping when no objects move between groups, or stopping when the difference between the centroids at two consecutive iterations is smaller than a given threshold. If the convergence condition is not satisfied, steps two, three and four of the algorithm are repeated.
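The four steps above can be sketched in a few lines of R. This is only an illustrative sketch: the function name simple_kmeans and its arguments are hypothetical, it starts from randomly chosen rows, and it is not how the built-in kmeans() function is implemented.

```r
# Minimal k-means sketch (hypothetical helper, not stats::kmeans)
simple_kmeans <- function(x, k, max_iter = 15, tol = 1e-8) {
  x <- as.matrix(x)
  # Step 1: initialization - use k randomly chosen rows as starting centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: classification - squared Euclidean distance of every object
    # to every centroid, then pick the closest centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Step 3: centroid calculation - mean of the objects in each group
    # (an empty group keeps its previous centroid)
    new_centers <- centers
    for (j in seq_len(k)) {
      members <- new_cluster == j
      if (any(members)) new_centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
    # Step 4: convergence - stop when no object changes group
    # or the centroids move less than tol
    converged <- all(new_cluster == cluster) ||
      max(abs(new_centers - centers)) < tol
    cluster <- new_cluster
    centers <- new_centers
    if (converged) break
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

set.seed(1)  # arbitrary seed so the random start is repeatable
fit <- simple_kmeans(iris[, 1:4], k = 3)
fit$centers
table(fit$cluster)
```

Because the starting centroids are random, different seeds can give different groupings. The max_iter argument plays the same role as the iteration limit passed to kmeans() later in this post.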

In k-means clustering, each cluster is represented by its center (the centroid), which is the mean of the points assigned to the cluster.


Here we use the Iris data set, a database of Iris plant measurements with 150 instances (50 in each class), 4 attributes (sepal length, sepal width, petal length, petal width) and 3 classes (Iris setosa, Iris versicolor, Iris virginica).
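The figures above can be checked directly in R, since iris ships with the base installation. A quick sketch:

```r
# iris is part of the built-in datasets package
data(iris)

dim(iris)            # number of rows and columns
names(iris)          # four measurement columns plus Species
table(iris$Species)  # number of observations per species
```

Here dim() should report 150 rows and 5 columns (the four measurements plus the Species label), and table() should show 50 observations for each of the three species.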
 
Here is the code for K-means clustering of the iris data.

#get the working directory
> getwd()

[1] "C:/Users/REENA/OneDrive/Documents"

#set the working directory

> setwd("d:/NEW")

> getwd()

[1] "d:/NEW"

#View the data of iris

> View(iris)

> plot(iris)




> firis<-iris

> head(firis)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> firis$Species<-NULL

> head(firis)

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

> ds<-firis$Petal.Length

> km<-kmeans(ds, centers = 3, iter.max = 15)

> km

K-means clustering with 3 clusters of sizes 66, 34, 50

Cluster means:
      [,1]
1 4.431818
2 5.826471
3 1.462000

Clustering vector:
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [41] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [81] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 1 2 2 1 1 2 2 2 2 1
[121] 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 1 2 2 1

Within cluster sum of squares by cluster:
[1] 17.323182  6.506176  1.477800
 (between_SS / total_SS =  94.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"    
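The percentage printed at the end of the summary (between_SS / total_SS) can be recomputed from the components listed above. The sketch below refits the model so that it runs on its own. The seed is an arbitrary choice, so cluster numbering and sizes may differ from the run shown in this post.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable
km <- kmeans(iris$Petal.Length, centers = 3, iter.max = 15)

# Share of the total variation explained by the clustering
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)

# The total sum of squares splits into within-cluster and between-cluster parts
all.equal(km$totss, km$tot.withinss + km$betweenss)
all.equal(km$tot.withinss, sum(km$withinss))
```

In the run shown above, the three within-cluster sums of squares add up to about 25.31. This is the small remainder left after the 94.5 % that is explained by the separation between clusters.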
  
> km$cluster

  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [41] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [81] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 1 2 2 1 1 2 2 2 2 1
[121] 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2 1 2 2 1

> km$centers

      [,1]
1 4.431818
2 5.826471
3 1.462000

> km$iter

[1] 2

> plot(ds,col=km$cluster)
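One way to see how well the clusters line up with the known species is to cross-tabulate the two. This is a sketch rather than part of the original session: the seed and the nstart value are assumptions, chosen so that kmeans() tries several random starts and keeps the best one.

```r
set.seed(42)  # arbitrary seed for repeatability
km <- kmeans(iris$Petal.Length, centers = 3, iter.max = 15, nstart = 25)

# Rows are cluster numbers, columns are the true species
tab <- table(cluster = km$cluster, species = iris$Species)
tab
```

Cluster numbers are arbitrary labels, so what matters is whether each species falls mostly into a single row, not which row it is.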



