Implementation of K-means Clustering algorithm using Python and Spark on HDFS

aayush210789/Large-Scale-Clustering-on-StackOverflow-Data

Large-Scale-Clustering-on-StackOverflow-Data

Objective

The main objective of this project is to implement the K-means clustering algorithm using Python and Spark on HDFS. The following goals are achieved through clustering:

- Clustering is applied to user data to group similar users by their skills. Skills are quantified using appropriate features.

- K-means is also used to build homogeneous clusters of posts based on their popularity, which is determined from suitable features.

- The elbow method is used to identify the optimal number of clusters, and preprocessing techniques such as normalization and one-hot encoding are implemented without using the MLlib library.

- The results are discussed and justified, and the performance of the algorithm is evaluated based on its output.

- Lastly, the MLlib library is used to obtain outputs for both cases, and the results are compared.
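
The steps listed above can be illustrated with a minimal, single-machine sketch in plain Python. This is not the repository's code: the function names (`min_max_normalize`, `one_hot`, `kmeans`) and the toy data are assumptions made for illustration. The two phases of each iteration are marked "map" and "reduce" because, in a Spark version, they would presumably correspond to an RDD `map` followed by `reduceByKey`.

```python
import random
from collections import defaultdict


def min_max_normalize(rows):
    """Scale each column of `rows` to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


def one_hot(values):
    """Encode categorical values as 0/1 vectors, one slot per distinct value."""
    index = {v: i for i, v in enumerate(sorted(set(values)))}
    return [[1.0 if index[v] == i else 0.0 for i in range(len(index))] for v in values]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centroids):
    """Index of the centroid nearest to `point`."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm. Returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    for _ in range(iterations):
        # "Map": key each point by the index of its nearest centroid.
        groups = defaultdict(list)
        for p in points:
            groups[closest(p, centroids)].append(p)
        # "Reduce": average each group; an empty cluster keeps its old centroid.
        new_centroids = [
            [sum(c) / len(groups[i]) for c in zip(*groups[i])] if groups[i] else centroids[i]
            for i in range(k)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    wcss = sum(squared_distance(p, centroids[closest(p, centroids)]) for p in points)
    return centroids, wcss


if __name__ == "__main__":
    # Toy data (made up): two numeric features per record.
    raw = [[1, 10], [2, 12], [1, 11], [50, 200], [52, 210], [51, 205]]
    data = min_max_normalize(raw)
    # Elbow method: print WCSS for increasing k and look for the bend.
    for k in (1, 2, 3):
        _, wcss = kmeans(data, k)
        print(k, round(wcss, 4))
```

For the elbow method, the within-cluster sum of squares is computed for a range of `k` values, and the chosen `k` is the point where adding another cluster stops reducing it substantially.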

For the detailed report, refer to Kmeans.pdf.
