Large-Scale-Clustering-on-StackOverflow-Data

Objective

The main objective of the project is to implement the K-means clustering algorithm using Python and Spark on HDFS. The following goals are achieved through clustering:

- Clustering is applied to user data to group similar users by their skills, which are quantified using appropriate features.

- K-means is also used to form homogeneous clusters of posts by popularity, which is determined from suitable features.

- The elbow method is used to identify the optimal number of clusters, and preprocessing techniques such as normalization and one-hot encoding are implemented without using the MLlib library.

- The results are discussed and justified, and the performance of the algorithm is evaluated based on the output.

- Lastly, the MLlib library is used to obtain the outputs for both cases, and the results are compared.
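The preprocessing and clustering steps listed above can be illustrated with a minimal single-machine sketch. It is not the project's Spark code: it uses plain Python only, assumes min-max scaling as the normalization (the README does not say which variant is used), and all function names and sample features (`min_max_normalize`, `one_hot`, `kmeans`, the reputation and tag values) are hypothetical. In a Spark version, the assignment step would typically become a `map` over an RDD and the centroid update a `reduceByKey`.

```python
import random


def min_max_normalize(rows):
    """Scale each numeric column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def one_hot(values):
    """Encode categorical values as 0/1 vectors, one slot per sorted category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [
        [1.0 if index[v] == i else 0.0 for i in range(len(categories))]
        for v in values
    ]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            [sum(dim) / len(members) for dim in zip(*members)]
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


if __name__ == "__main__":
    # Hypothetical user features: (reputation, answer count) plus a tag.
    numeric = [[10, 1], [20, 2], [15, 1], [900, 80], [950, 90], [1000, 85]]
    tags = ["python", "java", "python", "spark", "spark", "java"]
    features = [n + t for n, t in
                zip(min_max_normalize(numeric), one_hot(tags))]

    # Elbow method: print WCSS for several k and look for the bend.
    for k in range(1, 5):
        _, wcss = kmeans(features, k)
        print(k, round(wcss, 3))
```

The elbow is the value of k after which the within-cluster sum of squares stops dropping sharply; plotting the printed pairs makes the bend easier to spot than reading the numbers.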

For the detailed report, refer to Kmeans.pdf.

aayush210789/Large-Scale-Clustering-on-StackOverflow-Data

Folders and files

Latest commit

History

Repository files navigation

Large-Scale-Clustering-on-StackOverflow-Data

Objective

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages