The main objective of the project is to implement the K-means clustering algorithm using Python and Spark on HDFS. The following goals are achieved through clustering:
- Clustering is applied to user data to group similar users by their skills, which are quantified using appropriate features.
- K-means is also used to form homogeneous clusters of posts based on their popularity, which is determined from suitable features.
- The elbow method is used to identify the optimal number of clusters, and preprocessing techniques such as normalization and one-hot encoding are implemented without using the MLlib library.
- The results are discussed and justified, and the performance of the algorithm is evaluated based on its output.
- Lastly, the MLlib library is used to obtain the outputs for both cases, and the results are compared with those of the custom implementation.
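The steps listed above (normalization, one-hot encoding, K-means, and the elbow method) can be illustrated with a minimal single-machine sketch in plain Python. This is not the project's Spark code: the function names (`min_max_normalize`, `one_hot`, `kmeans`, `elbow_curve`), the choice of min-max scaling, the random initialization, and the sample data are all assumptions made for illustration.

```python
import random


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def one_hot(values):
    """Encode categorical values as 0/1 vectors (categories in sorted order)."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [[1.0 if index[v] == i else 0.0 for i in range(len(index))]
            for v in values]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm; returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    # Assumption: centroids start as k distinct data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)]
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    wcss = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, wcss


def elbow_curve(points, max_k):
    """Return (k, WCSS) pairs; the 'elbow' is where the drop in WCSS flattens."""
    return [(k, kmeans(points, k)[1]) for k in range(1, max_k + 1)]


if __name__ == "__main__":
    # Hypothetical data: two numeric features per user.
    data = min_max_normalize([[1, 200], [2, 180], [1, 210],
                              [9, 20], [8, 35], [10, 25]])
    for k, wcss in elbow_curve(data, 4):
        print(k, round(wcss, 4))
```

In a Spark version, the assignment step would typically become a `map` over an RDD of points and the update step a reduction by cluster index, but that mapping is an assumption about the design rather than a description of the project's code.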
For the detailed report, refer to Kmeans.pdf.