GitHub - lkmokadam/TFIDF_Spark: Implementation of TFIDF using Apache Spark

lkmokadam / TFIDF_Spark Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Implementation of TFIDF using Apache Spark

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README		README
TFIDF.java		TFIDF.java

Repository files navigation

Run commands:

javac TFIDF.java
jar cf TFIDF.jar TFIDF*.class
spark-submit --class TFIDF TFIDF.jar input &> spark_output.txt
grep -v '^2017\|^(\|^-' spark_output.txt > output.txt

Explanation:

tfRDD :
Map: Map implementation using flatMapToPair
input : key value pair from wordsRDD 
	This function reformat the key and value as word@document and 1/docSize adds into the ArrayList and return the ArrayList. 
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from Map function
	sums all the numerators of the values of a a particular key, returns the total occurance of word in the document 


idfRDD :
Map: Map implementation using flatMapToPair
input: key value pair from tfRDD
    This function just do the reformating of the key  and value from ((word@document),(wordCount/docSize)) to (word,(1/document)), creates the ArrayList and adds the Tuple in it and return the ArrayList
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from Map function
	This function reduces the key value by adding the numerators and concatinating all the denominator strings as mentioned in comment section above the idfRDD. i.e. from (word,(1/document)) to  (word,(numDocsWithWord/document1,document2...) 
Map: Map implementation using flatMapToPair
input: key value pair from Reduce
    This function splits the value to seperate out the document array the iterate over the document loop, concatinate the word , @ and document create value using splitted numDocsWithWord and numDocs provided. This converst key value pair from (word, (numDocsWithWord/document1,document2...)) to ((word@document),(numDocs/numDocsWithWord));


tfFinalRDD: 
Map: Map implementation using mapToPair
input: key value pair from idfRDD
    This function just calculate the TF from value. It just splits wordCount and docSize then calculate TF = wordCount/docSize. 


idfFinalRDD:
Map: Map implementation using mapToPair
input: key value pair from idfRDD
    This function just calculate the IDF from value. It just splits numDocs and numDocsWithWord, then calculate IDF = log(numDocs/numDocsWithWord).


tfidfRDD:
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from tfFinalRDD union idfFinalRDD
    This function just calculate the TFIDF from TF and IDF of the particular key, then calculate TFIDF = TF*IDF.
Map: Map implementation using flatMapToPair
input: key value pair from tfidfRDD
    This function just reformats the key from ((word@document),TFIDF) to ((document@word),TFIDF)