Skip to content

lkmokadam/TFIDF_Spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 

Repository files navigation

Run commands:

javac TFIDF.java
jar cf TFIDF.jar TFIDF*.class
spark-submit --class TFIDF TFIDF.jar input &> spark_output.txt
grep -v '^2017\|^(\|^-' spark_output.txt > output.txt

Explanation:

tfRDD :
Map: Map implementation using flatMapToPair
input : key value pair from wordsRDD 
	This function reformat the key and value as word@document and 1/docSize adds into the ArrayList and return the ArrayList. 
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from Map function
	sums all the numerators of the values of a a particular key, returns the total occurance of word in the document 


idfRDD :
Map: Map implementation using flatMapToPair
input: key value pair from tfRDD
    This function just do the reformating of the key  and value from ((word@document),(wordCount/docSize)) to (word,(1/document)), creates the ArrayList and adds the Tuple in it and return the ArrayList
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from Map function
	This function reduces the key value by adding the numerators and concatinating all the denominator strings as mentioned in comment section above the idfRDD. i.e. from (word,(1/document)) to  (word,(numDocsWithWord/document1,document2...) 
Map: Map implementation using flatMapToPair
input: key value pair from Reduce
    This function splits the value to seperate out the document array the iterate over the document loop, concatinate the word , @ and document create value using splitted numDocsWithWord and numDocs provided. This converst key value pair from (word, (numDocsWithWord/document1,document2...)) to ((word@document),(numDocs/numDocsWithWord));


tfFinalRDD: 
Map: Map implementation using mapToPair
input: key value pair from idfRDD
    This function just calculate the TF from value. It just splits wordCount and docSize then calculate TF = wordCount/docSize. 


idfFinalRDD:
Map: Map implementation using mapToPair
input: key value pair from idfRDD
    This function just calculate the IDF from value. It just splits numDocs and numDocsWithWord, then calculate IDF = log(numDocs/numDocsWithWord).


tfidfRDD:
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from tfFinalRDD union idfFinalRDD
    This function just calculate the TFIDF from TF and IDF of the particular key, then calculate TFIDF = TF*IDF.
Map: Map implementation using flatMapToPair
input: key value pair from tfidfRDD
    This function just reformats the key from ((word@document),TFIDF) to ((document@word),TFIDF)

About

Implementation of TFIDF using Apache Spark

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages