-
Notifications
You must be signed in to change notification settings - Fork 0
lkmokadam/TFIDF_Spark
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Run commands: javac TFIDF.java jar cf TFIDF.jar TFIDF*.class spark-submit --class TFIDF TFIDF.jar input &> spark_output.txt grep -v '^2017\|^(\|^-' spark_output.txt > output.txt Explanation: tfRDD : Map: Map implementation using flatMapToPair input : key value pair from wordsRDD This function reformat the key and value as word@document and 1/docSize adds into the ArrayList and return the ArrayList. Reduce: Reduce implementation using reduceByKey input : key , list of values pair from Map function sums all the numerators of the values of a a particular key, returns the total occurance of word in the document idfRDD : Map: Map implementation using flatMapToPair input: key value pair from tfRDD This function just do the reformating of the key and value from ((word@document),(wordCount/docSize)) to (word,(1/document)), creates the ArrayList and adds the Tuple in it and return the ArrayList Reduce: Reduce implementation using reduceByKey input : key , list of values pair from Map function This function reduces the key value by adding the numerators and concatinating all the denominator strings as mentioned in comment section above the idfRDD. i.e. from (word,(1/document)) to (word,(numDocsWithWord/document1,document2...) Map: Map implementation using flatMapToPair input: key value pair from Reduce This function splits the value to seperate out the document array the iterate over the document loop, concatinate the word , @ and document create value using splitted numDocsWithWord and numDocs provided. This converst key value pair from (word, (numDocsWithWord/document1,document2...)) to ((word@document),(numDocs/numDocsWithWord)); tfFinalRDD: Map: Map implementation using mapToPair input: key value pair from idfRDD This function just calculate the TF from value. It just splits wordCount and docSize then calculate TF = wordCount/docSize. idfFinalRDD: Map: Map implementation using mapToPair input: key value pair from idfRDD This function just calculate the IDF from value. It just splits numDocs and numDocsWithWord, then calculate IDF = log(numDocs/numDocsWithWord). tfidfRDD: Reduce: Reduce implementation using reduceByKey input : key , list of values pair from tfFinalRDD union idfFinalRDD This function just calculate the TFIDF from TF and IDF of the particular key, then calculate TFIDF = TF*IDF. Map: Map implementation using flatMapToPair input: key value pair from tfidfRDD This function just reformats the key from ((word@document),TFIDF) to ((document@word),TFIDF)
About
Implementation of TFIDF using Apache Spark
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published