-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
49 lines (38 loc) · 2.49 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
Run commands:
javac TFIDF.java
jar cf TFIDF.jar TFIDF*.class
spark-submit --class TFIDF TFIDF.jar input &> spark_output.txt
grep -v '^2017\|^(\|^-' spark_output.txt > output.txt
Explanation:
tfRDD :
Map: Map implementation using flatMapToPair
input : key value pair from wordsRDD
This function reformat the key and value as word@document and 1/docSize adds into the ArrayList and return the ArrayList.
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from Map function
sums all the numerators of the values of a a particular key, returns the total occurance of word in the document
idfRDD :
Map: Map implementation using flatMapToPair
input: key value pair from tfRDD
This function just do the reformating of the key and value from ((word@document),(wordCount/docSize)) to (word,(1/document)), creates the ArrayList and adds the Tuple in it and return the ArrayList
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from Map function
This function reduces the key value by adding the numerators and concatinating all the denominator strings as mentioned in comment section above the idfRDD. i.e. from (word,(1/document)) to (word,(numDocsWithWord/document1,document2...)
Map: Map implementation using flatMapToPair
input: key value pair from Reduce
This function splits the value to seperate out the document array the iterate over the document loop, concatinate the word , @ and document create value using splitted numDocsWithWord and numDocs provided. This converst key value pair from (word, (numDocsWithWord/document1,document2...)) to ((word@document),(numDocs/numDocsWithWord));
tfFinalRDD:
Map: Map implementation using mapToPair
input: key value pair from idfRDD
This function just calculate the TF from value. It just splits wordCount and docSize then calculate TF = wordCount/docSize.
idfFinalRDD:
Map: Map implementation using mapToPair
input: key value pair from idfRDD
This function just calculate the IDF from value. It just splits numDocs and numDocsWithWord, then calculate IDF = log(numDocs/numDocsWithWord).
tfidfRDD:
Reduce: Reduce implementation using reduceByKey
input : key , list of values pair from tfFinalRDD union idfFinalRDD
This function just calculate the TFIDF from TF and IDF of the particular key, then calculate TFIDF = TF*IDF.
Map: Map implementation using flatMapToPair
input: key value pair from tfidfRDD
This function just reformats the key from ((word@document),TFIDF) to ((document@word),TFIDF)