TFIDF_Hadoop
Implementation of TFIDF using Hadoop
Run:

javac TFIDF.java
jar cf TFIDF.jar TFIDF*.class
hadoop jar TFIDF.jar TFIDF input &> hadoop_output.txt

##########################################################################
Implementation Explanation
##########################################################################

WCMapper:
WCMapper reads the document name and the lines of each file. It splits each line with a tokenizer and, for every token, emits a key that combines the token and the document name as token + "@" + documentName, with a value of 1.

WCReducer:
WCReducer receives key/value pairs in the format (word@document, 1). It simply adds up all values in the value iterator for a given word@document. (A sketch of both classes follows the word count job setup below.)

DSMapper:
DSMapper does nothing but rearrange the data from the file lines. It first splits each line on the tab character, which separates (word@document) from wordCount. Splitting the first part on '@' then yields the word and the document. The strings are then concatenated into the required format, with the value as word + "=" + wordCount.

DSReducer:
In one loop, the word and word count are separated from each value using String's split function, and the total number of words in the document is accumulated in a total variable. In a second loop, a new list entry is built from the total and each word count as wordCount/total, and ((word@document), (wordCount/docSize)) is emitted by iterating over the list and concatenating the required strings. (A sketch of both classes appears after the job setup section below.)

TFIDFMapper:
TFIDFMapper likewise does nothing but rearrange the data from the file lines. It splits each line of the input file to separate the word, document name, word count, and docSize, then concatenates the strings to form the key and value as word and document + "=" + wordCount + "/" + docSize respectively.

TFIDFReducer:
TFIDFReducer first builds a list of all values by iterating over the values iterator, calculating numDocsWithWord along the way by incrementing a counter. In the next loop, a split operation on each value separates the document, wordCount, and docSize, and the TFIDF value is calculated from these using the formula TFIDF = (wordCount/docSize) * ln(numDocs/numDocsWithWord). For example, a word occurring 3 times in a 100-word document and appearing in 2 of 8 documents scores (3/100) * ln(8/2) ≈ 0.042. In the same loop, the key is created by concatenating document + "@" + word, and the key and TFIDF value are stored in the map. (A sketch of both classes appears after the job setup section below.)

##################################################
Job creation and setup
##################################################

WordCount:
For word count, WCMapper and WCReducer are passed to the setMapperClass and setReducerClass functions, and the output key and output value classes are set with the respective functions. The input path and output path are passed as provided. The waitForCompletion function makes sure the previous job completes before the next job launches; in this case it makes sure the word count job completes before the data size job starts.

Job job = Job.getInstance(conf, "word count");
job.setJarByClass(TFIDF.class);
job.setMapperClass(WCMapper.class);
job.setCombinerClass(WCReducer.class);
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, wcInputPath);
FileOutputFormat.setOutputPath(job, wcOutputPath);
job.waitForCompletion(true);
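For reference, a minimal sketch of what WCMapper and WCReducer do as described above; this is an illustration, not the repository's code. The imports and the use of FileSplit to recover the document name are assumptions of this sketch.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch; these would be nested static classes inside the TFIDF driver class.
public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text wordAtDoc = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Recover the document name from the input split being processed (assumption).
        String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // Emit (token@document, 1) for every token in the line.
            wordAtDoc.set(itr.nextToken() + "@" + doc);
            context.write(wordAtDoc, one);
        }
    }
}

public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s for a given word@document key.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Reusing WCReducer as the combiner is safe here because summing counts is associative and commutative, so partial sums computed on the mapper nodes combine into the same final result.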
Using the same technique, the jobs for the data size and TFIDF value calculations are set up as follows.

Data Size:

Job jobc = Job.getInstance(conf, "data size");
jobc.setJarByClass(TFIDF.class);
jobc.setMapperClass(DSMapper.class);
jobc.setReducerClass(DSReducer.class);
jobc.setOutputKeyClass(Text.class);
jobc.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(jobc, dsInputPath);
FileOutputFormat.setOutputPath(jobc, dsOutputPath);
jobc.waitForCompletion(true);

TFIDF:

Job tfidfJob = Job.getInstance(conf, "tfidf");
tfidfJob.setJarByClass(TFIDF.class);
tfidfJob.setMapperClass(TFIDFMapper.class);
tfidfJob.setReducerClass(TFIDFReducer.class);
tfidfJob.setOutputKeyClass(Text.class);
tfidfJob.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(tfidfJob, tfidfInputPath);
FileOutputFormat.setOutputPath(tfidfJob, tfidfOutputPath);
tfidfJob.waitForCompletion(true);

For word count, a combiner is also provided; it runs after the mapper on each mapper computing node but before the reducer, which helps reduce the workload on the reducer system. The WCReducer class is supplied as the combiner.
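A minimal sketch of DSMapper and DSReducer as described above; this is an illustration, not the repository's code. Keying the map output by document name is inferred from the reducer's per-document total, and the java.util imports come in addition to the Hadoop imports of the previous sketch.

import java.util.ArrayList;
import java.util.List;

public static class DSMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line from the word count job: word@document<TAB>wordCount
        String[] keyAndCount = value.toString().split("\t");
        String[] wordAndDoc = keyAndCount[0].split("@");
        // Re-key by document so the reducer sees all words of one document.
        context.write(new Text(wordAndDoc[1]),
                      new Text(wordAndDoc[0] + "=" + keyAndCount[1]));
    }
}

public static class DSReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // First loop: total number of words in the document.
        int total = 0;
        List<String> cached = new ArrayList<String>();
        for (Text val : values) {
            String[] wordAndCount = val.toString().split("=");
            total += Integer.parseInt(wordAndCount[1]);
            cached.add(val.toString());
        }
        // Second loop: emit (word@document, wordCount/docSize) as strings.
        for (String entry : cached) {
            String[] wordAndCount = entry.split("=");
            context.write(new Text(wordAndCount[0] + "@" + key.toString()),
                          new Text(wordAndCount[1] + "/" + total));
        }
    }
}

The values are cached in a list because Hadoop's value iterator can only be traversed once, which is why the two-loop structure described above is needed.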
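Finally, a minimal sketch of TFIDFMapper and TFIDFReducer. This README does not say where numDocs comes from, so reading it from the job Configuration under a hypothetical "numDocs" property is purely an assumption of this sketch, as is emitting results directly instead of storing them in the map mentioned above.

public static class TFIDFMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line from the data size job: word@document<TAB>wordCount/docSize
        String[] keyAndCounts = value.toString().split("\t");
        String[] wordAndDoc = keyAndCounts[0].split("@");
        // Re-key by word so the reducer sees every document containing it.
        context.write(new Text(wordAndDoc[0]),
                      new Text(wordAndDoc[1] + "=" + keyAndCounts[1]));
    }
}

public static class TFIDFReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Assumption: total document count is passed via the Configuration.
        int numDocs = context.getConfiguration().getInt("numDocs", 1);
        // First loop: cache the values and count documents containing the word.
        List<String> cached = new ArrayList<String>();
        int numDocsWithWord = 0;
        for (Text val : values) {
            cached.add(val.toString());
            numDocsWithWord++;
        }
        // Second loop: TFIDF = (wordCount/docSize) * ln(numDocs/numDocsWithWord).
        for (String entry : cached) {
            String[] docAndCounts = entry.split("=");
            String[] counts = docAndCounts[1].split("/");
            double tf = Double.parseDouble(counts[0]) / Double.parseDouble(counts[1]);
            double tfidf = tf * Math.log((double) numDocs / numDocsWithWord);
            context.write(new Text(docAndCounts[0] + "@" + key.toString()),
                          new Text(Double.toString(tfidf)));
        }
    }
}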