TFIDF_Hadoop
Implementation of TFIDF using Hadoop
Run:

javac TFIDF.java
jar cf TFIDF.jar TFIDF*.class
hadoop jar TFIDF.jar TFIDF input &> hadoop_output.txt

##########################################################################
Implementation Explanation
##########################################################################

WCMapper:
WCMapper reads the document name and the lines of each file. It splits each line with a tokenizer and, for every token, emits a key that combines the token and the document name as token + "@" + documentName, with a value of 1.

WCReducer:
WCReducer receives key/value pairs in the format (word@document, 1). It simply adds up all values in the value iterator for a given word@document. (A sketch of both classes follows the word count job setup below.)

DSMapper:
DSMapper does nothing but rearrange the data from the file lines. It first splits each line on the tab character, which separates (word@document) from wordCount. Splitting the first part on '@' then yields the word and the document. The strings are then concatenated into the required format, with the value as word + "=" + wordCount.

DSReducer:
In one loop, the word and word count are separated from each value using String's split function, and the total number of words in the document is accumulated in a total variable. In a second loop, a new list entry is built from the total and each word count as wordCount/total, and ((word@document), (wordCount/docSize)) is emitted by iterating over the list and concatenating the required strings. (A sketch of both classes appears after the job setup section below.)

TFIDFMapper:
TFIDFMapper likewise does nothing but rearrange the data from the file lines. It splits each line of the input file to separate the word, document name, word count, and docSize, then concatenates the strings to form the key and value as word and document + "=" + wordCount + "/" + docSize respectively.

TFIDFReducer:
TFIDFReducer first builds a list of all values by iterating over the values iterator, calculating numDocsWithWord along the way by incrementing a counter. In the next loop, a split operation on each value separates the document, wordCount, and docSize, and the TFIDF value is calculated from these using the formula TFIDF = (wordCount/docSize) * ln(numDocs/numDocsWithWord). For example, a word occurring 3 times in a 100-word document and appearing in 2 of 8 documents scores (3/100) * ln(8/2) ≈ 0.042. In the same loop, the key is created by concatenating document + "@" + word, and the key and TFIDF value are stored in the map. (A sketch of both classes appears after the job setup section below.)

##################################################
Job creation and setup
##################################################

WordCount:
For word count, WCMapper and WCReducer are passed to the setMapperClass and setReducerClass functions, and the output key and output value classes are set with the respective functions. The input path and output path are passed as provided. The waitForCompletion function makes sure the previous job completes before the next job launches; in this case it makes sure the word count job completes before the data size job starts.

Job job = Job.getInstance(conf, "word count");
job.setJarByClass(TFIDF.class);
job.setMapperClass(WCMapper.class);
job.setCombinerClass(WCReducer.class);
job.setReducerClass(WCReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, wcInputPath);
FileOutputFormat.setOutputPath(job, wcOutputPath);
job.waitForCompletion(true);
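For reference, a minimal sketch of what WCMapper and WCReducer do as described above; this is an illustration, not the repository's code. The imports and the use of FileSplit to recover the document name are assumptions of this sketch.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch; these would be nested static classes inside the TFIDF driver class.
public static class WCMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text wordAtDoc = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Recover the document name from the input split being processed (assumption).
        String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // Emit (token@document, 1) for every token in the line.
            wordAtDoc.set(itr.nextToken() + "@" + doc);
            context.write(wordAtDoc, one);
        }
    }
}

public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s for a given word@document key.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Reusing WCReducer as the combiner is safe here because summing counts is associative and commutative, so partial sums computed on the mapper nodes combine into the same final result.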
Using the same technique, the jobs for the data size and TFIDF value calculations are set up as follows.

Data Size:

Job jobc = Job.getInstance(conf, "data size");
jobc.setJarByClass(TFIDF.class);
jobc.setMapperClass(DSMapper.class);
jobc.setReducerClass(DSReducer.class);
jobc.setOutputKeyClass(Text.class);
jobc.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(jobc, dsInputPath);
FileOutputFormat.setOutputPath(jobc, dsOutputPath);
jobc.waitForCompletion(true);

TFIDF:

Job tfidfJob = Job.getInstance(conf, "tfidf");
tfidfJob.setJarByClass(TFIDF.class);
tfidfJob.setMapperClass(TFIDFMapper.class);
tfidfJob.setReducerClass(TFIDFReducer.class);
tfidfJob.setOutputKeyClass(Text.class);
tfidfJob.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(tfidfJob, tfidfInputPath);
FileOutputFormat.setOutputPath(tfidfJob, tfidfOutputPath);
tfidfJob.waitForCompletion(true);

For word count, a combiner is also provided; it runs after the mapper on each mapper computing node but before the reducer, which helps reduce the workload on the reducer system. The WCReducer class is supplied as the combiner.
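A minimal sketch of DSMapper and DSReducer as described above; this is an illustration, not the repository's code. Keying the map output by document name is inferred from the reducer's per-document total, and the java.util imports come in addition to the Hadoop imports of the previous sketch.

import java.util.ArrayList;
import java.util.List;

public static class DSMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line from the word count job: word@document<TAB>wordCount
        String[] keyAndCount = value.toString().split("\t");
        String[] wordAndDoc = keyAndCount[0].split("@");
        // Re-key by document so the reducer sees all words of one document.
        context.write(new Text(wordAndDoc[1]),
                      new Text(wordAndDoc[0] + "=" + keyAndCount[1]));
    }
}

public static class DSReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // First loop: total number of words in the document.
        int total = 0;
        List<String> cached = new ArrayList<String>();
        for (Text val : values) {
            String[] wordAndCount = val.toString().split("=");
            total += Integer.parseInt(wordAndCount[1]);
            cached.add(val.toString());
        }
        // Second loop: emit (word@document, wordCount/docSize) as strings.
        for (String entry : cached) {
            String[] wordAndCount = entry.split("=");
            context.write(new Text(wordAndCount[0] + "@" + key.toString()),
                          new Text(wordAndCount[1] + "/" + total));
        }
    }
}

The values are cached in a list because Hadoop's value iterator can only be traversed once, which is why the two-loop structure described above is needed.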
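Finally, a minimal sketch of TFIDFMapper and TFIDFReducer. This README does not say where numDocs comes from, so reading it from the job Configuration under a hypothetical "numDocs" property is purely an assumption of this sketch, as is emitting results directly instead of storing them in the map mentioned above.

public static class TFIDFMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line from the data size job: word@document<TAB>wordCount/docSize
        String[] keyAndCounts = value.toString().split("\t");
        String[] wordAndDoc = keyAndCounts[0].split("@");
        // Re-key by word so the reducer sees every document containing it.
        context.write(new Text(wordAndDoc[0]),
                      new Text(wordAndDoc[1] + "=" + keyAndCounts[1]));
    }
}

public static class TFIDFReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Assumption: total document count is passed via the Configuration.
        int numDocs = context.getConfiguration().getInt("numDocs", 1);
        // First loop: cache the values and count documents containing the word.
        List<String> cached = new ArrayList<String>();
        int numDocsWithWord = 0;
        for (Text val : values) {
            cached.add(val.toString());
            numDocsWithWord++;
        }
        // Second loop: TFIDF = (wordCount/docSize) * ln(numDocs/numDocsWithWord).
        for (String entry : cached) {
            String[] docAndCounts = entry.split("=");
            String[] counts = docAndCounts[1].split("/");
            double tf = Double.parseDouble(counts[0]) / Double.parseDouble(counts[1]);
            double tfidf = tf * Math.log((double) numDocs / numDocsWithWord);
            context.write(new Text(docAndCounts[0] + "@" + key.toString()),
                          new Text(Double.toString(tfidf)));
        }
    }
}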