[cloudera@quickstart ~]$ vi test_file
[cloudera@quickstart ~]$ cat test_file
This is a first test file and this will be used for the first mapreduce program.
[cloudera@quickstart ~]$ vi mapper.py
[cloudera@quickstart ~]$ cat mapper.py
#!/usr/bin/env python
import sys

for line in sys.stdin:
    # remove leading and trailing spaces
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
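Note: Hadoop Streaming treats everything up to the first tab character on a mapper output line as the key and the rest as the value, which is why mapper.py prints "word<TAB>1". A quick standalone check of the mapper (the echo text below is only an illustration, not part of the recorded session):

    echo "red fish blue fish" | python mapper.py
    red     1
    fish    1
    blue    1
    fish    1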
[cloudera@quickstart ~]$ vi reducer.py
[cloudera@quickstart ~]$ cat reducer.py
import sys
from operator import itemgetter

current_word = None
current_count = 0
word = None

# input comes from console
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # ignore if the count is not a number
        continue
    # use the sorted map output
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result in console
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# finally output the word to the console
if current_word == word:
    print("%s\t%s" % (current_word, current_count))
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ ls -lrt
total 68
drwxrwxr-x 4 cloudera cloudera 4096 Apr 23 2015 workspace
drwxrwxr-x 2 cloudera cloudera 4096 Apr 23 2015 lib
drwxrwxr-x 4 cloudera cloudera 4096 Apr 23 2015 Documents
drwxrwxr-x 2 cloudera cloudera 4096 Apr 23 2015 Desktop
drwxrwxr-x 2 cloudera cloudera 4096 Apr 23 2015 datasets
-rw-rw-r-- 1 cloudera cloudera 1092 Apr 23 2015 cm_api.sh
-rwxrwxr-x 1 cloudera cloudera 3978 Apr 23 2015 cloudera-manager
drwxr-xr-x 2 cloudera cloudera 4096 Apr 22 00:36 Videos
drwxr-xr-x 2 cloudera cloudera 4096 Apr 22 00:36 Templates
drwxr-xr-x 2 cloudera cloudera 4096 Apr 22 00:36 Public
drwxr-xr-x 2 cloudera cloudera 4096 Apr 22 00:36 Pictures
drwxr-xr-x 2 cloudera cloudera 4096 Apr 22 00:36 Music
drwxr-xr-x 2 cloudera cloudera 4096 Apr 22 00:36 Downloads
drwxrwsr-x 9 cloudera cloudera 4096 Apr 22 01:15 eclipse
-rw-rw-r-- 1 cloudera cloudera 81 Jul 4 04:24 test_file
-rw-rw-r-- 1 cloudera cloudera 186 Jul 4 04:25 mapper.py
-rw-rw-r-- 1 cloudera cloudera 767 Jul 4 04:25 reducer.py
[cloudera@quickstart ~]$ cat test_file | python mapper.py
This 1
is 1
a 1
first 1
test 1
file 1
and 1
this 1
will 1
be 1
used 1
for 1
the 1
first 1
mapreduce 1
program. 1
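The mapper output above is neither sorted nor aggregated; in a real job Hadoop sorts the map output by key before it reaches the reducer, and reducer.py relies on that ordering (it only compares each key with the previous one). To simulate the whole job locally, insert a sort between the two scripts. This pipeline is not part of the recorded session, but on the same test_file it should print each word once with its total count (e.g. "first 2"):

    cat test_file | python mapper.py | sort -k1,1 | python reducer.py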
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera
Found 3 items
-rw-r--r-- 1 cloudera cloudera 81 2021-07-04 03:34 /user/cloudera/test_file
drwxr-xr-x - cloudera cloudera 0 2021-07-04 03:46 /user/cloudera/wc_output01
drwxr-xr-x - cloudera cloudera 0 2021-07-04 04:20 /user/cloudera/wc_output02
[cloudera@quickstart ~]$ hdfs dfs -put test_file /user/cloudera
put: `/user/cloudera/test_file': File exists
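The put fails because test_file was already uploaded earlier (it appears in the HDFS listing above). To replace it, remove the HDFS copy first and upload again; newer Hadoop releases also accept a -f (force overwrite) flag on put, but the commands below (not run in this session) are the version-safe route:

    hdfs dfs -rm /user/cloudera/test_file
    hdfs dfs -put test_file /user/cloudera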
[cloudera@quickstart ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file /home/cloudera/mapper.py /home/cloudera/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /user/cloudera/test_file -output /user/cloudera/wc_output02
21/07/04 04:27:49 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/cloudera/mapper.py, /home/cloudera/reducer.py] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.4.0.jar] /tmp/streamjob5412678000816569877.jar tmpDir=null
21/07/04 04:27:53 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
21/07/04 04:27:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
21/07/04 04:27:55 WARN security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/wc_output02 already exists
21/07/04 04:27:55 WARN security.UserGroupInformation: PriviledgedActionException as:cloudera (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://quickstart.cloudera:8020/user/cloudera/wc_output02 already exists
21/07/04 04:27:55 ERROR streaming.StreamJob: Error Launching job : Output directory hdfs://quickstart.cloudera:8020/user/cloudera/wc_output02 already exists
Streaming Command Failed!
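MapReduce never overwrites an existing output directory, so the job is rejected before it starts. Either point -output at a fresh directory (the next run below does exactly that) or delete the old one first, e.g.:

    hdfs dfs -rm -r /user/cloudera/wc_output02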
[cloudera@quickstart ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file /home/cloudera/mapper.py /home/cloudera/reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /user/cloudera/test_file -output /user/cloudera/wc_output03
21/07/04 04:28:07 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/home/cloudera/mapper.py, /home/cloudera/reducer.py] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.4.0.jar] /tmp/streamjob194540891403990113.jar tmpDir=null
21/07/04 04:28:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
21/07/04 04:28:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
21/07/04 04:28:23 INFO mapred.FileInputFormat: Total input paths to process : 1
21/07/04 04:28:24 INFO mapreduce.JobSubmitter: number of splits:2
21/07/04 04:28:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1625387400110_0003
21/07/04 04:28:27 INFO impl.YarnClientImpl: Submitted application application_1625387400110_0003
21/07/04 04:28:27 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1625387400110_0003/
21/07/04 04:28:27 INFO mapreduce.Job: Running job: job_1625387400110_0003
21/07/04 04:29:03 INFO mapreduce.Job: Job job_1625387400110_0003 running in uber mode : false
21/07/04 04:29:03 INFO mapreduce.Job: map 0% reduce 0%
21/07/04 04:30:18 INFO mapreduce.Job: map 50% reduce 0%
21/07/04 04:30:25 INFO mapreduce.Job: map 100% reduce 0%
21/07/04 04:31:48 INFO mapreduce.Job: map 100% reduce 100%
21/07/04 04:31:55 INFO mapreduce.Job: Job job_1625387400110_0003 completed successfully
21/07/04 04:31:56 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=151
		FILE: Number of bytes written=341671
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=336
		HDFS: Number of bytes written=105
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=158502
		Total time spent by all reduces in occupied slots (ms)=75579
		Total time spent by all map tasks (ms)=158502
		Total time spent by all reduce tasks (ms)=75579
		Total vcore-seconds taken by all map tasks=158502
		Total vcore-seconds taken by all reduce tasks=75579
		Total megabyte-seconds taken by all map tasks=162306048
		Total megabyte-seconds taken by all reduce tasks=77392896
	Map-Reduce Framework
		Map input records=1
		Map output records=16
		Map output bytes=113
		Map output materialized bytes=157
		Input split bytes=214
		Combine input records=0
		Combine output records=0
		Reduce input groups=15
		Reduce shuffle bytes=157
		Reduce input records=16
		Reduce output records=15
		Spilled Records=32
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=19256
		CPU time spent (ms)=10690
		Physical memory (bytes) snapshot=512466944
		Virtual memory (bytes) snapshot=4509511680
		Total committed heap usage (bytes)=283189248
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=122
	File Output Format Counters
		Bytes Written=105
21/07/04 04:31:56 INFO streaming.StreamJob: Output directory: /user/cloudera/wc_output03
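The "-file option is deprecated" warning above suggests the generic -files option instead. An equivalent invocation (not run in this session) would pass both scripts as a comma-separated list and write to a new output directory (wc_output04 is just a placeholder name); note that generic options such as -files must come before the streaming-specific options:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -files /home/cloudera/mapper.py,/home/cloudera/reducer.py \
        -mapper "python mapper.py" -reducer "python reducer.py" \
        -input /user/cloudera/test_file -output /user/cloudera/wc_output04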
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera
Found 4 items
-rw-r--r-- 1 cloudera cloudera 81 2021-07-04 03:34 /user/cloudera/test_file
drwxr-xr-x - cloudera cloudera 0 2021-07-04 03:46 /user/cloudera/wc_output01
drwxr-xr-x - cloudera cloudera 0 2021-07-04 04:20 /user/cloudera/wc_output02
drwxr-xr-x - cloudera cloudera 0 2021-07-04 04:31 /user/cloudera/wc_output03
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/wc_output03
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2021-07-04 04:31 /user/cloudera/wc_output03/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 105 2021-07-04 04:31 /user/cloudera/wc_output03/part-00000
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/wc_output03/part*
-rw-r--r-- 1 cloudera cloudera 105 2021-07-04 04:31 /user/cloudera/wc_output03/part-00000
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/wc_output03/part*
This 1
a 1
and 1
be 1
file 1
first 2
for 1
is 1
mapreduce 1
program. 1
test 1
the 1
this 1
used 1
will 1
[cloudera@quickstart ~]$
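To keep a local copy of the result, the reducer output can be pulled out of HDFS; -getmerge concatenates all part files in the output directory into a single local file (wc_output03_local.txt is just an example name, not part of the recorded session):

    hdfs dfs -getmerge /user/cloudera/wc_output03 wc_output03_local.txt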