agrippa/hadoopcl
NOTE: This repository is maintained purely for reference (and sentimental value). It is a very old fork of a very old Hadoop version and is likely no longer usable. If you are interested in running data analytics applications on GPUs, I recommend checking out my other work at: https://github.com/agrippa/spark-swat.

===========================================================================
============ Old notes below, not guaranteed to be up-to-date =============
===========================================================================

HadoopCL: Automatic native execution of Hadoop computation on OpenCL devices

author: Max Grossman, [email protected]
sponsors: Rice University, AMD

Most of the core logic for HadoopCL is located inside the org.apache.hadoop.mapreduce package. This includes static code as well as code generated dynamically by several Python scripts.

The control flow of a HadoopCL job diverges from that of a normal Hadoop job when the mapper or reducer run() method is called. HadoopCL replaces the Hadoop Mapper/Reducer classes with OpenCLMapper and OpenCLReducer. These classes perform some basic type checking on the inputs to the mapper or reducer task and then pass control to the OpenCLDriver.

The OpenCLDriver allows HadoopCL mapper and reducer computation to share large amounts of code. This is possible because mappers and reducers execute in broadly the same way: both loop over their input keys and values. The only real difference is that a mapper processes a single value per key, while a reducer processes a list of values per key.

At a high level, OpenCLDriver performs some setup of the HadoopCL environment (such as selecting the execution mode to use, deciding how much data to buffer at a time, and launching worker threads) and then loops through the inputs.
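The shared driver loop described above can be pictured with a short sketch. This is not the real HadoopCL source: TaskContext is a stand-in for the Hadoop context, and all names, fields, and signatures here are assumptions made for illustration. It shows the two execution paths, with the javaRun() fallback taken very early.

```java
// Minimal stand-in for the Hadoop task context (assumed, not the real API).
interface TaskContext<K, V> {
    boolean nextKeyValue();
    K currentKey();
    V currentValue();
}

// Sketch of the shared driver: one loop serves both mappers and reducers.
class OpenCLDriverSketch<K, V> {
    private final TaskContext<K, V> ctx;
    private final boolean useOpenCL;
    int buffered = 0;         // inputs handed off for OpenCL buffering
    int processedInJava = 0;  // inputs handled by the Java fallback

    OpenCLDriverSketch(TaskContext<K, V> ctx, boolean useOpenCL) {
        this.ctx = ctx;
        this.useOpenCL = useOpenCL;
    }

    void run() {
        if (!useOpenCL) {     // Java path, taken very early in run()
            javaRun();
            return;
        }
        while (ctx.nextKeyValue()) {
            // the real driver aggregates (key, value) pairs into a
            // HadoopCLBuffer here; this sketch just counts them
            buffered++;
        }
    }

    private void javaRun() {
        while (ctx.nextKeyValue()) {
            processedInJava++;  // real code would invoke the user logic
        }
    }
}

// Simple array-backed context for demonstration.
class ArrayContext implements TaskContext<Integer, Integer> {
    private final int[] values;
    private int i = -1;
    ArrayContext(int[] values) { this.values = values; }
    public boolean nextKeyValue() { return ++i < values.length; }
    public Integer currentKey() { return i; }
    public Integer currentValue() { return values[i]; }
}
```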
It hands the actual buffering and processing of these inputs down to HadoopCLBuffer and HadoopCLKernel objects.

One quick but important thing to note about OpenCLDriver is that there are two main paths a mapper or reducer task can take: it can be designated either for OpenCL processing or for plain Java processing. You can see the call for Java processing, javaRun(), very early in the run() method of OpenCLDriver.

HadoopCLBuffer and HadoopCLKernel objects represent the data storage and data processing elements of HadoopCL. A HadoopCLBuffer aggregates inputs from the input Hadoop context and stores them in member arrays. When those arrays are full, a HadoopCLKernel object is initialized from their contents and launches the user-defined computation on the buffered inputs. When that computation is finished, the outputs stored in the HadoopCLBuffer object are dumped back into Hadoop.

This buffering -> processing -> dumping pipeline is implemented by a dedicated thread for each stage. The main thread in OpenCLDriver performs the buffering into HadoopCLBuffer objects. The logic for running HadoopCLKernel computation on HadoopCLBuffer contents is contained in ToOpenCLThread. The logic for writing outputs back to Hadoop is contained in ToHadoopThread. You can picture this pipeline as threads passing HadoopCLBuffer objects off to each other.

You'll notice there is some complicated logic for detecting termination, because I added support for intermediate reduction in HadoopCL before realizing it was trivial to just support Combiners. This should probably all be taken out at some point.

Once you get below the HadoopCLKernel and HadoopCLBuffer layer of abstraction, we start getting into computation- and type-specific classes. The HadoopCLKernel class is extended by HadoopCLMapperKernel and HadoopCLReducerKernel for mapper- and reducer-specific operations.
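The three-stage pipeline of threads handing buffer objects to each other can be sketched with blocking queues. This is an assumed structure for illustration, not the actual ToOpenCLThread/ToHadoopThread code: the Buffer class, the doubling "kernel", and the poison-pill termination marker are all stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Stand-in for a HadoopCLBuffer: one batch of buffered input values.
class Buffer {
    final int[] data;
    int[] results;
    Buffer(int[] data) { this.data = data; }
}

class PipelineSketch {
    static final Buffer POISON = new Buffer(new int[0]); // termination marker

    static List<Integer> run(List<int[]> batches) {
        BlockingQueue<Buffer> toKernel = new LinkedBlockingQueue<>();
        BlockingQueue<Buffer> toHadoop = new LinkedBlockingQueue<>();
        List<Integer> output = new ArrayList<>();

        // plays the role of ToOpenCLThread: run the kernel on full buffers
        Thread kernelThread = new Thread(() -> {
            try {
                Buffer b;
                while ((b = toKernel.take()) != POISON) {
                    b.results = new int[b.data.length];
                    for (int i = 0; i < b.data.length; i++)
                        b.results[i] = b.data[i] * 2; // placeholder "kernel"
                    toHadoop.put(b);
                }
                toHadoop.put(POISON); // propagate shutdown downstream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // plays the role of ToHadoopThread: dump processed buffers back out
        Thread writerThread = new Thread(() -> {
            try {
                Buffer b;
                while ((b = toHadoop.take()) != POISON) {
                    for (int r : b.results) output.add(r);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        kernelThread.start();
        writerThread.start();
        try {
            // the main thread plays the role of OpenCLDriver: it fills buffers
            for (int[] batch : batches) toKernel.put(new Buffer(batch));
            toKernel.put(POISON);
            kernelThread.join();
            writerThread.join();
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return output;
    }
}
```

Because each stage is a single thread reading from a FIFO queue, output order matches input order, and the poison pill flowing through both queues gives a simple picture of the (more complicated) termination detection mentioned above.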
The HadoopCLBuffer class is extended by HadoopCLMapperBuffer and HadoopCLReducerBuffer for mapper- and reducer-specific buffering and output.

Extending these mapper- and reducer-specific classes are type-specific classes, all of which are auto-generated. An example of a type-specific mapper kernel class is IntFloatIntFloatHadoopCLMapperKernel. This class implements everything except the user-supplied logic for a mapper that takes an (int, float) key-value pair as input and outputs (int, float) key-value pairs.

The kernel and buffer classes are all auto-generated by AutoGenerateKernel.py, which is called from GenerateAll.py. GenerateAll.py simply iterates over the supported mapper and reducer types in SupportedMR and generates the necessary buffer and kernel classes for each. The AutoGenerateKernel.py code is very ugly at the moment, so rather than trying to figure it out, I would recommend just running:

    python GenerateAll.py

and taking a look at the generated files in mapred/org/apache/hadoop/mapreduce.

One added feature of HadoopCL is globals. There's no real need to go into much detail on these at the moment, as they aren't central to HadoopCL, just some useful functionality. Basically, they allow the user to specify globally visible, constant sparse vectors which are assigned a unique integer ID and can be accessed from HadoopCL kernels. Anywhere you see the word global in a variable name, that's probably what it's referring to.

Anywhere you see a class called HadoopCLResizable*Array, that's an auto-generated class implementing a dynamically sized Java array. These classes allow buffering of inputs whose size we may not know in advance, while also allowing efficient transfer to OpenCL devices because each object is backed by a primitive array. They are located under core/org/apache/hadoop/io and are auto-generated by AutoGenerateArray.py.
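The idea behind the HadoopCLResizable*Array classes can be sketched as follows. This is not the generated code itself, just a minimal illustration of the technique: growth policy, initial capacity, and method names are all assumptions.

```java
import java.util.Arrays;

// A dynamically sized array backed by a primitive int[], so it can grow
// while buffering inputs of unknown size yet still expose a flat primitive
// array suitable for efficient bulk transfer to an OpenCL device.
class ResizableIntArraySketch {
    private int[] backing = new int[8]; // assumed initial capacity
    private int size = 0;

    void add(int v) {
        if (size == backing.length)     // grow by doubling when full
            backing = Arrays.copyOf(backing, backing.length * 2);
        backing[size++] = v;
    }

    int size() { return size; }

    int get(int i) { return backing[i]; }

    // The backing primitive array (plus the logical size) is what would be
    // handed to the OpenCL runtime; note it may be larger than size().
    int[] backingArray() { return backing; }
}
```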