agrippa/hadoopcl
NOTE: This repository is maintained purely for reference (and sentimental value). It is a very old fork of a very old Hadoop version and is likely no longer usable. If you are interested in running data analytics applications on GPUs, I recommend checking out my other work at: https://github.com/agrippa/spark-swat.

===========================================================================
============ Old notes below, not guaranteed to be up-to-date =============
===========================================================================

HadoopCL: Automatic native execution of Hadoop computation on OpenCL devices

author: Max Grossman, [email protected]
sponsors: Rice University, AMD

Most of the core logic for HadoopCL is located inside the org.apache.hadoop.mapreduce package. This includes static code as well as code generated dynamically by several Python scripts.

The control flow of a HadoopCL job diverges from that of a normal Hadoop job when the mapper or reducer run() method is called. HadoopCL replaces the Hadoop Mapper/Reducer classes with OpenCLMapper and OpenCLReducer. These classes perform some basic type checking on the inputs to the mapper or reducer task and then pass control to the OpenCLDriver.

The OpenCLDriver allows HadoopCL mapper and reducer computation to share large amounts of code. This is possible because mappers and reducers execute in broadly the same way: both loop over their input keys and values. The only real difference is that a mapper processes a single value per key, while a reducer processes a list of values per key.

At a high level, OpenCLDriver performs some setup of the HadoopCL environment (such as selecting the execution mode to use, deciding how much data to buffer at a time, and launching worker threads) and then loops through the inputs.
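The shared driver loop described above can be pictured with a short sketch. This is not the real HadoopCL source: TaskContext is a stand-in for the Hadoop context, and all names, fields, and signatures here are assumptions made for illustration. It shows the two execution paths, with the javaRun() fallback taken very early.

```java
// Minimal stand-in for the Hadoop task context (assumed, not the real API).
interface TaskContext<K, V> {
    boolean nextKeyValue();
    K currentKey();
    V currentValue();
}

// Sketch of the shared driver: one loop serves both mappers and reducers.
class OpenCLDriverSketch<K, V> {
    private final TaskContext<K, V> ctx;
    private final boolean useOpenCL;
    int buffered = 0;         // inputs handed off for OpenCL buffering
    int processedInJava = 0;  // inputs handled by the Java fallback

    OpenCLDriverSketch(TaskContext<K, V> ctx, boolean useOpenCL) {
        this.ctx = ctx;
        this.useOpenCL = useOpenCL;
    }

    void run() {
        if (!useOpenCL) {     // Java path, taken very early in run()
            javaRun();
            return;
        }
        while (ctx.nextKeyValue()) {
            // the real driver aggregates (key, value) pairs into a
            // HadoopCLBuffer here; this sketch just counts them
            buffered++;
        }
    }

    private void javaRun() {
        while (ctx.nextKeyValue()) {
            processedInJava++;  // real code would invoke the user logic
        }
    }
}

// Simple array-backed context for demonstration.
class ArrayContext implements TaskContext<Integer, Integer> {
    private final int[] values;
    private int i = -1;
    ArrayContext(int[] values) { this.values = values; }
    public boolean nextKeyValue() { return ++i < values.length; }
    public Integer currentKey() { return i; }
    public Integer currentValue() { return values[i]; }
}
```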
It hands the actual buffering and processing of these inputs down to HadoopCLBuffer and HadoopCLKernel objects.

One quick but important thing to note about OpenCLDriver is that there are two main paths a mapper or reducer task can take: it can be designated either for OpenCL processing or for plain Java processing. You can see the call for Java processing, javaRun(), very early in the run() method of OpenCLDriver.

HadoopCLBuffer and HadoopCLKernel objects represent the data storage and data processing elements of HadoopCL. A HadoopCLBuffer aggregates inputs from the input Hadoop context and stores them in member arrays. When those arrays are full, a HadoopCLKernel object is initialized from their contents and launches the user-defined computation on the buffered inputs. When that computation is finished, the outputs stored in the HadoopCLBuffer object are dumped back into Hadoop.

This buffering -> processing -> dumping pipeline is implemented by a dedicated thread for each stage. The main thread in OpenCLDriver performs the buffering into HadoopCLBuffer objects. The logic for running HadoopCLKernel computation on HadoopCLBuffer contents is contained in ToOpenCLThread. The logic for writing outputs back to Hadoop is contained in ToHadoopThread. You can picture this pipeline as threads passing HadoopCLBuffer objects off to each other.

You'll notice there is some complicated logic for detecting termination, because I added support for intermediate reduction in HadoopCL before realizing it was trivial to just support Combiners. This should probably all be taken out at some point.

Once you get below the HadoopCLKernel and HadoopCLBuffer layer of abstraction, we start getting into computation- and type-specific classes. The HadoopCLKernel class is extended by HadoopCLMapperKernel and HadoopCLReducerKernel for mapper- and reducer-specific operations.
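The three-stage pipeline of threads handing buffer objects to each other can be sketched with blocking queues. This is an assumed structure for illustration, not the actual ToOpenCLThread/ToHadoopThread code: the Buffer class, the doubling "kernel", and the poison-pill termination marker are all stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Stand-in for a HadoopCLBuffer: one batch of buffered input values.
class Buffer {
    final int[] data;
    int[] results;
    Buffer(int[] data) { this.data = data; }
}

class PipelineSketch {
    static final Buffer POISON = new Buffer(new int[0]); // termination marker

    static List<Integer> run(List<int[]> batches) {
        BlockingQueue<Buffer> toKernel = new LinkedBlockingQueue<>();
        BlockingQueue<Buffer> toHadoop = new LinkedBlockingQueue<>();
        List<Integer> output = new ArrayList<>();

        // plays the role of ToOpenCLThread: run the kernel on full buffers
        Thread kernelThread = new Thread(() -> {
            try {
                Buffer b;
                while ((b = toKernel.take()) != POISON) {
                    b.results = new int[b.data.length];
                    for (int i = 0; i < b.data.length; i++)
                        b.results[i] = b.data[i] * 2; // placeholder "kernel"
                    toHadoop.put(b);
                }
                toHadoop.put(POISON); // propagate shutdown downstream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // plays the role of ToHadoopThread: dump processed buffers back out
        Thread writerThread = new Thread(() -> {
            try {
                Buffer b;
                while ((b = toHadoop.take()) != POISON) {
                    for (int r : b.results) output.add(r);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        kernelThread.start();
        writerThread.start();
        try {
            // the main thread plays the role of OpenCLDriver: it fills buffers
            for (int[] batch : batches) toKernel.put(new Buffer(batch));
            toKernel.put(POISON);
            kernelThread.join();
            writerThread.join();
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return output;
    }
}
```

Because each stage is a single thread reading from a FIFO queue, output order matches input order, and the poison pill flowing through both queues gives a simple picture of the (more complicated) termination detection mentioned above.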
The HadoopCLBuffer class is extended by HadoopCLMapperBuffer and HadoopCLReducerBuffer for mapper- and reducer-specific buffering and output.

Extending these mapper- and reducer-specific classes are type-specific classes, all of which are auto-generated. An example of a type-specific mapper kernel class is IntFloatIntFloatHadoopCLMapperKernel. This class implements everything except the user-supplied logic for a mapper that takes an (int, float) key-value pair as input and outputs (int, float) key-value pairs.

The kernel and buffer classes are all auto-generated by AutoGenerateKernel.py, which is called from GenerateAll.py. GenerateAll.py simply iterates over the supported mapper and reducer types in SupportedMR and generates the necessary buffer and kernel classes for each. The AutoGenerateKernel.py code is very ugly at the moment, so rather than trying to figure it out, I would recommend just running:

    python GenerateAll.py

and taking a look at the generated files in mapred/org/apache/hadoop/mapreduce.

One added feature of HadoopCL is globals. There's no real need to go into much detail on these at the moment, as they aren't central to HadoopCL, just some useful functionality. Basically, they allow the user to specify globally visible, constant sparse vectors which are assigned a unique integer ID and can be accessed from HadoopCL kernels. Anywhere you see the word global in a variable name, that's probably what it's referring to.

Anywhere you see a class called HadoopCLResizable*Array, that's an auto-generated class implementing a dynamically sized Java array. These classes allow buffering of inputs whose size we may not know in advance, while also allowing efficient transfer to OpenCL devices because each object is backed by a primitive array. They are located under core/org/apache/hadoop/io and are auto-generated by AutoGenerateArray.py.
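The idea behind the HadoopCLResizable*Array classes can be sketched as follows. This is not the generated code itself, just a minimal illustration of the technique: growth policy, initial capacity, and method names are all assumptions.

```java
import java.util.Arrays;

// A dynamically sized array backed by a primitive int[], so it can grow
// while buffering inputs of unknown size yet still expose a flat primitive
// array suitable for efficient bulk transfer to an OpenCL device.
class ResizableIntArraySketch {
    private int[] backing = new int[8]; // assumed initial capacity
    private int size = 0;

    void add(int v) {
        if (size == backing.length)     // grow by doubling when full
            backing = Arrays.copyOf(backing, backing.length * 2);
        backing[size++] = v;
    }

    int size() { return size; }

    int get(int i) { return backing[i]; }

    // The backing primitive array (plus the logical size) is what would be
    // handed to the OpenCL runtime; note it may be larger than size().
    int[] backingArray() { return backing; }
}
```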