How to import data

There are three ways to import data:

  1. Write a custom Hadoop import job
  2. Import data with the command line interface
  3. Import data with the generic Hadoop import job

This section covers only points 2 and 3. Writing a custom Hadoop import job is described in another section.

Go to: http://repository-comsysto.forge.cloudbees.com/release/org/jumbodb/database/

Download: jumbodb-<version>.zip

This bundle contains the database, a CLI client and the Hadoop import job. The CLI ships with a fully embedded Hadoop.
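
For example, on a Unix-like system you can unpack the bundle like this (keeping the version placeholder from the file name):

unzip jumbodb-<version>.zip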

Both need the same JSON import configuration file, which could look like this:

{
    "description": "Twitter Test Import",
    "deliveryChunkKey": "first_delivery",
    "hosts": [
        {"host": "localhost", "port": 12001}
    ],
    "datePattern": "EEE MMM dd HH:mm:ss Z yyyy",
    "output": "/my/absolute/output/path/output",
    "numberOfDataFiles": 10,
    "hadoop": []
    "checksum": "NONE",
    "importCollection": [
        {
            "input": "/my/absolute/twitter/input/path",
            "collectionName": "twitter",
            "sort": ["created_at"],
            "sortType": "DATETIME",
            "dataStrategy": "JSON_SNAPPY_V1",
            "sortDatePattern": "EEE MMM dd HH:mm:ss Z yyyy",
            "indexes": [
                {
                    "indexName": "screen_name",
                    "fields": ["user.screen_name"],
                    "indexStrategy": "HASHCODE32_SNAPPY_V1"
                },
                {
                    "indexName": "created_at",
                    "fields": ["created_at"],
                    "indexStrategy": "DATETIME_SNAPPY_V1"
                },
                {
                    "indexName": "user_created_at",
                    "fields": ["user.created_at"],
                    "indexStrategy": "DATETIME_SNAPPY_V1"
                },
                {
                    "indexName": "coordinates",
                    "fields": ["geo.coordinates"],
                    "indexStrategy": "GEOHASH_SNAPPY_V1"
                },
                {
                    "indexName": "user_followers_count",
                    "fields": ["user.followers_count"],
                    "indexStrategy": "INTEGER_SNAPPY_V1"
                },
                {
                    "indexName": "user_statuses_count",
                    "fields": ["user.statuses_count"],
                    "indexStrategy": "INTEGER_SNAPPY_V1"
                }
            ]
        }
    ]
}

The example above should work with Twitter data; you only have to adjust the input and output paths.

Paths for input and output can also contain file system prefixes (see the snippet after this list):

  • file://
  • hdfs://
  • maprfs://
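
For example, to read the input from HDFS and write the output back to HDFS, the two path entries in the configuration above would become (the paths themselves are placeholders):

"output": "hdfs:///my/absolute/output/path/output"
"input": "hdfs:///my/absolute/twitter/input/path"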

Nested field names are referenced using dot notation, as in the example below. The hadoop field allows you to override specific Hadoop configuration settings.
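
For example, given a Twitter record like the following (all values are hypothetical), the index field user.screen_name addresses the nested screen_name value inside the user object:

{
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "user": {
        "screen_name": "jumbodb",
        "followers_count": 42,
        "statuses_count": 1337,
        "created_at": "Tue Mar 10 08:12:45 +0000 2009"
    },
    "geo": {
        "coordinates": [48.13, 11.58]
    }
}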

The index strategies are described here: https://github.com/comsysto/jumbodb/wiki/Storage-and-Index-Format

How to run the CLI?

After downloading and unzipping the jumbodb-<version>.zip file, run:

jumbodb-<version>/tools/import-cli/bin/jumboimport /path/to/my/import/twitter_import_config.json 
# see json file above

How to run it on Hadoop?

After downloading and unzipping the jumbodb-<version>.zip file, copy the following file to your Hadoop cluster (everything is included in the jar):

jumbodb-<version>/tools/import-hadoop/hadoop-json-<version>.jar

hadoop jar /path/to/my/jar/hadoop-json-<version>.jar hdfs:///twitter/twitter_import_config.json
# see json file above

The configuration file must be available on a distributed file system so that every instance can access this file.
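
If the configuration file is still on your local machine, you can copy it onto HDFS with the standard Hadoop shell first (the target path matches the example command above; adjust it to your setup):

hadoop fs -put /path/to/my/import/twitter_import_config.json hdfs:///twitter/twitter_import_config.json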

After importing data

You can query the data using the Java driver or the Browse web interface at http://localhost:9000. You should also see the new delivery under Delivery.