How to import data
There are three ways to import data:
- Write a custom Hadoop import job
- Import data with the command line interface
- Import data with the generic Hadoop import job
I will explain only points 2 and 3 in this section. Writing a custom Hadoop job is described in another section.
Go to: http://repository-comsysto.forge.cloudbees.com/release/org/jumbodb/database/
Download: jumbodb-<version>.zip
This bundle contains the database, a CLI client, and a Hadoop import job. The CLI ships with a complete embedded Hadoop.
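On a Unix-like system, downloading and unpacking might look like this (a sketch; replace <version> with the actual release version, and note that the repository may place the file under a version subdirectory):
# hypothetical shell session; substitute the real version number
wget http://repository-comsysto.forge.cloudbees.com/release/org/jumbodb/database/jumbodb-<version>.zip
unzip jumbodb-<version>.zip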
Both need the same JSON import configuration file, which could look like this:
{
  "description": "Twitter Test Import",
  "deliveryChunkKey": "first_delivery",
  "hosts": [
    {"host": "localhost", "port": 12001}
  ],
  "datePattern": "EEE MMM dd HH:mm:ss Z yyyy",
  "output": "/my/absolute/output/path/output",
  "numberOfDataFiles": 10,
  "hadoop": [],
  "checksum": "NONE",
  "importCollection": [
    {
      "input": "/my/absolute/twitter/input/path",
      "collectionName": "twitter",
      "sort": ["created_at"],
      "sortType": "DATETIME",
      "dataStrategy": "JSON_SNAPPY_V1",
      "sortDatePattern": "EEE MMM dd HH:mm:ss Z yyyy",
      "indexes": [
        {
          "indexName": "screen_name",
          "fields": ["user.screen_name"],
          "indexStrategy": "HASHCODE32_SNAPPY_V1"
        },
        {
          "indexName": "created_at",
          "fields": ["created_at"],
          "indexStrategy": "DATETIME_SNAPPY_V1"
        },
        {
          "indexName": "user_created_at",
          "fields": ["user.created_at"],
          "indexStrategy": "DATETIME_SNAPPY_V1"
        },
        {
          "indexName": "coordinates",
          "fields": ["geo.coordinates"],
          "indexStrategy": "GEOHASH_SNAPPY_V1"
        },
        {
          "indexName": "user_followers_count",
          "fields": ["user.followers_count"],
          "indexStrategy": "INTEGER_SNAPPY_V1"
        },
        {
          "indexName": "user_statuses_count",
          "fields": ["user.statuses_count"],
          "indexStrategy": "INTEGER_SNAPPY_V1"
        }
      ]
    }
  ]
}
The example above should work with Twitter data; you only have to adjust the input and output paths.
Paths for input and output can also contain filesystem prefixes such as:
file://
hdfs://
maprfs://
Nested field names are referenced using dot notation.
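For example, user.screen_name in the configuration above points to the screen_name attribute nested inside the user object of each tweet (a trimmed tweet fragment for illustration):
{
  "created_at": "Mon Sep 24 03:35:21 +0000 2012",
  "user": {
    "screen_name": "jumbodb",
    "followers_count": 42
  }
}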
The hadoop field allows you to override specific Hadoop configuration settings.
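The entry format for hadoop is not shown in this example. As a rough, unverified sketch (assuming each entry names a standard Hadoop property and its value), an override might look like:
"hadoop": [
  {"key": "mapred.reduce.tasks", "value": "10"}
]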
The index strategies are described here: https://github.com/comsysto/jumbodb/wiki/Storage-and-Index-Format
How to run the CLI?
After downloading and unzipping the jumbodb-<version>.zip file, run:
jumbodb-<version>/tools/import-cli/bin/jumboimport /path/to/my/import/twitter_import_config.json
# see json file above
How to run it on Hadoop?
After downloading and unzipping the jumbodb-<version>.zip file, copy the following file to your Hadoop cluster (everything is included in the jar):
jumbodb-<version>/tools/import-hadoop/hadoop-json-<version>.jar
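For example, assuming SSH access to a cluster edge node (hostname and target path are placeholders):
scp jumbodb-<version>/tools/import-hadoop/hadoop-json-<version>.jar user@edge-node:/path/to/my/jar/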
hadoop jar /path/to/my/jar/hadoop-json-<version>.jar hdfs:///twitter/twitter_import_config.json
# see json file above
The configuration file must be available on a distributed file system so that every instance can access this file.
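For example, to place the configuration file on HDFS at the path used above:
hadoop fs -put /path/to/my/import/twitter_import_config.json hdfs:///twitter/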
After importing data
You can query the data by using the Java Driver or the Browse Web-Interface: http://localhost:9000. You should also see the new delivery under delivery.