Data Delivery Concept
Immutable data has the benefit that the database never has to reorganize it. Instead, the data can be pre-organized for its use case: indexes can be calculated and data sorted in a parallelized environment like Hadoop. Organizing data generally requires different resources than reading it. For instance, if one index is used by 80% of all queries, you should sort the data by that index. This allows the database to read multiple datasets with fewer read operations. Avoiding writes on single datasets brings great benefits for reading. In big data environments the data usually does not change, and single datasets do not get updated. Here are some examples:
- Log files: written once, the data never changes afterwards
- Sensor data: collected and possibly pre-aggregated, but no single dataset changes afterwards
- Telecommunication data
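The pre-sorting idea above can be sketched in a few lines. This is a minimal illustration, not JumboDB code: the record fields and values are hypothetical, and Python's `bisect` stands in for whatever index lookup the database actually performs. The point is that once immutable data is sorted by the hot key at import time, all matches for that key form one contiguous slice.

```python
from bisect import bisect_left, bisect_right

# Hypothetical log records, keyed by the field that serves 80% of queries.
records = [
    {"user": "carol", "action": "login"},
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "logout"},
    {"user": "alice", "action": "logout"},
]

# Pre-organize once at import time: sort by the hot key.
records.sort(key=lambda r: r["user"])
keys = [r["user"] for r in records]

# At query time, two binary searches yield a contiguous slice,
# so all matches are read in a single sequential scan.
lo, hi = bisect_left(keys, "alice"), bisect_right(keys, "alice")
matches = records[lo:hi]
```

Because the data never changes after import, this sort is a one-time cost, which is exactly why immutability makes such pre-organization worthwhile.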
The data used in JumboDB is immutable. In big data pipelines, data is often pre-aggregated, calculated, corrected and filtered, so mistakes can occur. Such mistakes affect all datasets in a delivery, not only a single dataset. Therefore you replace a whole delivery by recalculating everything, rather than updating a single entry. Furthermore, a delivery can contain multiple different datasets which only work together. All datasets become available once the import has finished successfully. This has the advantage that you always work on a consistent and complete state of the data.
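The all-or-nothing visibility of a delivery can be sketched as follows. This is an assumed in-memory model (the class and method names are illustrative, not JumboDB's API); it only shows the principle that readers never see a half-imported delivery, because the switch to the new data happens in one step after the import succeeds.

```python
# Minimal sketch under assumed names: a delivery is imported as a whole
# and only becomes visible once the import has finished successfully.
class DeliveryStore:
    def __init__(self):
        self.active = None      # the delivery that queries currently see
        self._staging = None    # import in progress, invisible to readers

    def begin_import(self):
        self._staging = []

    def add(self, dataset):
        # Datasets accumulate in staging; readers are unaffected.
        self._staging.append(dataset)

    def commit(self):
        # Single switch: all datasets of the delivery appear together.
        self.active, self._staging = self._staging, None

store = DeliveryStore()
store.begin_import()
store.add({"collection": "logs", "rows": 1000})
before_commit = store.active    # still None: nothing visible mid-import
store.commit()
```

If the import fails, staging is simply discarded and the previously active delivery stays in place, which is what keeps queries on consistent data at all times.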
The concept has three parts: Collections, Delivery Chunk Key and Delivery Version.
A collection is comparable to a table name in a relational database: the data gets queried by this name, as in SQL. The same notion is used in MongoDB. Additionally, a collection is part of a delivery, which carries a delivery chunk key and a delivery version, so every collection contains delivery chunks and versions.
The delivery chunk key is a user-defined group name for a delivery. A delivery replaces earlier deliveries with the same chunk key. A delivery chunk contains several versions, of which only one can be active. For example, the key could be "january" for a delivery containing log file aggregations for January. If that delivery contains an error, you redeliver it under the same chunk key. If you have new data, e.g. for "february", you extend the data under that different key. When you query the collection, the data is searched in the delivery chunks "january" and "february", each through its currently active version.
The delivery version is a unique identifier generated by the system. Different collections can share the same chunk key and version, which makes it possible to group multiple collections into the same delivery. A delivery version cannot exist in two different chunk keys, because a delivery has exactly one chunk key and the version is unique per import. A version replaces another version, which means multiple versions can exist, but only one version per delivery chunk can be active.
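The chunk key and version rules described above can be summarized in a small data model. This is a hypothetical in-memory sketch (the chunk names, version ids and payload strings are made up); it shows how a query resolves to exactly one active version per chunk, and how a redelivered version replaces a faulty one within its chunk.

```python
# Assumed model of the delivery concept: each chunk key ("january",
# "february", ...) holds several versions, exactly one of them active.
chunks = {
    "january":  {"versions": {"v1": ["jan-data-faulty"],
                              "v2": ["jan-data-fixed"]},
                 "active": "v2"},    # v2 replaced the faulty v1
    "february": {"versions": {"v3": ["feb-data"]},
                 "active": "v3"},
}

def query(chunks):
    """A query over the collection scans the active version of every chunk."""
    results = []
    for chunk in chunks.values():
        results.extend(chunk["versions"][chunk["active"]])
    return results
```

Note that v1 still exists in the "january" chunk but is never read; activating v2 was enough to make the corrected delivery take effect.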
There are two different perspectives on the data: the 'delivery oriented view' and the 'collection oriented view'. In both views the data is organized the same way.