////Very interesting; but, give an example (one, or two) of how this extends to equivalent real-world examples. Amy////
The world is a big place, but we don’t use all of it the same. Most of the world is water. Lots of it is Siberia. Half the tiles at zoom level 2 have only a few thousand inhabitants[1].
Suppose you wanted to store a "what country am I in" dataset — a geo-joinable decomposition of the region boundaries of every country. You’ll immediately note that Monaco fits easily within on one zoom-level 12 quadtile; Russia spans two zoom-level 1 quadtiles. Without multiscaling, to cover the globe at 1-km scale and 64-kB records would take 70 terabytes — and 1-km is not all that satisfactory. Huge parts of the world would be taken up by grid cells holding no border that simply said "Yep, still in Russia".
There’s a simple modification of the grid system that lets us very naturally describe multiscale data.
The figures (REF: multiscale images) show the quadtiles covering Japan at ZL=7. For reasons you’ll see in a bit, we will split everything up to at least that zoom level; we’ll show the further decomposition down to ZL=9.
Already six of the 16 tiles shown don’t have any land coverage, so you can record their values:
1330000xx { Pacific Ocean } 1330011xx { Pacific Ocean } 1330013xx { Pacific Ocean } 1330031xx { Pacific Ocean } 1330033xx { Pacific Ocean } 1330032xx { Pacific Ocean }
Pad out each of the keys with `x’s to meet our lower limit of ZL=9.
The quadkey 1330011xx
means "I carry the information for grids 133001100
, 133001101
, 133001110
, 133001111
, ".
You should uniformly decompose everything to some upper zoom level so that if you join on something uniformly distributed across the globe you don’t have cripplingly large skew in data size sent to each partition. A zoom level of 7 implies 16,000 tiles — a small quantity given the exponential growth of tile sizes
With the upper range as your partition key, and the whole quadkey is the sort key, you can now do joins. In the reducer,
-
read keys on each side until one key is equal to or a prefix of the other.
-
emit combined record using the more specific of the two keys
-
read the next record from the more-specific column, until there’s no overlap
Take each grid cell; if it needs subfeatures, divide it else emit directly.
You must emit high-level grid cells with the lsb filled with XX or something that sorts after a normal cell; this means that to find the value for a point,
-
Find the corresponding tile ID,
-
Index into the table to find the first tile whose ID is larger than the given one.
00.00.00 00.00.01 00.00.10 00.00.11 00.01.-- 00.10.-- 00.11.00 00.11.01 00.11.10 00.11.11 01.--.-- 10.00.-- 10.01.-- 10.10.01 10.10.10 10.10.11 10.10.00 10.11.--
You can look at quadtiles is as a tree structure. Each branch splits the plane exactly in half by area, and only leaf nodes hold data.
The first quadtile scheme required we develop every branch of the tree to the same depth. The multiscale quadtile scheme effectively says "hey, let’s only expand each branch to its required depth". Our rule to break up a quadtile if any section of it needs development preserves the "only leaf nodes hold data". Breaking tiles always exactly in two makes it easy to assign features to their quadtile and facilitates joins betweeen datasets that have never met. There are other ways to make these tradeoffs, though — read about K-D trees in the "keep exploring" section at end of chapter.