Skip to content
salivian edited this page Sep 13, 2014 · 41 revisions

Further Details on Nanocubes

These notes are for versions >=2.0

Note: current nanocube server implementation only works on 64-bit architectures

Suppose we want to create a nanocube for the following table in csv format:

latitude,longitude,time,device,make
44.00124456,-73.74023438,2013-02-21T12:49,android,Samsung
42.0090331,-74.79492188,2013-04-11T13:58,iphone,Apple
45.68366959,-94.04296875,2013-02-28T17:33,iphone,Apple
37.97076194,-85.69335938,2013-04-17T05:04,android,LG
...

Imagine for instance that this table stores check-in reports of a social website like foursquare or the defunct brightkite. In addition to location and time we have a categorical variable indicating which device was used to report the check-in. The goal we have in mind is to create a nanocube back-end for this dataset that will respond (at interactive rates) the queries needed on a front-end that has pannable and zoomable map to visualize a heatmap of counts of reports in the different regions and different times. We also want the user to filter these count reports by device.

To load this csv file into a nanocube you can use the csv2Nanocube.py script.

   python csv2Nanocube.py --latcol=latitude --loncol=longitude --timecol=time --catcol=device,make example.csv | ncserve ...

The csv2Nanocube.py attempts to parse the csv file by reading the first 10000 lines of the csv file. The script summarizes categorical variables and infer date formats, then convert the data into a binary format that is readable by the nanocube server.

For streaming data, rather than reading from files, the script takes the - parameter to read from stdin. Therefore you can pipe the output of a program to nanocubes. For example:

  cat *.csv  | python csv2Nanocube.py --latcol=latitude --loncol=longitude --timecol=time --catcol=device,make example.csv | ncserve ...

Now we have a binary .dmp file called example.dmp. Note again that this .dmp file is not ncserve compatible. To start a nanocube with 25-levels quadtree with hourly resolution to count the number of checkins we run the command.

cat example.dmp |                    \
ncdmp --encoding=b                   \
dim-dmq=pos,latitude,longitude,25    \
dim-cat=device,checkin_device        \
dim-tbin=time,checkin_time,2013_1h,2 \
var-one=count,4 | ncserve

Note that for this command to work there needs to be a program named nc_q25_c1_u2_u4 in the folder pointed by the environment variable NANOCUBE_BIN. This folder is where all the specific nanocube programs are stored. If such a file doesn't exist one can set the environment variable and run the script ncbuild

export NANOCUBE_BIN=$HOME/local/bin
ncbuild q25 c1 u2 u4

Once the nanocube start to get new data it also starts listening to some port which should be reported on the console. Through that port we can query the nanocube right away. For example if you open a browser and put the url (for port 29512)

http://localhost:29512/query

you get a json answer

{ "levels":[  ], "root":{ "addr":"0", "value":50.000000 } }

and the number 50.0 corresponds to the total number of records we have in the nanocube. If we want to drill down in device we can query

http://localhost:29512/query/@device=255+1

and get a json answer

{ "levels":[ "device" ], "root":{ "addr":"0", "children":[ { "addr":"0", "value":26.000000 }, { "addr":"1", "value":22.000000 }, { "addr":"2", "value":2.000000 } ] } }

The numbers 26, 22 and 2 are respectively the number of iphone, android and windows checkins. Note the 255+1 in the query is an ugly syntax for now, but the idea of this traversal rule is to indicate a dimension address plus a number of levels down we want. The symbol @ indicates that we want to drilldown in that dimension (every result is going to have a device component in this case). The number 255 is the address of the root node of a flat categorical tree of 1-byte or c1 dimension. If we had a c2 dimension the root would be 65535. We can use the same syntax of (address)+(offset) to query a 256x256 pixels tile for the whole world. Here is the query

http://localhost:29512/query/@device=0+8

and here is the json result

{ "levels":[ "pos" ], "root":{ "addr":"0", "children":[ { "addr":"200000130000002d", "value":1.000000 }, { "addr":"20000013e000004b", "value":1.000000 }, { "addr":"200000136000002d", "value":1.000000 }, { "addr":"200000144000004b", "value":1.000000 }, { "addr":"20000013e000004c", "value":1.000000 }, { "addr":"2000001600000026", "value":1.000000 }, { "addr":"2000001380000029", "value":2.000000 }, { "addr":"20000010a0000051", "value":1.000000 }, { "addr":"2000001280000036", "value":1.000000 }, { "addr":"20000012e000003c", "value":1.000000 }, { "addr":"20000012a0000032", "value":1.000000 }, { "addr":"20000013a0000037", "value":1.000000 }, { "addr":"20000012e0000031", "value":1.000000 }, { "addr":"200000132000003e", "value":1.000000 }, { "addr":"20000013e000003b", "value":1.000000 }, { "addr":"2000001440000034", "value":1.000000 }, { "addr":"20000014c0000031", "value":1.000000 }, { "addr":"200000144000002c", "value":1.000000 }, { "addr":"200000140000004d", "value":1.000000 }, { "addr":"200000142000003d", "value":1.000000 }, { "addr":"200000142000003e", "value":1.000000 }, { "addr":"200000148000003d", "value":1.000000 }, { "addr":"200000140000004a", "value":1.000000 }, { "addr":"2000001380000032", "value":1.000000 }, { "addr":"2000001200000036", "value":1.000000 }, { "addr":"200000140000004c", "value":1.000000 }, { "addr":"20000013a0000028", "value":1.000000 }, { "addr":"200000146000004f", "value":1.000000 }, { "addr":"20000014c0000056", "value":1.000000 }, { "addr":"2000001580000054", "value":1.000000 }, { "addr":"2000001540000040", "value":1.000000 }, { "addr":"20000014c000002d", "value":1.000000 }, { "addr":"2000001300000044", "value":1.000000 }, { "addr":"200000150000004b", "value":1.000000 }, { "addr":"2000001300000036", "value":1.000000 }, { "addr":"20000013a000003f", "value":1.000000 }, { "addr":"2000001460000036", "value":1.000000 }, { "addr":"20000013a0000043", "value":1.000000 }, { "addr":"2000001340000047", "value":1.000000 }, { "addr":"2000001180000050", "value":1.000000 }, { "addr":"200000130000002c", "value":1.000000 }, { "addr":"20000014c0000058", "value":2.000000 }, { "addr":"20000011c0000049", "value":1.000000 }, { "addr":"20000015c000003d", "value":1.000000 }, { "addr":"2000001520000028", "value":1.000000 }, { "addr":"20000015c0000031", "value":1.000000 }, { "addr":"200000136000002a", "value":1.000000 }, { "addr":"2000001180000039", "value":1.000000 } ] } }

The addresses are 64-bit numbers written in hex that can be decoded into a quadtree node address with the following code snippet

x     = address & 0x1FFFFFFF
y     = (address >> 29) & 0x1FFFFFFF
level = (address >> 58)

In other words the quadtrees are limited to 29 levels (which is enough for google maps like pan-and-zoom maps). The zero on 0+8 is the address of the root node in any quadtree q1 ... q29 dimension (note that it is different from the root of the categorical trees c1 ... c8).

If we want only an image of the top-left quadrant of the world we can write

http://localhost:29512/query/@pos=qaddr(0,1,1)+8/device=0+0

and here is the result (all these values should sum up to 26)

{ "levels":[ "pos" ], "root":{ "addr":"0", "children":[ { "addr":"240000242000006d", "value":1.000000 }, { "addr":"24000025e0000062", "value":1.000000 }, { "addr":"240000264000007d", "value":1.000000 }, { "addr":"24000028a0000068", "value":1.000000 }, { "addr":"240000260000006d", "value":1.000000 }, { "addr":"2400002560000064", "value":1.000000 }, { "addr":"2400002820000095", "value":1.000000 }, { "addr":"240000276000006e", "value":1.000000 }, { "addr":"24000028a0000059", "value":1.000000 }, { "addr":"24000028e000006d", "value":1.000000 }, { "addr":"240000292000007a", "value":1.000000 }, { "addr":"240000284000007a", "value":1.000000 }, { "addr":"24000023200000a1", "value":1.000000 }, { "addr":"24000026c000005b", "value":1.000000 }, { "addr":"240000284000007d", "value":1.000000 }, { "addr":"24000029a00000ad", "value":1.000000 }, { "addr":"240000282000009a", "value":1.000000 }, { "addr":"2400002a80000080", "value":1.000000 }, { "addr":"24000028e000009f", "value":1.000000 }, { "addr":"2400002620000089", "value":1.000000 }, { "addr":"240000262000005a", "value":1.000000 }, { "addr":"24000025c0000078", "value":1.000000 }, { "addr":"2400002b80000063", "value":1.000000 }, { "addr":"24000021600000a3", "value":1.000000 }, { "addr":"2400002620000059", "value":1.000000 }, { "addr":"2400002320000073", "value":1.000000 } ] } }

The constraints in time are special. To for a daily timeseries for the first 20 days of 2013 we query

http://localhost:29512/query/@time=0:24:20

and the result is

{ "levels":[ "time" ], "root":{ "addr":"0", "children":[ { "addr":"d8000000f0", "value":0.000000 }, { "addr":"c0000000d8", "value":1.000000 }, { "addr":"a8000000c0", "value":1.000000 }, { "addr":"90000000a8", "value":0.000000 }, { "addr":"7800000090", "value":0.000000 }, { "addr":"6000000078", "value":0.000000 }, { "addr":"4800000060", "value":0.000000 }, { "addr":"3000000048", "value":0.000000 }, { "addr":"1800000030", "value":0.000000 }, { "addr":"18", "value":0.000000 }, { "addr":"f000000108", "value":0.000000 }, { "addr":"10800000120", "value":1.000000 }, { "addr":"12000000138", "value":0.000000 }, { "addr":"13800000150", "value":0.000000 }, { "addr":"15000000168", "value":0.000000 }, { "addr":"16800000180", "value":0.000000 }, { "addr":"18000000198", "value":1.000000 }, { "addr":"198000001b0", "value":0.000000 }, { "addr":"1b0000001c8", "value":0.000000 }, { "addr":"1c8000001e0", "value":0.000000 } ] } }

The 64-bit time addresses are decoded as an interval [a,b) intervals the first 32 bits is for b and the most significal bits are for a. Here is a code snippet in python for the first bucket that shows up in the result with address "d8000000f0" (note these buckets don't come in order).

>>> addr = 0xd8000000f0
>>> b = addr & 0xFFFFFFFF
>>> a = (addr >> 32) & 0xFFFFFFFF
>>> a
216
>>> b
240

So it is the bucket that counts all events from hour 216 of year 2013 to hour 239 of 2013.

Nanocube building times are much much faster if the records are sorted in time (which we haven't done in this small example).

Clone this wiki locally