Skip to content
Lauro Lins edited this page Apr 15, 2014 · 41 revisions

Further Details on Nanocubes

These notes are for versions >=2.0

Note: current nanocube server implementation only works on 64-bit architectures

Suppose we want to create a nanocube for the following table in csv format:

latitude,longitude,time,device
44.00124456,-73.74023438,2013-02-21T12:49,android
42.0090331,-74.79492188,2013-04-11T13:58,iphone
45.68366959,-94.04296875,2013-02-28T17:33,iphone
37.97076194,-85.69335938,2013-04-17T05:04,android
...

Imagine for instance that this table stores check-in reports of a social website like foursquare or the defunct brightkite. In addition to location and time we have a categorical variable indicating which device was used to report the check-in. The goal we have in mind is to create a nanocube back-end for this dataset that will respond (at interactive rates) the queries needed on a front-end that has pannable and zoomable map to visualize a heatmap of counts of reports in the different regions and different times. We also want the user to filter these count reports by device.

To load this csv file into a nanocube we first need to convert it into a file that is compatible with the nanocube tool called ncdmp. We call these files "dump" files and they use the extension ".dmp". A dump file has always a text header followed by the end-of-header marker (i.e. two line feeds: 0x0a, 0x0a) followed by data records. When we have a nanocube ready .dmp file this is how we start a nanocube:

cat datafile.dmp | ncserve

Unfortunately not all .dmp files are ready to be streamed into a nanocube. Only a subset of the possible .dmp files are compatible with the nanocube startup program ncserve. The tool ncdmp is used to filter a general .dmp file into one that is compatible with ncserve. So the more common way of starting a nanocube takes the general form of:

cat datafile.dmp | ncdmp (config parameters) | ncserve

In the ncdmp command above is where we commit to the final aspect of the nanocube we want to build. For example, latitude and longitude floating-point values can be stored into a .dmp file, but are not ready to be streamed into ncserve. We need to define a binning structure for space in, for example, the form of a k-levels quadtree. In the (config parameters) above is where we can declare such binning scheme and k.

So, let's see how to load a nanocube from the .csv file above. We first generate a general .dmp file by executing

python csv2dmp.py

using the python script below:

import time
import struct

# dictionary to map device names to numeric values 
# (<255 means we can fit it into a 1-byte categorical variable)
device_map = { "iphone":0, "android":1, "windows":2 }

# write header
output_file = open("example.dmp","w")
output_file.write("name: example\n")
output_file.write("encoding: binary\n")
output_file.write("field: latitude,float\n")
output_file.write("field: longitude,float\n")
output_file.write("field: checkin_time,uint64\n")
output_file.write("field: checkin_device,uint8\n")
for (k,v) in device_map.iteritems():
    output_file.write("valname: checkin_device," + str(v) + "," + str(k) +"\n")
output_file.write("\n")

# write records based on the csv lines
input_file = open("example.csv","r")
line_no = 0
while True:
    line_no += 1
    line = input_file.readline().strip()
    if not line:
        break
    if line_no == 1:
        continue
    tokens = line.split(",")
    latitude       = float(tokens[0])
    longitude      = float(tokens[1])
    checkin_time   = int(time.mktime(time.strptime(tokens[2],"%Y-%m-%dT%H:%M")))
    checkin_device = int(device_map[tokens[3]])
    # print latitude, longitude, checkin_time, checkin_device
    # little endian: 4 bytes float, 4 bytes float, 8 bytes unix time, 1 byte device
    data = struct.pack("<ffQB", latitude, longitude, checkin_time, checkin_device)
    output_file.write(data)

input_file.close()
output_file.close()

Now we have a binary .dmp file called example.dmp. Note again that this .dmp file is not ncserve compatible. To start a nanocube with 25-levels quadtree with hourly resolution to count the number of checkins we run the command.

cat example.dmp |                    \
ncdmp --encoding=b                   \
dim-dmq=pos,latitude,longitude,25    \
dim-cat=device,checkin_device        \
dim-tbin=time,checkin_time,2013_1h,2 \
var-one=count,4 | ncserve

Note that for this command to work there needs to be a program named nc_q25_c1_u2_u4 in the folder pointed by the environment variable NANOCUBE_BIN. This folder is where all the specific nanocube programs are stored. If such a file doesn't exist one can set the environment variable and run the script ncbuild

export NANOCUBE_BIN=$HOME/local/bin
ncbuild q25 c1 u2 u4

Once the nanocube start to get new data it also starts listening to some port which should be reported on the console. Through that port we can query the nanocube right away. For example if you open a browser and put the url (for port 29512)

http://localhost:29512/query

you get a json answer

{ "levels":[  ], "root":{ "addr":"0", "value":50.000000 } }

and the number 50.0 corresponds to the total number of records we have in the nanocube. If we want to drill down in device we can query

http://localhost:29512/query/@device=255+1

and get a json answer

{ "levels":[ "device" ], "root":{ "addr":"0", "children":[ { "addr":"0", "value":26.000000 }, { "addr":"1", "value":22.000000 }, { "addr":"2", "value":2.000000 } ] } }

The numbers 26, 22 and 2 are respectively the number of iphone, android and windows checkins. Note the 255+1 in the query is an ugly syntax for now, but the idea of this traversal rule is to indicate a dimension address plus a number of levels down we want. The symbol @ indicates that we want to drilldown in that dimension (every result is going to have a device component in this case). The number 255 is the address of the root node of a flat categorical tree of 1-byte or c1 dimension. If we had a c2 dimension the root would be 65535. We can use the same syntax of (address)+(offset) to query a 256x256 pixels tile for the whole world. Here is the query

http://localhost:29512/query/@device=0+8

and here is the json result

{ "levels":[ "pos" ], "root":{ "addr":"0", "children":[ { "addr":"200000130000002d", "value":1.000000 }, { "addr":"20000013e000004b", "value":1.000000 }, { "addr":"200000136000002d", "value":1.000000 }, { "addr":"200000144000004b", "value":1.000000 }, { "addr":"20000013e000004c", "value":1.000000 }, { "addr":"2000001600000026", "value":1.000000 }, { "addr":"2000001380000029", "value":2.000000 }, { "addr":"20000010a0000051", "value":1.000000 }, { "addr":"2000001280000036", "value":1.000000 }, { "addr":"20000012e000003c", "value":1.000000 }, { "addr":"20000012a0000032", "value":1.000000 }, { "addr":"20000013a0000037", "value":1.000000 }, { "addr":"20000012e0000031", "value":1.000000 }, { "addr":"200000132000003e", "value":1.000000 }, { "addr":"20000013e000003b", "value":1.000000 }, { "addr":"2000001440000034", "value":1.000000 }, { "addr":"20000014c0000031", "value":1.000000 }, { "addr":"200000144000002c", "value":1.000000 }, { "addr":"200000140000004d", "value":1.000000 }, { "addr":"200000142000003d", "value":1.000000 }, { "addr":"200000142000003e", "value":1.000000 }, { "addr":"200000148000003d", "value":1.000000 }, { "addr":"200000140000004a", "value":1.000000 }, { "addr":"2000001380000032", "value":1.000000 }, { "addr":"2000001200000036", "value":1.000000 }, { "addr":"200000140000004c", "value":1.000000 }, { "addr":"20000013a0000028", "value":1.000000 }, { "addr":"200000146000004f", "value":1.000000 }, { "addr":"20000014c0000056", "value":1.000000 }, { "addr":"2000001580000054", "value":1.000000 }, { "addr":"2000001540000040", "value":1.000000 }, { "addr":"20000014c000002d", "value":1.000000 }, { "addr":"2000001300000044", "value":1.000000 }, { "addr":"200000150000004b", "value":1.000000 }, { "addr":"2000001300000036", "value":1.000000 }, { "addr":"20000013a000003f", "value":1.000000 }, { "addr":"2000001460000036", "value":1.000000 }, { "addr":"20000013a0000043", "value":1.000000 }, { "addr":"2000001340000047", "value":1.000000 }, { "addr":"2000001180000050", "value":1.000000 }, { "addr":"200000130000002c", "value":1.000000 }, { "addr":"20000014c0000058", "value":2.000000 }, { "addr":"20000011c0000049", "value":1.000000 }, { "addr":"20000015c000003d", "value":1.000000 }, { "addr":"2000001520000028", "value":1.000000 }, { "addr":"20000015c0000031", "value":1.000000 }, { "addr":"200000136000002a", "value":1.000000 }, { "addr":"2000001180000039", "value":1.000000 } ] } }

The addresses are 64-bit numbers written in hex that can be decoded into a quadtree node address with the following code snippet

x     = address & 0x1FFFFFFF
y     = (address >> 29) & 0x1FFFFFFF
level = (address >> 58)

In other words the quadtrees are limited to 29 levels (which is enough for google maps like pan-and-zoom maps). The zero on 0+8 is the address of the root node in any quadtree q1 ... q29 dimension (note that it is different from the root of the categorical trees c1 ... c8).

If we want only an image of the top-left quadrant of the world we can write

http://localhost:29512/query/@pos=qaddr(0,1,1)+8/device=0+0

and here is the result (all these values should sum up to 26)

{ "levels":[ "pos" ], "root":{ "addr":"0", "children":[ { "addr":"240000242000006d", "value":1.000000 }, { "addr":"24000025e0000062", "value":1.000000 }, { "addr":"240000264000007d", "value":1.000000 }, { "addr":"24000028a0000068", "value":1.000000 }, { "addr":"240000260000006d", "value":1.000000 }, { "addr":"2400002560000064", "value":1.000000 }, { "addr":"2400002820000095", "value":1.000000 }, { "addr":"240000276000006e", "value":1.000000 }, { "addr":"24000028a0000059", "value":1.000000 }, { "addr":"24000028e000006d", "value":1.000000 }, { "addr":"240000292000007a", "value":1.000000 }, { "addr":"240000284000007a", "value":1.000000 }, { "addr":"24000023200000a1", "value":1.000000 }, { "addr":"24000026c000005b", "value":1.000000 }, { "addr":"240000284000007d", "value":1.000000 }, { "addr":"24000029a00000ad", "value":1.000000 }, { "addr":"240000282000009a", "value":1.000000 }, { "addr":"2400002a80000080", "value":1.000000 }, { "addr":"24000028e000009f", "value":1.000000 }, { "addr":"2400002620000089", "value":1.000000 }, { "addr":"240000262000005a", "value":1.000000 }, { "addr":"24000025c0000078", "value":1.000000 }, { "addr":"2400002b80000063", "value":1.000000 }, { "addr":"24000021600000a3", "value":1.000000 }, { "addr":"2400002620000059", "value":1.000000 }, { "addr":"2400002320000073", "value":1.000000 } ] } }

The constraints in time are special. To for a daily timeseries for the first 20 days of 2013 we query

http://localhost:29512/query/@time=0:24:20

and the result is

{ "levels":[ "time" ], "root":{ "addr":"0", "children":[ { "addr":"d8000000f0", "value":0.000000 }, { "addr":"c0000000d8", "value":1.000000 }, { "addr":"a8000000c0", "value":1.000000 }, { "addr":"90000000a8", "value":0.000000 }, { "addr":"7800000090", "value":0.000000 }, { "addr":"6000000078", "value":0.000000 }, { "addr":"4800000060", "value":0.000000 }, { "addr":"3000000048", "value":0.000000 }, { "addr":"1800000030", "value":0.000000 }, { "addr":"18", "value":0.000000 }, { "addr":"f000000108", "value":0.000000 }, { "addr":"10800000120", "value":1.000000 }, { "addr":"12000000138", "value":0.000000 }, { "addr":"13800000150", "value":0.000000 }, { "addr":"15000000168", "value":0.000000 }, { "addr":"16800000180", "value":0.000000 }, { "addr":"18000000198", "value":1.000000 }, { "addr":"198000001b0", "value":0.000000 }, { "addr":"1b0000001c8", "value":0.000000 }, { "addr":"1c8000001e0", "value":0.000000 } ] } }

The 64-bit time addresses are decoded as an interval [a,b) intervals the first 32 bits is for b and the most significal bits are for a. Here is a code snippet in python for the first bucket that shows up in the result with address "d8000000f0" (note these buckets don't come in order).

>>> addr = 0xd8000000f0
>>> b = addr & 0xFFFFFFFF
>>> a = (addr >> 32) & 0xFFFFFFFF
>>> a
216
>>> b
240

So it is the bucket that counts all events from hour 216 of year 2013 to hour 239 of 2013.

Nanocube building times are much much faster if the records are sorted in time (which we haven't done in this small example).

Clone this wiki locally