Fabio Markus Miranda edited this page Jul 14, 2014 · 2 revisions

ncdistribute - Nanocube Distribute process

Nanocube distribute presents multiple nanocube processes, running on one or more machines, as a single "larger" nanocube. Queries are handled by a master process that dispatches each query to its child nanocubes and combines the results of all these dispatches.
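The dispatch-and-combine idea can be sketched in a few lines of Python. This is only an illustration, not the ncdistribute implementation: the `Child` callable, `scatter_gather`, and `make_fake_child` are hypothetical names, and summing the root values is an assumption that holds only for additive, count-like results.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Hypothetical: a "child" is any callable that takes a query string and
# returns a reply shaped like {"root": {"value": <count>}}.
Child = Callable[[str], dict]


def scatter_gather(query: str, children: Sequence[Child]) -> dict:
    """Send the same query to every child and sum the root values."""
    with ThreadPoolExecutor(max_workers=max(1, len(children))) as pool:
        replies = list(pool.map(lambda child: child(query), children))
    # Assumption: results are plain counts, so merging is a sum.
    total = sum(reply["root"]["value"] for reply in replies)
    return {"levels": [], "root": {"addr": "0", "value": float(total)}}


def make_fake_child(count: float) -> Child:
    """Stand-in for a real nanocube child that holds `count` records."""
    return lambda query: {"levels": [], "root": {"addr": "0", "value": count}}


merged = scatter_gather("/query", [make_fake_child(4919.0), make_fake_child(44267.0)])
print(merged["root"]["value"])
```

Here the two fake children stand in for real nanocubes; in a real deployment each child would be a network call to a running nanocube process.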

A distribution rule file must be created in order to distribute the data through multiple nanocubes. The format of this file is:

address:port [q] [X] 

where address:port is the address and port of a machine running a daemon process, q marks a query-only machine, and X is the fraction of the dataset sent to that host (0 <= X <= 1).
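To make the format concrete, here is a minimal Python sketch of how such a rule line could be parsed. It is not the parser used by nanocube; `HostRule`, `parse_rule_line`, and `parse_rules` are hypothetical names, and the sketch assumes that `q` and X may appear in any order after `address:port`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostRule:
    address: str
    port: int
    query_only: bool = False
    fraction: Optional[float] = None  # None when the line gives no X


def parse_rule_line(line: str) -> HostRule:
    """Parse one 'address:port [q] [X]' line into a HostRule."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty rule line")
    # Split on the last ':' so the port is always the final component.
    address, sep, port_text = tokens[0].rpartition(":")
    if not sep or not address:
        raise ValueError(f"expected address:port, got {tokens[0]!r}")
    rule = HostRule(address=address, port=int(port_text))
    for token in tokens[1:]:
        if token == "q":
            rule.query_only = True
        else:
            fraction = float(token)
            if not 0.0 <= fraction <= 1.0:
                raise ValueError(f"fraction out of range: {fraction}")
            rule.fraction = fraction
    return rule


def parse_rules(text: str) -> List[HostRule]:
    """Parse every non-blank line of a distribution rule file."""
    return [parse_rule_line(line) for line in text.splitlines() if line.strip()]


rules = parse_rules("localhost:29512 0.1\nlocalhost:29513 0.9\n")
for rule in rules:
    print(rule)
```

Validating the file this way before starting the distribute process can catch a mistyped port or an out-of-range X early.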

Each one of the machines in the distribution rule file must be running a nanocube-daemon process. This process will create a nanocube child and receive the data from the distribute process. To run a daemon process:

./nanocube daemon [p]

where p is the port on which the daemon process waits for the connection message from the distribute process.

Once all daemons are running and a distribution rule file has been created, the distribute process can be started as follows:

./nanocube distribute [-t <threads>] [-o] [-q <query-port>] [-b <block-size>] -h <hosts-filename>

where:

-t <threads>,  --threads <threads>
Number of threads for querying (mongoose)

-o,  --query-only
Only offer query.

-q <query-port>,  --query <query-port>
Query port.

-b <block-size>,  --block <block-size>
Block size, to be sent to each host.

-h <hosts-filename>,  --hosts <hosts-filename>
(required)  Distribution rule file

After the distribute process has started and the entire dataset has been distributed among the nanocube children, the distributed nanocube can be queried in the same fashion as a single nanocube process.

We can use the crime50k dataset as an example of how to run ncdistribute with two nanocubes on the local machine. First, we run two daemon processes:

./nanocube daemon 29512
./nanocube daemon 29513

We then create a distribution rule file (dist.txt), with each line containing the address (localhost in this example) and port (29512 and 29513) of a daemon, followed by the fraction of the dataset that host receives:

localhost:29512 0.1
localhost:29513 0.9

Once we run nanocube distribute, it will distribute the crime50k dataset between the hosts, according to the distribution rule file, and handle the queries. To run nanocube distribute:

./nanocube distribute -q 5000 -h dist.txt < ../scripts/crime50k.nano
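For intuition about what a 0.1 / 0.9 rule means in record counts, the following Python sketch splits a total proportionally. It is an assumption-laden illustration: this page does not describe how ncdistribute actually assigns records (for example, how the block size interacts with the fractions), and `split_counts` is a hypothetical helper. The total of 49186 is the count reported by the query reply in this example.

```python
from typing import List


def split_counts(total: int, fractions: List[float]) -> List[int]:
    """Split `total` records across hosts in proportion to `fractions`.

    Rounding the cumulative boundaries (rather than each share) makes
    the parts add up to round(total * sum(fractions)).
    """
    counts = []
    cumulative = 0.0
    previous_boundary = 0
    for fraction in fractions:
        cumulative += fraction
        boundary = round(total * cumulative)
        counts.append(boundary - previous_boundary)
        previous_boundary = boundary
    return counts


counts = split_counts(49186, [0.1, 0.9])
print(counts, sum(counts))
```

Under this scheme the first host would hold roughly a tenth of the records and the second the rest, with no record lost to rounding.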

We can then issue queries to the nanocube distribute process:

http://localhost:5000/query

The reply:

{ "levels":[  ], "root":{ "addr":"0", "value":49186.000000 } }

This means that, even though the dataset is distributed between the two nanocubes, the master process merged their results successfully.
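The same check can be scripted. The sketch below, using only the Python standard library, parses a reply of the shape shown above and extracts the root value; `parse_root_value` and `query_root_value` are hypothetical helper names, and the reply layout is assumed to match this example.

```python
import json
from urllib.request import urlopen


def parse_root_value(reply_text: str) -> float:
    """Extract root.value from a reply shaped like the example above."""
    reply = json.loads(reply_text)
    return float(reply["root"]["value"])


def query_root_value(base_url: str, timeout: float = 10.0) -> float:
    """Fetch <base_url>/query and return the root value of the reply."""
    with urlopen(f"{base_url}/query", timeout=timeout) as response:
        return parse_root_value(response.read().decode("utf-8"))


# Parse the reply shown above without needing a running server.
sample = '{ "levels":[  ], "root":{ "addr":"0", "value":49186.000000 } }'
print(parse_root_value(sample))
```

With the distribute process from this example running, `query_root_value("http://localhost:5000")` would perform the request itself.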
