ncdistribute

ncdistribute - Nanocube Distribute process

Nanocube distribute abstracts multiple nanocube processes running somewhere into a single "larger" nanocube. The queries are handled by a master process, that dispatches each query to its children nanocubes and combines the results of all these dispatches.

A distribution rule file must be created in order to distribute the data through multiple nanocubes. The format of this file is:

address:port [q] [X]

where address:port is the address and the port of the machine running a daemon process, q indicates a query-only machine and X is the dataset porcentage going to that host (0 <= X <= 1).

Each one of the machines in the distribution rule file must be running a nanocube-daemon process. This process will create a nanocube child and receive the data from the distribute process. To run a daemon process:

./nanocube daemon [p]

where p is the port that the daemon process will wait for the connection message from the distribute process.

Once all daemons are running and a distribution rule file is created, the distribution process can be started, as following:

./nanocube distribute [-t <threads>] [-o] [-q <query-port>] [-b <block-size>] -h <hosts-filename>

where:

-t <threads>,  --threads <threads>
Number of threads for querying (mongoose)

-o,  --query-only
Only offer query.

-q <query-port>,  --query <query-port>
Query port.

-b <block-size>,  --block <block-size>
Block size, to be sent to each host.

-h <hosts-filename>,  --hosts <hosts-filename>
(required)  Distribution rule file

After the distribute process is started and all the dataset is distributed through all the nanocube children, the nanocube can be queries in the same fashion as a single nanocube process.

We can use the crime50k dataset as an example of how to run ncdistribute with two nanocubes on the local machine. First, we run two daemon processes:

./nanocube daemon 29512
./nanocube daemon 29513

We then create a distribution rule file (dist.txt), with each line containing the address (localhost in this example), port (29512 and 29513) of the nanocubes, and percentage:

localhost:29512 0.1
localhost:29513 0.9

Once we run nanocube distribute, it will distribute the crime50k dataset between the hosts, according to the distribution rule file, and handle the queries. To run nanocube distribute:

./nanocube distribute -q 5000 -h dist.txt < ../scripts/crime50k.nano

We can issue queries to the nanocubes distribute process:

http://localhost:5000/query

The reply:

{ "levels":[  ], "root":{ "addr":"0", "value":49186.000000 } }

This means that, even thought the dataset is distributed among the two nanocubes, the master process handled the merge successfully.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ncdistribute

ncdistribute - Nanocube Distribute process

Clone this wiki locally