ncdistribute
Nanocube distribute presents multiple nanocube processes, running locally or on other machines, as a single "larger" nanocube. Queries are handled by a master process that dispatches each query to its child nanocubes and combines their results.
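The dispatch-and-combine pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not nanocube's implementation: `query_children` and the child callables are hypothetical stand-ins for real child nanocubes, and the combine step assumes each partial result is a plain count that can be summed.

```python
from concurrent.futures import ThreadPoolExecutor


def query_children(children, query):
    """Send the same query to every child and sum the partial counts.

    `children` is a list of callables standing in for child nanocubes
    (hypothetical; real children are reached over the network).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(children))) as pool:
        partials = list(pool.map(lambda child: child(query), children))
    return sum(partials)


# Toy children, each holding a made-up slice of the records.
child_a = lambda query: 3
child_b = lambda query: 7

total = query_children([child_a, child_b], "count")
print(total)
```

The caller sees one total, regardless of how many children contributed to it.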
To distribute the data across multiple nanocubes, create a distribution rule file. Each line of this file has the format:
address:port [q] [X]
where address:port is the address and port of a machine running a daemon process, q marks a query-only machine, and X is the fraction of the dataset sent to that host (0 <= X <= 1).
Each machine listed in the distribution rule file must be running a nanocube daemon process. This process creates a child nanocube and receives the data from the distribute process. To run a daemon process:
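As a rough illustration of the rule format, the following Python sketch parses one rule line into its parts. It is not nanocube's own parser. It assumes that fields are separated by whitespace, that the query-only marker is the literal letter `q`, and that X is written as a decimal number. `HostRule` and `parse_rule_line` are hypothetical names introduced here.

```python
from typing import NamedTuple, Optional


class HostRule(NamedTuple):
    address: str
    port: int
    query_only: bool
    fraction: Optional[float]  # None when no X is given


def parse_rule_line(line: str) -> HostRule:
    """Parse a line of the form `address:port [q] [X]` (assumed layout)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty rule line")
    address, sep, port = tokens[0].rpartition(":")
    if not sep or not address:
        raise ValueError(f"expected address:port, got {tokens[0]!r}")
    query_only = False
    fraction = None
    for token in tokens[1:]:
        if token == "q":
            query_only = True
        else:
            fraction = float(token)
            # The documented range for X is 0 <= X <= 1.
            if not 0.0 <= fraction <= 1.0:
                raise ValueError(f"fraction out of range: {fraction}")
    return HostRule(address, int(port), query_only, fraction)


rules = [parse_rule_line(text)
         for text in ["localhost:29512 0.1", "localhost:29513 0.9"]]
for rule in rules:
    print(rule)
```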
./nanocube daemon [p]
where p is the port on which the daemon process waits for the connection message from the distribute process.
Once all daemons are running and the distribution rule file has been created, the distribute process can be started as follows:
./nanocube distribute [-t <threads>] [-o] [-q <query-port>] [-b <block-size>] -h <hosts-filename>
where:
-t <threads>, --threads <threads>
Number of threads for querying (mongoose)
-o, --query-only
Only serve queries.
-q <query-port>, --query <query-port>
Query port.
-b <block-size>, --block <block-size>
Block size, to be sent to each host.
-h <hosts-filename>, --hosts <hosts-filename>
(required) Distribution rule file
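For scripted setups, the options above can be assembled into a command line programmatically. The sketch below uses only the short flags listed here. The function name `distribute_argv` and its parameter names are hypothetical, and the default binary path `./nanocube` is taken from the usage line.

```python
from typing import List, Optional


def distribute_argv(hosts_file: str,
                    query_port: Optional[int] = None,
                    threads: Optional[int] = None,
                    block_size: Optional[int] = None,
                    query_only: bool = False,
                    binary: str = "./nanocube") -> List[str]:
    """Build the argument list for `nanocube distribute`."""
    argv = [binary, "distribute"]
    if threads is not None:
        argv += ["-t", str(threads)]
    if query_only:
        argv.append("-o")
    if query_port is not None:
        argv += ["-q", str(query_port)]
    if block_size is not None:
        argv += ["-b", str(block_size)]
    # -h is the only required option.
    argv += ["-h", hosts_file]
    return argv


print(" ".join(distribute_argv("dist.txt", query_port=5000)))
```

The resulting list can be passed to `subprocess.run`, with the dataset file opened and supplied as `stdin`, since the distribute process reads the data from standard input.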
After the distribute process has started and the whole dataset has been distributed among the child nanocubes, the distributed nanocube can be queried in the same way as a single nanocube process.
We can use the crime50k dataset as an example of how to run ncdistribute with two nanocubes on the local machine. First, we run two daemon processes:
./nanocube daemon 29512
./nanocube daemon 29513
We then create a distribution rule file (dist.txt), in which each line contains the address (localhost in this example) and port (29512 and 29513) of a nanocube, followed by the fraction of the dataset it receives:
localhost:29512 0.1
localhost:29513 0.9
When we run nanocube distribute, it distributes the crime50k dataset between the hosts according to the distribution rule file and handles the queries. To run nanocube distribute:
./nanocube distribute -q 5000 -h dist.txt < ../scripts/crime50k.nano
We can now issue queries to the nanocube distribute process:
http://localhost:5000/query
The reply:
{ "levels":[ ], "root":{ "addr":"0", "value":49186.000000 } }
This means that, even though the dataset is distributed between the two nanocubes, the master process merged the results successfully.
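To check the merged count from a script rather than a browser, the reply can be read as JSON. The sketch below assumes the reply is UTF-8 encoded JSON shaped like the sample above. `fetch_reply` and `root_value` are hypothetical helper names, and `fetch_reply` only works while the distribute process from this example is running.

```python
import json
from urllib.request import urlopen


def fetch_reply(url: str) -> str:
    """Fetch the raw reply text from a running distribute process."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def root_value(reply_text: str) -> float:
    """Extract the value stored at the root of a query reply."""
    reply = json.loads(reply_text)
    return reply["root"]["value"]


# The sample reply shown above, parsed offline.
sample = '{ "levels":[ ], "root":{ "addr":"0", "value":49186.000000 } }'
print(root_value(sample))

# Against a live process (not executed here):
# print(root_value(fetch_reply("http://localhost:5000/query")))
```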