If you already have a cluster deployed, you can skip this section. To deploy a 3-node Chameleon cluster, use the following:
cm reset
pip uninstall cloudmesh_client
pip install -U cloudmesh_client
cm key add --ssh
cm refresh on
cm cluster define --count 3 \
--image CC-Ubuntu14.04 --flavor m1.medium
cm hadoop define spark pig
cm hadoop sync
cm hadoop deploy
cm cluster cross_ssh
git clone https://github.com/cloudmesh/cloudmesh.word2vec.git
(Note: on the cluster, the first node becomes the master.)
cd ansible-word2vec
vi hosts
./run.sh
The run script will:
- deploy the code
- run the crawler to collect the training set
- submit the job to Spark
To clean up the installation from the cluster:
cd ansible-word2vec
ansible-playbook word2vec_cleanup.yaml
- If the run.sh script fails midway, execute the cleanup script before re-triggering the run script. The run script can fail for a variety of reasons, such as a failed SSH connection or Hadoop not being available. (See the recovery sketch after this list.)
- If the run script fails due to Spark memory errors, modify the Spark memory settings in ansible-word2vec/setupvariables.yml, then execute the ansible-word2vec/word2vec_cleanup.yaml playbook and the ansible-word2vec/run.sh script again.
- To test intermediate code changes, push the code to a git feature branch, for example spark_test. In the git section of word2vec_setup.yaml, change version=master to point to the spark_test branch, then execute the ansible-word2vec/run.sh script. (See the branch sketch after this list.)
- If Hadoop goes into safe mode, go to the cluster namenode and execute
/opt/hadoop$ bin/hadoop dfsadmin -safemode leave
This takes the cluster out of safe mode. (See the safe-mode check after this list.)
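A minimal recovery sketch for a failed run, assuming the default repository layout (the memory edit is only needed when the failure was memory-related):
cd ansible-word2vec
vi setupvariables.yml                    # optional: raise the Spark memory settings here
ansible-playbook word2vec_cleanup.yaml   # always clean up the partial deployment first
./run.sh                                 # then re-trigger the run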
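A hedged sketch of the feature-branch workflow; the branch name spark_test is an example, and it is assumed here that word2vec_setup.yaml lives in ansible-word2vec with a version=master entry in its git section:
git checkout -b spark_test       # put your intermediate changes on a feature branch
git push origin spark_test       # publish it so the playbook can pull it
cd ansible-word2vec
vi word2vec_setup.yaml           # in the git section, change version=master to version=spark_test
./run.sh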
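To check the namenode's safe-mode status before and after, dfsadmin also accepts get (run from /opt/hadoop on the namenode):
bin/hadoop dfsadmin -safemode get     # reports whether safe mode is ON or OFF
bin/hadoop dfsadmin -safemode leave   # force the namenode out of safe mode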
- Configure the properties in config.properties. For the wiki crawler, the important parameters are A) data_location, B) seed_list, and C) max_pages.
- Configure the seed pages in wiki_crawl_seedlist.csv.
- Create a folder 'crawldb' under 'code' (or at the location specified in the config file).
- Execute the crawler (see the sketch after this list):
python wikicrawl.py
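A minimal run sketch, assuming you work from the repository root and keep the default locations from config.properties:
mkdir -p code/crawldb    # or the data_location you configured
python wikicrawl.py      # crawls the pages seeded in wiki_crawl_seedlist.csv, up to max_pages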
- Configure the properties in config.properties. For the news crawler, the important parameters are A) data_location, B) news_seed_list, and C) the Google Custom Search API keys.
- Configure the seed pages in news_crawl_seedlist.csv.
- Get Google Custom Search API keys by following the instructions at https://developers.google.com/custom-search/.
- Create a folder 'crawldb' under 'code' (or at the location specified in the config file).
- Execute the crawler (see the sketch after this list):
python newscrawl.py
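The news crawler runs the same way (same assumptions as above; the Google Custom Search API keys themselves go into config.properties, with key names defined by the project):
mkdir -p code/crawldb
python newscrawl.py      # uses news_crawl_seedlist.csv and the configured API keys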
- Create a folder 'model' under 'code' for saving the model.
- Configure the properties in config.properties.
- Start your Spark cluster.
- Configure the spark-master URL in create-word2vec-model.sh (see the sketch after this block).
- Execute create-word2vec-model.sh:
bash create-word2vec-model.sh
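The value below is only a sketch of the usual Spark standalone master URL; substitute the hostname of the cluster's first (master) node, and note that how create-word2vec-model.sh consumes it is project-specific:
spark://<master-node-hostname>:7077    # Spark standalone master URL, default port 7077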
- Execute use_word2vec_model.py.
- The test file location can be defined in config.properties (synonym_test_file); the default value is stest.csv.
- The generated result file is stored in stestresult.csv by default; this can be changed via config.properties.
bash use_word2vec_model.sh
- Execute find_relations.py.
- The test file location can be defined in config.properties (relations_test_file); the default value is relationstest.csv.
- The generated result file is stored in relationsresult.csv by default; this can be changed via config.properties.
bash find_relations.sh
To switch the Spark installation under /opt from 2.1.0 to 1.6.0 and remove old word2vec artifacts, log in (ssh) to the cluster and run the following:
cd /opt
sudo unlink spark
sudo rm -rf spark-2.1.0-bin-hadoop2.6*
sudo ln -s /opt/spark-1.6.0-bin-hadoop2.6 spark
sudo rm -rf /tmp/word2vec*
sudo rm -rf /opt/word2vec*
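A quick sanity check after the switch (a sketch; spark-submit --version is a standard Spark command):
ls -l /opt/spark                        # should now point to /opt/spark-1.6.0-bin-hadoop2.6
/opt/spark/bin/spark-submit --version   # prints the active Spark version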