Fully distributed installation of the Hadoop ecosystem on GCP IaaS.
- Apache Hadoop (hdfs): java8 (playbook-hdfs)
- Apache Hadoop (yarn, mapreduce)
- Apache ZooKeeper: java8 (playbook-zookeeper)
- Apache HBase: java8, hdfs, ZooKeeper (playbook-hbase)
- Apache Spark: java8, ZooKeeper
- Apache Kafka: java8, ZooKeeper
all playbooks are consistent across the related tools
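A quick way to check the prerequisites on a node (a minimal sketch; assumes the tools are already installed and on the PATH):

```
# java8 prerequisite
java -version

# ZooKeeper status (zkServer.sh ships in ZooKeeper's bin/)
zkServer.sh status

# hdfs health, run where the hadoop client is configured
hdfs dfsadmin -report
```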
Generally,
- choose the playbook(s) you need (hdfs, hdfs + hbase, zookeeper, etc.)
- design the infrastructure architecture in the .env files in the related playbook folders
- you can also combine playbooks to build your own (good for trainings or POCs)
- create machines on GCP, and establish passwordless ssh from master to workers (see the sketch after this list)
- then configure the products
- and start the servers
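For the passwordless ssh step, a minimal sketch (the `hadoop` user and worker hostnames are assumptions; use whatever your .env files define):

```
# on the master: generate a key pair without a passphrase (skip if one exists)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# copy the public key to each worker (hostnames are hypothetical)
for host in worker-1 worker-2 worker-3; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$host
done

# verify: should log in without a password prompt
ssh hadoop@worker-1 hostname
```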
Create a GCP account, a billing account, etc. Then
- Configure the gcloud CLI on your local machine, or use Cloud Shell in the GCP console, after cloning the git repository.
- for local, run
gcloud auth list
to check the active GCP account, and
gcloud auth login
if necessary
- for local, run
git clone https://github.com/tansudasli/hadoop-backbone-boilerplate.git
- Then
cd hadoop-backbone-boilerplate
- Edit
.gcp.env
and update it (service account, project, region, etc.)
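The actual keys are defined in .gcp.env; the sketch below is only a hypothetical illustration of the kind of values to fill in:

```
# hypothetical illustration -- use the keys that .gcp.env actually defines
GCP_PROJECT=my-hadoop-project
GCP_REGION=us-central1
GCP_ZONE=us-central1-a
GCP_SERVICE_ACCOUNT=admin@my-hadoop-project.iam.gserviceaccount.com
GCP_BILLING_ACCOUNT=XXXXXX-XXXXXX-XXXXXX
```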
- Then
- Run
./create-gcp-project.sh
to create the project and link your billing account
- Run
./create-firewall-rule.sh
to create firewall rules, so that you can reach the web consoles
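Under the hood these scripts wrap gcloud; roughly, something along these lines (project id, rule name, ports and source range are illustrative assumptions; 9870/8088/16010 are the Hadoop 3.x NameNode, YARN and HBase web-UI defaults, adjust to the versions the playbooks install):

```
# create the project and link it to a billing account (ids are placeholders)
gcloud projects create my-hadoop-project
gcloud billing projects link my-hadoop-project --billing-account=XXXXXX-XXXXXX-XXXXXX

# open the web-console ports (restrict source-ranges to your own IP in practice)
gcloud compute firewall-rules create hadoop-web-ui \
  --project=my-hadoop-project \
  --allow=tcp:9870,tcp:8088,tcp:16010 \
  --source-ranges=203.0.113.0/32
```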
Possible improvements:
- More optimized and parametric scripts (env files etc.)
- Use fewer static IPs (just for masters etc.)
- Dynamic machine types according to purpose (different CPU and RAM configs)
- Dynamic port management (open ports only for masters)
- Dedicate nodes to hdfs, hbase, spark etc., so the cluster becomes fully distributed
- Shared zookeeper (instead of hbase-managed)
- Adjust file and process limits in Linux (ulimit -n, -u)
- JVM optimizations
- Better disk architecture (local SSD disks etc.)
- Backups to network-attached disks (full hdfs image etc.)
- More hadoop security (kerberos etc.)
- More network-layer security (different subnets etc.)
- Add rsync to crontab to sync conf files (see the sketch after this list)
- Better log management (especially for zookeeper)
- Central DNS management (instead of hostname updates)
- General optimizations related to other ecosystem tools (e.g. hbase changes some hdfs parameters)
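For the rsync-to-crontab item, a minimal sketch (conf path and worker hostnames are assumptions; point it at wherever the playbooks place the hadoop conf directory):

```
# crontab entry on the master (crontab -e): every 10 minutes push the conf dir to each worker
*/10 * * * * for h in worker-1 worker-2 worker-3; do rsync -az /opt/hadoop/etc/hadoop/ $h:/opt/hadoop/etc/hadoop/; done
```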
and also consider
- the free Cloudera distribution for better hadoop management,
- and Ansible for on-premise configuration management and provisioning