Spark can run on clusters managed by Kubernetes. This feature makes use of the native Kubernetes scheduler that has been added to Spark.
The Kubernetes scheduler is currently experimental. In future versions, there may be behavioral changes around configuration, container images and entrypoints.
- Check access to the Kubernetes cluster.
kubectl cluster-info
Kubernetes master is running at https://192.168.7.70/clusters/c-8bhgq
KubeDNS is running at https://192.168.7.70/k8s/clusters/c-8bhgq/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
- Build the Apache Spark Docker image (e.g. Apache Spark 2.4.3).
# Download a pre-built Spark
cd examples/Spark/
wget http://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
# Extract Spark
tar xzf spark-2.4.3-bin-hadoop2.7.tgz
# Build Spark Docker image
cd spark-2.4.3-bin-hadoop2.7
./bin/docker-image-tool.sh -r dwarfcu -t 2.4.3 build
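Before pushing, you can confirm the image exists locally (dwarfcu is the example repository name used throughout this walkthrough):
# List the locally built Spark images
docker images dwarfcu/spark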
- Upload the Spark Docker image to a registry (e.g. Docker Hub).
docker login
docker push dwarfcu/spark:2.4.3
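Alternatively, the same helper script used for the build can push every image it produced in one go:
./bin/docker-image-tool.sh -r dwarfcu -t 2.4.3 push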
- Create Namespace, Service Account and Role Binding.
kubectl create -f 00-env.yaml
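The contents of 00-env.yaml are not reproduced here. A minimal sketch of the three resources it has to create could look like the following (the spark namespace and spark-sa service account names come from the commands in this walkthrough; the RoleBinding name spark-rb and the choice of the built-in edit ClusterRole are assumptions):
apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-rb            # assumed name
  namespace: spark
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                # built-in ClusterRole; lets the service account create and delete pods
subjects:
- kind: ServiceAccount
  name: spark-sa
  namespace: spark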
- Submit Spark application (Pi example).
kubectl run spark --image=dwarfcu/spark:2.4.3 -it -n spark --serviceaccount=spark-sa --restart='Never' --rm=true \
-- /opt/spark/bin/spark-submit \
--name spark-pi \
--master k8s://https://192.168.7.70:6443 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.driver.pod.name=pi-driver-1 \
--conf spark.kubernetes.namespace=spark \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.container.image=dwarfcu/spark:2.4.3 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
--conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar 10000
Notice that the example above specifies the jar with a local:// URI scheme. This URI points to the example jar that is already inside the Docker image, so nothing needs to be uploaded at submission time.
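While the application runs, you can watch the driver and executor pods being created and torn down:
kubectl get pods -n spark -w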
- Check logs.
kubectl logs -n spark pi-driver-1
...
Pi is roughly 3.141641147141641
...
- Delete pi-driver-1 pod.
kubectl delete pod -n spark pi-driver-1
- Delete Namespace, Service Account and Role Binding.
kubectl delete -f 00-env.yaml
- Compile a Spark (Java) project, e.g. WordCount.
cd wordCount
mvn clean package
cd ..
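The ConfigMap step below reads the build artifact from wordCount/target, so it is worth confirming the JAR is there first:
ls -lh wordCount/target/wordCount-1.0.jar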
- Create Namespace, Service Account and Role Binding.
kubectl create -f 00-env.yaml
- Create a ConfigMap from a binary file, i.e. the JAR file (note that ConfigMaps are capped at 1 MiB, so this approach only suits small JARs).
kubectl create configmap -n spark wordcount-jar --from-file=wordCount-1.0.jar.file=wordCount/target/wordCount-1.0.jar
# Check that the ConfigMap has been created correctly
kubectl get configmaps -n spark wordcount-jar -o yaml
apiVersion: v1
binaryData:
wordCount-1.0.jar.file: UEsDBAoACA...
...
- Create a pod that submits the Spark (Java) application (the ConfigMap is mounted into it as a volume); a sketch of what such a manifest might look like follows the command below.
kubectl create -f 20-spark-wordCount.yaml
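The contents of 20-spark-wordCount.yaml are not reproduced here. One arrangement that works on Spark 2.4 is to run spark-submit in client mode from a pod that mounts the wordcount-jar ConfigMap, so the pod itself becomes the driver; client mode additionally needs a headless Service so executors can reach the driver. The sketch below is an assumption about the manifest's shape, not its actual contents: the Service name, ports, mount path and the WordCount main class are all illustrative.
apiVersion: v1
kind: Service
metadata:
  name: wordcount-driver-svc   # assumed name; headless service for executor -> driver traffic
  namespace: spark
spec:
  clusterIP: None
  selector:
    app: wordcount-driver
  ports:
  - name: driver
    port: 7077
  - name: blockmanager
    port: 7079
---
apiVersion: v1
kind: Pod
metadata:
  name: wordcount-driver-1
  namespace: spark
  labels:
    app: wordcount-driver
spec:
  serviceAccountName: spark-sa
  restartPolicy: Never
  volumes:
  - name: wordcount-jar        # backed by the ConfigMap created in the previous step
    configMap:
      name: wordcount-jar
  containers:
  - name: driver
    image: dwarfcu/spark:2.4.3
    volumeMounts:
    - name: wordcount-jar
      mountPath: /opt/jars     # assumed mount path; the ConfigMap key becomes the file name
    command:
    - /opt/spark/bin/spark-submit
    - --name
    - wordcount
    - --master
    - k8s://https://192.168.7.70:6443
    - --deploy-mode
    - client
    - --class
    - WordCount                # assumed main class of the example project
    - --conf
    - spark.kubernetes.namespace=spark
    - --conf
    - spark.kubernetes.container.image=dwarfcu/spark:2.4.3
    - --conf
    - spark.executor.instances=3
    - --conf
    - spark.driver.host=wordcount-driver-svc.spark.svc.cluster.local
    - --conf
    - spark.driver.port=7077
    - --conf
    - spark.blockManager.port=7079
    - /opt/jars/wordCount-1.0.jar.file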
- Delete the environment.
kubectl delete pod -n spark wordcount-driver-1
kubectl delete -f 20-spark-wordCount.yaml
kubectl delete -f 00-env.yaml
Alternatively, you can delete just the spark namespace; every resource it contains is deleted along with it.
kubectl delete namespaces spark