spark-k8s

Running Spark on Kubernetes (K8s)

Spark can run on clusters managed by Kubernetes. This feature makes use of the native Kubernetes scheduler that has been added to Spark.

The Kubernetes scheduler is currently experimental. In future versions, there may be behavioral changes around configuration, container images and entrypoints.

Prerequisites

A running Kubernetes cluster, version 1.6 or newer.

kubectl cluster-info
Kubernetes master is running at https://192.168.7.70/clusters/c-8bhgq
KubeDNS is running at https://192.168.7.70/k8s/clusters/c-8bhgq/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
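To confirm the version requirement, you can also query the client and server versions directly (a quick optional check):

# Optional: confirm the server version is 1.6 or newer
kubectl version --short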

A runnable distribution of Spark 2.3 or newer.

  1. Build the Apache Spark Docker image (e.g. Apache Spark 2.4.3).
# Download a pre-built Spark
cd examples/Spark/
wget http://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz

# Extract Spark
tar xzf spark-2.4.3-bin-hadoop2.7.tgz

# Build Spark Docker image
cd spark-2.4.3-bin-hadoop2.7
./bin/docker-image-tool.sh -r dwarfcu -t 2.4.3 build
  2. Upload the Spark Docker image to a registry (e.g. Docker Hub).
docker login

docker push dwarfcu/spark:2.4.3
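As a sanity check, you can confirm the image and tag exist locally before (or after) the push; the name dwarfcu/spark:2.4.3 comes from the build step above:

# Optional: verify the image was built and tagged as expected
docker images dwarfcu/spark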

Submitting Applications to Kubernetes

Cluster Mode

  1. Create the Namespace, Service Account and Role Binding (a sketch of what such a manifest contains follows the command below).
kubectl create -f 00-env.yaml
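The actual contents of 00-env.yaml live in the repository. For readers without the repo, a minimal manifest of the same shape might look like the sketch below; the names spark and spark-sa are taken from the commands in this README, while the RoleBinding name and the use of the built-in edit ClusterRole are assumptions (Spark needs permission to create and watch pods in the namespace):

apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa
  namespace: spark
---
# Assumption: binding spark-sa to the built-in "edit" ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-sa-edit
  namespace: spark
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: spark-sa
  namespace: spark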
  2. Submit the Spark application (the Pi example).
kubectl run spark --image=dwarfcu/spark:2.4.3 -it -n spark --serviceaccount=spark-sa --restart='Never' --rm=true \
 -- /opt/spark/bin/spark-submit \
 --name spark-pi \
 --master k8s://https://192.168.7.70:6443 \
 --deploy-mode cluster \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.kubernetes.driver.pod.name=pi-driver-1 \
 --conf spark.kubernetes.namespace=spark \
 --conf spark.executor.instances=3 \
 --conf spark.kubernetes.container.image=dwarfcu/spark:2.4.3 \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
 --conf spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
 --conf spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
 local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar 10000

Note that the example above references the jar with a local:// URI scheme; this URI points to the example jar that is already inside the Docker image.
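While the application runs, you can watch the pods being created and torn down; with spark.executor.instances=3 you should see pi-driver-1 plus three executor pods (executor pod names are generated by Spark):

# Optional: watch driver and executor pods while the job runs
kubectl get pods -n spark -w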

  3. Check the driver logs.
kubectl logs -n spark pi-driver-1
...
Pi is roughly 3.141641147141641
...
  4. Delete the pi-driver-1 pod.
kubectl delete pod -n spark pi-driver-1
  5. Delete the Namespace, Service Account and Role Binding.
kubectl delete -f 00-env.yaml
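To confirm the cleanup completed, you can check that the namespace is gone (the command reports NotFound once deletion finishes):

kubectl get namespace spark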

Submitting a personal Spark (Java) project

  1. Compile a Spark (Java) project, e.g. WordCount.
cd wordCount
mvn clean package
cd ..
  2. Create the Namespace, Service Account and Role Binding.
kubectl create -f 00-env.yaml
  3. Create a ConfigMap from a binary file, i.e. the JAR file.
kubectl create configmap -n spark wordcount-jar --from-file=wordCount-1.0.jar.file=wordCount/target/wordCount-1.0.jar

# Check that the ConfigMap has been created correctly
kubectl get configmaps -n spark wordcount-jar -o yaml
apiVersion: v1
binaryData:
  wordCount-1.0.jar.file: UEsDBAoACA...
...
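Keep in mind that a ConfigMap is limited to roughly 1 MiB, so this approach only suits small application jars; a quick size check on the artifact (path taken from the create command above):

# Optional: confirm the jar is small enough to fit in a ConfigMap
ls -lh wordCount/target/wordCount-1.0.jar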
  4. Create a pod that submits the Spark (Java) application (the ConfigMap is mounted as a volume).
kubectl create -f 20-spark-wordCount.yaml
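Once the pod is running, you can follow the application the same way as in the Pi example; the driver pod name wordcount-driver-1 is the one removed in the cleanup step below:

kubectl get pods -n spark

kubectl logs -f -n spark wordcount-driver-1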
  5. Delete the environment.
kubectl delete pod -n spark wordcount-driver-1

kubectl delete -f 20-spark-wordCount.yaml

kubectl delete -f 00-env.yaml

Alternatively, you can delete just the spark namespace; every resource contained in it will be deleted along with it.

kubectl delete namespaces spark
