The goal of the Advanced Apache Spark for Developers Workshop is to build a deeper understanding of the internals of Apache Spark (Spark Core) and of the modules in Apache Spark 2 (Spark SQL, Spark Structured Streaming and Spark MLlib). The workshop will teach you performance tuning of Apache Spark applications as well as the more advanced features of Apache Spark 2.
NOTE The workshop uses the latest Apache Spark 2.2.0 and is particularly well-suited to Spark developers who have worked with Apache Spark 1.x.
The workshop follows a very intense learn-by-doing approach in which the modules start with just enough knowledge to get you going and quickly move on to applying the concepts in practical exercises.
The workshop includes many practical sessions that should meet (and quite likely exceed) the expectations of software developers with significant experience in Apache Spark and a good knowledge of Scala, as well as senior administrators, operators, devops engineers, and senior support engineers.
CAUTION: The workshop is very hands-on and practical, i.e. not for the faint-hearted. Seriously! After just a couple of days, your mind, eyes, and hands will all be trained to recognise the patterns of how to set up and operate Spark infrastructure for your Big Data and Predictive Analytics projects.
5 days
- Experienced Software Developers
- Good knowledge of Scala
- Significant experience in Apache Spark 1.x
- Senior Administrators
- Senior Support Engineers
- Anatomy of Spark Core Data Processing
- SparkContext and SparkConf
- Transformations and Actions
- Units of Physical Execution: Jobs, Stages, Tasks and Job Groups
- RDD Lineage
- DAG View of RDDs
- Logical Execution Plan
- Spark Execution Engine
- DAGScheduler
- TaskScheduler
- Scheduler Backends
- Executor Backends
- Partitions and Partitioning
- Shuffle
- Caching and Persistence
- Checkpointing
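The Spark Core concepts above can be sketched in a short spark-shell session (a minimal, illustrative example; it assumes a running shell where `sc` is the active SparkContext):

```scala
// Transformations (filter, map) are lazy and only extend the RDD lineage;
// the action (count) triggers a job that is split into stages and tasks.
val nums    = sc.parallelize(1 to 100, numSlices = 4)  // RDD with 4 partitions
val evens   = nums.filter(_ % 2 == 0)                  // transformation: lazy
val doubled = evens.map(_ * 2)                         // transformation: lazy

doubled.cache()                    // mark for caching on first materialization
println(doubled.toDebugString)     // inspect the RDD lineage (DAG view)
println(doubled.count())           // action: submits a job; 50 even numbers
```

Running the action a second time reads the cached partitions instead of recomputing the lineage, which you can confirm on the Storage tab of the web UI.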
- Elements of Spark Runtime Environment
- The Driver and Executors
- Deploy Modes
- Spark Clusters
- Master and Workers
- Spark Tools
  - spark-shell
  - spark-submit
  - spark-class
- Troubleshooting and Monitoring
- web UI
- Log Files
- SparkListeners
- StatsReportListener
- Event Logging using EventLoggingListener and History Server
- Exercise: Event Logging using EventLoggingListener
- Exercise: Developing Custom SparkListener
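A custom SparkListener, as in the exercise above, boils down to extending `org.apache.spark.scheduler.SparkListener` and overriding the callbacks you care about. A minimal sketch (the listener name and log messages are made up for illustration):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Hypothetical listener that logs job lifecycle events to stdout.
class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stage(s)")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

// Register it in spark-shell (sc is the active SparkContext):
// sc.addSparkListener(new JobLoggingListener)
```

Alternatively, register it at startup with the `spark.extraListeners` configuration property so it receives events from the very beginning of the application.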
- Spark Metrics System
- Tuning Spark Infrastructure
- Exercise: Configuring CPU and Memory for Driver and Executors
- Scheduling Modes: FIFO and FAIR
- Exercise: Configuring Pools in FAIR Scheduling Mode
- SparkSession
- Dataset, DataFrame and Encoders
- QueryExecution — Query Execution of Dataset
- Exercise: Debugging Query Execution
- web UI
- DataSource API
- Columns, Operators, Standard Functions and UDFs
- Joins
- Basic Aggregation
- groupBy and groupByKey operators
- Case Study: Number of Partitions for groupBy Aggregation
- Windowed Aggregation
- Multi-Dimensional Aggregation
- Caching and Persistence
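The difference between the untyped groupBy and the typed groupByKey operators covered above can be shown in a few lines (a sketch assuming a spark-shell session where `spark` is the SparkSession; the sample data is made up):

```scala
import spark.implicits._
import org.apache.spark.sql.functions._

val sales = Seq(("east", 10), ("west", 5), ("east", 7)).toDF("region", "amount")

// groupBy is untyped: it works on Columns and returns a DataFrame.
val byRegion = sales.groupBy("region").agg(sum("amount") as "total")
byRegion.show()

// groupByKey is the typed variant on Dataset[T]; the grouping key is a function.
val byKey = sales.as[(String, Int)]
  .groupByKey(_._1)          // KeyValueGroupedDataset[String, (String, Int)]
  .mapValues(_._2)
  .reduceGroups(_ + _)
```

Note that groupByKey shuffles with as many partitions as `spark.sql.shuffle.partitions` (200 by default), which is the starting point for the case study on the number of partitions.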
- Catalyst — Tree Manipulation Framework
- Expressions, LogicalPlans and SparkPlans
- Logical and Physical Operators
- Analyzer — Logical Query Plan Analyzer
- SparkOptimizer — Logical Query Optimizer
- Logical Plan Optimizations
- SparkPlanner — Query Planner with no Hive Support
- Execution Planning Strategies
- Physical Plan Preparations Rules
- Tungsten Execution Backend (aka Project Tungsten)
- Whole-Stage Code Generation (aka Whole-Stage CodeGen)
- InternalRow and UnsafeRow
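The Catalyst and Tungsten topics above are easiest to explore from spark-shell, since every Dataset exposes its QueryExecution. A minimal sketch (assumes `spark` is the SparkSession):

```scala
import spark.implicits._

val q = spark.range(10).filter('id % 2 === 0)

// Parsed, analyzed, and optimized logical plans plus the physical plan;
// operators subject to Whole-Stage Code Generation are marked with a star (*).
q.explain(extended = true)

// The QueryExecution behind the Dataset is accessible directly:
println(q.queryExecution.optimizedPlan)

// Internal debugging API to dump the Java code generated by whole-stage codegen:
import org.apache.spark.sql.execution.debug._
q.debugCodegen()
```

Stepping through these plans is the core of the "Debugging Query Execution" exercise.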
- ML Pipelines and PipelineStages (spark.ml)
- ML Pipeline Components
- Transformers
- Estimators
- Models
- Evaluators
- CrossValidator
- Params (and ParamMaps)
- Supervised and Unsupervised Learning with Spark MLlib
- Classification and Regression
- Clustering
- Collaborative Filtering
- Model Selection and Tuning
- ML Persistence — Saving and Loading Models and Pipelines
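The spark.ml topics above fit together in a single pipeline: Transformers and Estimators are chained into a Pipeline (itself an Estimator), and fitting it yields a PipelineModel that can be saved and reloaded. A minimal sketch (column names, the training Dataset, and the save path are illustrative placeholders):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)   // Estimator

// A Pipeline chains Transformers and Estimators into one Estimator.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fitting produces a PipelineModel (a Transformer) that supports ML persistence:
// val model = pipeline.fit(training)
// model.write.overwrite().save("/tmp/my-pipeline-model")
// val reloaded = org.apache.spark.ml.PipelineModel.load("/tmp/my-pipeline-model")
```

Wrapping the pipeline in a CrossValidator with a ParamMap grid is the same pattern, covered under Model Selection and Tuning.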
- Training classes are best for groups up to 12 participants
- Participants should have reasonably capable computers, preferably running Linux or macOS
- There are known issues with running Spark on Windows (mostly with Spark SQL / Hive)
- Participants should install the following packages:
- Apache Spark 2.2
- Java SE Development Kit 8
- IntelliJ IDEA Community Edition with the Scala plugin
- sbt
- Apache Kafka 0.11.0.1
- PostgreSQL 10 or any other relational database
- Participants should download the following packages: