A collection of copy-pasteable, standalone PySpark code snippets with corresponding output that explains the behavior of commonly used functions.
The source code of the snippets is rendered as HTML and hosted at http://isabekov.github.io/pyspark-cookbook/.
Emacs with org-mode is used as the development environment. Compared to Jupyter notebooks, the source code is easier to keep in a version control system since it is just plain text.
Software | Version | Comment |
---|---|---|
Emacs | 29.2 | main development environment |
Python | 3.11.6 | works with pyspark >= 3.4.0 (see discussion) |
python-pyspark | 3.4.0 | Python API for Spark (large-scale data processing library) |
python-py4j | 0.10.9.7 | enables Python programs to dynamically access Java (dependency of PySpark) |
python-pandas | 2.0.2 | Python data analysis library |
python-pyarrow | 15.0.0 | bindings to Apache Arrow (dependency of PySpark) |
python-tabulate | 0.9.0 | needed to convert dataframes into org-table format |
Java Runtime Environment | 17.0.10 | newer versions do not work with PySpark 3.4.0 |
PYNT (Emacs package) | >1.0 | interactive kernel for Python in Emacs; read the installation instructions (see repository) |
org-export | 64ac299 | command line tool needed for HTML export, requires Emacs (see repository) |
GNU readline | 8.2.13 | library needed for correct invocation of Python in Emacs on macOS |
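To compare a local environment against the versions in the table, a minimal sketch along the following lines may help. It assumes the pip distribution names `pyspark`, `py4j`, `pandas`, `pyarrow`, and `tabulate`; the function name `compare_versions` is hypothetical and not part of the cookbook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the table above; pip distribution names are assumed.
TESTED_VERSIONS = {
    "pyspark": "3.4.0",
    "py4j": "0.10.9.7",
    "pandas": "2.0.2",
    "pyarrow": "15.0.0",
    "tabulate": "0.9.0",
}


def compare_versions(tested):
    """Return {package: (tested_version, installed_version_or_None)}."""
    report = {}
    for name, expected in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed)
    return report


if __name__ == "__main__":
    for name, (expected, installed) in compare_versions(TESTED_VERSIONS).items():
        status = "missing" if installed is None else installed
        print(f"{name}: tested with {expected}, found {status}")
```

A mismatch does not necessarily mean the snippets will fail; the table only records what the recipes were tested with.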
Depending on the operating system, install Python and the packages py4j, pyspark, pandas, pyarrow, and tabulate using the corresponding package manager and pip.
Install GNU readline:
$ pip install gnureadline
Replace libedit with readline:
$ python -m override_readline
Details can be found here.
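To check which line-editing library the Python interpreter is actually using after this step, something like the sketch below could be run. The helper name `readline_backend` is hypothetical; the docstring check is a common heuristic for older Python versions, while `readline.backend` is only available from Python 3.13 on.

```python
def readline_backend():
    """Return "readline", "editline", or None if no readline module exists."""
    try:
        import readline
    except ImportError:
        return None
    # Python 3.13+ exposes the backend name directly.
    backend = getattr(readline, "backend", None)
    if backend is not None:
        return backend
    # Older versions: the libedit build mentions "libedit" in its docstring.
    return "editline" if "libedit" in (readline.__doc__ or "") else "readline"


if __name__ == "__main__":
    print(readline_backend())
```

If the result is still `editline`, the override most likely did not take effect for the interpreter that Emacs launches.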
Install the codebook module with the pip package manager:
$ pip install git+https://github.com/ebanner/pynt
On Arch Linux, pip refuses to install packages system-wide by default, so pass an extra argument:
$ pip install --break-system-packages git+https://github.com/ebanner/pynt
Open Emacs and install pynt through MELPA:
M-x package-install RET pynt
where RET is just the “Enter” key.
If evaluating code blocks fails with the following error:
ModuleNotFoundError: No module named 'notebook.services'
locate the offending import in the installed codebook module (part of PyNT):
$ grep -i kernelmanager /usr/lib/python3.11/site-packages/codebook/manager.py
from notebook.services.kernels.kernelmanager import MappingKernelManager
and change that line in manager.py to
from jupyter_server.services.kernels.kernelmanager import MappingKernelManager
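The manual edit above could also be scripted. The following is only a sketch: the old import line is an assumption inferred from the error message, and the helper names `find_manager_py` and `patch_import` are hypothetical.

```python
import importlib.util
from pathlib import Path

# Assumed original line, inferred from the ModuleNotFoundError above.
OLD_IMPORT = "from notebook.services.kernels.kernelmanager import MappingKernelManager"
NEW_IMPORT = "from jupyter_server.services.kernels.kernelmanager import MappingKernelManager"


def find_manager_py():
    """Return the path of codebook/manager.py, or None if codebook is not installed."""
    spec = importlib.util.find_spec("codebook")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "manager.py"


def patch_import(path):
    """Rewrite the old import in place; return True if the file was changed."""
    path = Path(path)
    source = path.read_text()
    if OLD_IMPORT not in source:
        return False  # already patched, or the line looks different
    path.write_text(source.replace(OLD_IMPORT, NEW_IMPORT))
    return True


if __name__ == "__main__":
    target = find_manager_py()
    print(target, patch_import(target) if target and target.exists() else "not found")
```

When the module lives under `/usr/lib`, the script needs write permission there, so it would have to be run with elevated privileges.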
The PySpark Cookbook recipes were tested in Emacs with Java Runtime Environment 17.0.10. Set it as the default:
$ export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
$ sudo ln -s /usr/lib/jvm/java-17-openjdk /usr/lib/jvm/default
Newer versions of Java are not compatible with PySpark v3.4.0.
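Before starting a Spark session, it can be useful to confirm which Java version is on the `PATH`. The sketch below parses the output of `java -version`; the helper names `parse_java_major` and `current_java_major` are hypothetical, and the regular expression assumes the usual `version "..."` banner printed by OpenJDK and Oracle builds.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report "1.8.0_...", so the major version is the second field.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def current_java_major():
    """Run `java -version` (which prints to stderr) and parse the result."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None  # no java executable on PATH
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Java major version:", current_java_major())
```

For the setup described here, the expected result is 17.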
Install org-export, the command line tool needed for HTML export:
$ git clone https://github.com/nhoffman/org-export.git
$ cd org-export
$ sudo install -D -m 755 org-export* /usr/local/bin
To produce the HTML page with PySpark code snippets, run:
$ make index.html
To render examples of converting PySpark tables displayed in the pretty format to the orgtbl format (see the tabulate package, which describes the formats), run:
$ make test_ps2org.html
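To illustrate what such a conversion involves, here is a standalone sketch that turns a table in the bordered layout printed by `DataFrame.show()` into org-table syntax. It is not the cookbook's own post-processing function; the name `pretty_to_orgtbl` is hypothetical, and the code assumes a rectangular table whose cell values contain no `|` characters.

```python
def pretty_to_orgtbl(text):
    """Convert a table in Spark's `show()` layout into org-table syntax."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("+"):
            rows.append(None)  # border line such as +---+----+
        elif line.startswith("|"):
            # Drop the outer pipes, then split the rest into cells.
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
    # Org tables have no top or bottom border.
    while rows and rows[0] is None:
        rows.pop(0)
    while rows and rows[-1] is None:
        rows.pop()
    data = [row for row in rows if row is not None]
    if not data:
        return ""
    # Column width is the longest cell in each column (rectangular table assumed).
    widths = [max(len(row[i]) for row in data) for i in range(len(data[0]))]
    out = []
    for row in rows:
        if row is None:
            out.append("|" + "+".join("-" * (w + 2) for w in widths) + "|")
        else:
            cells = " | ".join(cell.ljust(w) for cell, w in zip(row, widths))
            out.append("| " + cells + " |")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "+---+----+\n| id|name|\n+---+----+\n|  1| foo|\n+---+----+"
    print(pretty_to_orgtbl(sample))
```

Cells are left-aligned here for simplicity; Emacs realigns org tables on its own when the table is edited.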
Navigate to any snippet outside the “Functions” chapter (which only provides service functions for post-processing the output). Make sure that the cursor is inside a Python code block:
#+begin_src python :post pretty2orgtbl(data=*this*)
...
#+end_src
Press C-c C-c (i.e. Ctrl-c twice). Emacs will execute the source code block inside a Python session and display the output.