Add README.org
isabekov committed Mar 29, 2024
1 parent 8e87193 commit 12c72a6
Showing 2 changed files with 96 additions and 0 deletions.
96 changes: 96 additions & 0 deletions README.org
@@ -0,0 +1,96 @@
* PySpark Cookbook
A collection of standalone, copy-pasteable PySpark code snippets, each shown with its output, that explain the behavior of commonly used functions.

The source code of the snippets is rendered as HTML and hosted at http://isabekov.github.io/pyspark-cookbook/.

* Development Environment
Emacs with ~org-mode~ is used as the development environment. Compared to Jupyter notebooks, the source code is easier to keep in a version control system because it is plain text.
#+CAPTION: Emacs and org-mode can replace Jupyter notebooks.
#+NAME: fig:example
[[./screenshots/example.png]]

* Development Dependencies
| Software | Version | Comment |
|--------------------------+----------+--------------------------------------------------------------------------------------------|
| Emacs | 29.2 | main development environment |
| Python | 3.11.6 | works with pyspark >= 3.4.0, [[https://stackoverflow.com/questions/75048688/picklingerror-could-not-serialize-object-indexerror-tuple-index-out-of-range][(see discussion)]] |
| python-pyspark | 3.4.0 | Python API for Spark (large-scale data processing library) |
| python-py4j | 0.10.9.7 | enables Python programs to dynamically access Java (dependency of PySpark) |
| python-pandas | 2.0.2 | Python data analysis library |
| python-pyarrow | 15.0.0 | bindings to Apache Arrow (dependency of PySpark) |
| python-tabulate | 0.9.0 | needed to convert dataframes into org-table format |
| Java Runtime Environment | 17.0.10 | newer versions do not work with PySpark 3.4.0 |
| PYNT (Emacs package) | >1.0 | interactive kernel for Python in Emacs (see the [[https://github.com/ebanner/pynt][repository]] for installation instructions) |
| org-export | 64ac299 | command line tool needed for HTML export, requires Emacs (see [[https://github.com/nhoffman/org-export/tree/64ac299c041877620c2cadba83ded44f46c4e124][repository]]) |

* Install Python and Python Packages
Depending on the operating system, install ~Python~ and the packages ~py4j, pyspark, pandas, pyarrow, tabulate~ using the corresponding package manager and ~pip~.
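
As a quick sanity check after installation, the following sketch prints the installed version of each package next to the version listed in the dependency table. The ~EXPECTED~ dictionary and the ~installed_versions~ helper are illustrative names that are not part of the cookbook, and the distribution names are assumed to match the ones used by ~pip~.

#+begin_src python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the dependency table of this README.
EXPECTED = {
    "pyspark": "3.4.0",
    "py4j": "0.10.9.7",
    "pandas": "2.0.2",
    "pyarrow": "15.0.0",
    "tabulate": "0.9.0",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(EXPECTED).items():
        print(f"{name:10} installed: {ver or 'missing':10} tested with: {EXPECTED[name]}")
#+end_src

A package reported as ~missing~ was not found by ~importlib.metadata~ in the current Python environment.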

* PYNT Installation
Install the ~codebook~ module with the ~pip~ package manager:
#+begin_src shell
$ pip install git+https://github.com/ebanner/pynt
#+end_src

On Arch Linux, ~pip~ refuses to install packages system-wide by default, so pass an extra argument:
#+begin_src shell
$ pip install --break-system-packages git+https://github.com/ebanner/pynt
#+end_src

Open Emacs and install ~pynt~ through MELPA:
#+begin_src text
M-x package-install RET pynt RET
#+end_src
where ~RET~ is the "Enter" key.

To fix the following error during evaluation of code blocks:
#+begin_src text
ModuleNotFoundError: No module named 'notebook.services'
#+end_src

Locate the ~MappingKernelManager~ import in the installed ~codebook~ module of PYNT (the output shown here already matches the corrected line):
#+begin_src shell
$ grep -i kernelmanager /usr/lib/python3.11/site-packages/codebook/manager.py
from jupyter_server.services.kernels.kernelmanager import MappingKernelManager
#+end_src
This import is defined in the [[https://github.com/ebanner/pynt/blob/86cf9ce78d34f92bfd0764c9cbb75427ebd429e6/codebook/manager.py#L15][source code]]. If it still refers to ~notebook.services~, change that line in ~manager.py~ to
#+begin_src python
from jupyter_server.services.kernels.kernelmanager import MappingKernelManager
#+end_src
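
The same edit can be scripted. The sketch below assumes that the offending line imports from ~notebook.services~, which is inferred from the error message rather than confirmed; ~patch_import~ and ~patch_file~ are hypothetical helpers, not part of PYNT.

#+begin_src python
import re
from pathlib import Path

# Assumption: the old import starts with "from notebook.services."
# (inferred from the ModuleNotFoundError message).
OLD_IMPORT = re.compile(r"^([ \t]*)from notebook\.services\.", flags=re.MULTILINE)


def patch_import(source: str) -> str:
    """Point imports of notebook.services.* to jupyter_server.services.*."""
    return OLD_IMPORT.sub(r"\1from jupyter_server.services.", source)


def patch_file(path: Path) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    original = path.read_text()
    patched = patch_import(original)
    if patched != original:
        path.write_text(patched)
    return patched != original
#+end_src

For example, ~patch_file(Path("/usr/lib/python3.11/site-packages/codebook/manager.py"))~ would patch the file from the ~grep~ command; the path depends on the Python installation, and writing to it may require elevated privileges.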

* Java Runtime Installation
PySpark Cookbook's recipes were tested in Emacs with Java Runtime Environment 17.0.10. Set it as the default:
#+begin_src shell
$ export JAVA_HOME=/usr/lib/jvm/java-17-openjdk
# -f replaces an existing link; -n keeps ln from following it into the old target directory
$ sudo ln -sfn /usr/lib/jvm/java-17-openjdk /usr/lib/jvm/default
#+end_src
Newer versions of Java are not compatible with PySpark v3.4.0.
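
To confirm which Java version is picked up, the output of ~java -version~ can be parsed. This is a minimal sketch: ~parse_java_major~ and ~current_java_major~ are hypothetical helpers, and the regular expression assumes the usual ~version "17.0.10"~ style of output.

#+begin_src python
import re
import subprocess


def parse_java_major(version_output: str):
    """Extract the major version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x (for example "1.8.0_392").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def current_java_major():
    """Run `java -version`; the JVM prints its version to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
#+end_src

Calling ~current_java_major()~ raises ~FileNotFoundError~ if no ~java~ executable is on the ~PATH~.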

* Install org-export
#+begin_src shell
$ git clone https://github.com/nhoffman/org-export.git
$ cd org-export
$ sudo install -D -m 755 org-export* /usr/local/bin
#+end_src

* Export to HTML
To produce the [[http://isabekov.github.io/pyspark-cookbook/][HTML page with PySpark code snippets]], run:
#+begin_src shell
$ make index.html
#+end_src

To render the examples that convert PySpark tables from the ~pretty~ format to the ~orgtbl~ format (both formats are described in the [[https://pypi.org/project/tabulate/0.3/][tabulate package]] documentation), run:
#+begin_src shell
$ make test_ps2org.html
#+end_src
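
To illustrate what such a conversion involves, here is a minimal sketch that turns a table with ~+---+~ borders into org-table syntax. The function ~pretty_to_orgtbl~ is a hypothetical stand-in written for this example; it is not the repository's ~pretty2orgtbl~, whose implementation may differ.

#+begin_src python
def is_border(line: str) -> bool:
    """A border line in the pretty format looks like +---+-----+."""
    return line.startswith("+") and line.endswith("+")


def pretty_to_orgtbl(pretty: str) -> str:
    """Convert a table with +---+ borders to org-table format."""
    lines = [line.strip() for line in pretty.strip().splitlines()]
    # Org tables have no top or bottom border.
    if lines and is_border(lines[0]):
        lines = lines[1:]
    if lines and is_border(lines[-1]):
        lines = lines[:-1]
    # Remaining borders become org separators: +---+---+ turns into |---+---|.
    return "\n".join(
        "|" + line[1:-1] + "|" if is_border(line) else line for line in lines
    )


if __name__ == "__main__":
    sample = "\n".join(
        ["+---+-----+", "| id| name|", "+---+-----+", "|  1|Alice|", "+---+-----+"]
    )
    print(pretty_to_orgtbl(sample))
#+end_src

The sketch only rewrites the border lines and leaves the cell contents and their padding untouched.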

* Execution of Code Blocks in org-mode
Navigate to any snippet *outside* the "Functions" chapter (which only provides service functions for post-processing the output).
Make sure that the cursor is inside a Python code block:
#+begin_src org
,#+begin_src python :post pretty2orgtbl(data=*this*)
...
,#+end_src
#+end_src

Press ~C-c C-c~ (i.e. ~Ctrl-c~ twice). Emacs will execute the source code block inside a Python session and display the output.
Binary file added screenshots/example.png