Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Speaker: Mark Saroufim
Notebook and slides in the lecture_001 folder

Lecture 2: Recap Ch. 1-3 from the PMPP book

Speaker: Andreas Koepf
Slides: The PowerPoint file cuda_mode_lecture2.pptx is in the lecture_002 folder of this repository. The slides are also available as a Google Docs presentation.

Lecture 3: Getting Started With CUDA

Speaker: Jeremy Howard
Notebook: See the lecture_003 folder, or run the Colab version

Lecture 4: Intro to Compute and Memory Architecture

Speaker: Thomas Viehmann
Notebook and slides in the lecture_004 folder.

Lecture 5: Going Further with CUDA for Python Programmers

Speaker: Jeremy Howard
Notebook in the lecture_005 folder.

Lecture 6: Optimizing PyTorch Optimizers

Speaker: Jane Xu
Slides

Lecture 7: Advanced Quantization

Speaker: Charles Hernandez
Slides

Lecture 8: CUDA Performance Checklist

Speaker: Mark Saroufim
Code in the lecture_008 folder
Slides

Lecture 9: Reductions

Speaker: Mark Saroufim
Code in the lecture_009 folder
Slides

Lecture 10: Build a Prod Ready CUDA Library

Speaker: Oscar Amoros Huguet
Slides

Lecture 11: Sparsity

Speaker: Jesse Cai
Slides

Lecture 12: Flash Attention

Speaker: Thomas Viehmann

Lecture 13: Ring Attention

Speaker: Andreas Koepf
Slides

Lecture 14: Practitioner's Guide to Triton

Date: 2024-04-13, Speaker: Umer Adil
Notebook

Lecture 15: CUTLASS

Speaker: Eric Auld

Lecture 16: Hands-On Profiling

Speaker: Taylor Robbie

Bonus Lecture: CUDA C++ llm.cpp

Speakers: Jake Hemstad & Georgii Evtushenko
Slides

Lecture 17: GPU Collective Communication (NCCL)

Speaker: Dan Johnson
Code in the lecture_017 folder

Lecture 18: Fused Kernels

Speaker: Kapil Sharma
Code in the lecture_018 folder

Lecture 19: Data Processing on GPUs

Speaker: Devavret Makkar

Lecture 20: Scan Algorithm

Speaker: Izzat El Haj
Slides

Lecture 21: Scan Algorithm Part 2

Speaker: Izzat El Haj
Slides

Lecture 22: Hacker's Guide to Speculative Decoding in vLLM

Speaker: Cade Daniel
Slides

Lecture 23: Tensor Cores

Speakers: Vijay Thakkar & Pradeep Ramani
Slides

Lecture 24: Scan at the Speed of Light

Speakers: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

Speaker: Haocong Wang
Slides

Lecture 26: SYCL MODE (Intel GPU)

Speaker: Patric Zhao
Slides

Lecture 27: gpu.cpp

Speaker: Austin Huang
Slides

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Speaker: Kapil Sharma
Code/presentation in the lecture_029 folder

Lecture 30: Quantized training

Speaker: Thien Tran
Code/presentation in the lecture_030 folder

Lecture 31: Beginner's Guide to Metal Kernels

Speaker: Nikita Shulga
Code/presentation in the lecture_031 folder

Lecture 32: Unsloth - LLM Systems Engineering

Speaker: Daniel Han
Slides

Lecture 33: BitBLAS

Speaker: Wang Lei
Code/presentation in the lecture_033 folder

Lecture 34: Low Bit Triton Kernels

Speaker: Hicham Badri
Slides

Lecture 35: SGLang Performance Optimization

Speaker: Yineng Zhang
Slides

Lecture 36: CUTLASS and Flash Attention 3

Speaker: Jay Shah
Slides

Lecture 37: Introduction to SASS & GPU Microarchitecture

Speaker: Arun Demeure
Slides

Lecture 38: Lowbit kernels for ARM CPU

Speaker: Scott Roy
Slides
