git clone
pocket-ai -- A Portable Toolkit for deploying Edge AI and HPC.
cux -- An experimental framework for performance analysis and optimization of CUDA kernel functions.
https://github.com/cjmcv/hpc/tree/master/0-frameworks/cux
tag: cuda / simd / openmp.
mrpc -- Mini-RPC, based on asio.
https://github.com/cjmcv/hpc/tree/master/0-frameworks/mrpc
tag: distributed computing.
DEPRECATED
hcs -- A heterogeneous computing system for multi-task scheduling optimization.
vky -- A Vulkan-based computing framework.
"hcs" and "vky" have been moved to pocket-ai and renamed as graph and vk respectively.
mpi/mpi4py
- alg_matrix_multiply : gemm: C = A * B.
- base_broadcast_scatter_gather : Record the basic usage of Bcast, Scatter, Gather and Allgather.
- base_group : Group communication.
- base_hello_world : Environment Management Routines.
- base_reduce_alltoall_scan : Record the basic usage of Reduce, Allreduce, Alltoall, Scan and Exscan.
- base_send_recv : Record the basic usage of MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv.
- base_type_contiguous : Send and receive custom types of data by using MPI_Type_contiguous.
- base_type_struct : Send and receive custom types of data by using MPI_Type_struct.
- util_bandwidth_test : Test bandwidth by point-to-point communications.
- py_base_broadcast_scatter_gather : Record the basic usage of Bcast, Scatter, Gather and Allgather.
- py_base_reduce_scan : Record the basic usage of Reduce and Scan.
- py_base_send_recv : Record the basic usage of Send and Recv.
cuda
- cuda_util : Utility functions.
- alg_histogram : Histogram. Mainly introduces atomicAdd.
- alg_matrix_multiply : gemm: C = A * B.
- alg_vector_add : Vector addition: C = A + B.
- alg_vector_dot_product : Vector dot product: h_result = SUM(A * B).
- alg_vector_scan : Scan. Prefix Sum.
- base_aligned_memory_access : An experiment on aligned memory access.
- base_bank_conflict : An experiment on Bank Conflict in Shared Memory.
- base_coalesced_memory_access : An experiment on coalesced memory access.
- base_float2half : Record the basic usage of float2half.
- base_graph : Record the basic usage of cuda graph.
- base_hyperQ : Demonstrate how HyperQ allows supporting devices to avoid false dependencies between kernels in different streams.
- base_kernel_layout : Record the basic execution configuration of kernel.
- base_occupancy : Record the basic usage of cudaOccupancyMaxPotentialBlockSize.
- base_texture : Record the basic usage of Texture Memory.
- base_unified_memory : A simple task consumer using threads and streams with all data in Unified Memory.
- base_zero_copy : Record the basic usage of Zero Copy.
- cub_block_reduce : Simple demonstration of cub::BlockReduce.
- cub_block_scan : Simple demonstration of cub::BlockScan.
- cub_device_reduce : Simple demonstration of DeviceReduce::Sum.
- cub_device_scan : Simple demonstration of DeviceScan::ExclusiveSum.
- cub_warp_reduce : Simple demonstration of cub::WarpReduce.
- cub_warp_scan : Simple demonstration of cub::WarpScan.
- cublas_gemm_float16 : gemm: C = A * B. Use cublas with half-precision.
- thrust_iterators : Record the basic usage of Iterators in Thrust.
- thrust_sort : Sort arrays with Thrust.
- thrust_transformations : Some of the parallel vector operations in Thrust.
- thrust_vector : Record the basic usage of Vector in Thrust.
vulkan
opencl
- ocl_util : Utility functions.
- alg_dot_product : Vector dot product, h_result = SUM(A * B).
- alg_vector_add : Vector addition: C = A + B.
- base_platform_info : Query OpenCL platform information.
std
- alg_quick_sort: Quick sort using std::thread.
- alg_vector_dot_product: Vector dot product: h_result = SUM(A * B). Record the basic usage of std::thread and std::async.
- base_async: Record the basic usage of std::async.
- util_blocking_queue: Blocking queue. Mainly implemented by thread, queue and condition_variable.
- util_internal_thread: Internal Thread. Mainly implemented by std::thread.
- util_thread_pool: Thread Pool. Mainly implemented by thread, queue, future and condition_variable.
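
As an illustration of the blocking-queue idea listed above (a queue guarded by a mutex and a condition_variable), here is a minimal sketch. It is not the code from util_blocking_queue; the names `BlockingQueue`, `Push`, `Pop` and `SumThroughQueue` are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical minimal blocking queue; not the repository's implementation.
template <typename T>
class BlockingQueue {
 public:
  // Add an item, then wake one waiting consumer.
  void Push(T value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(value));
    }
    cond_.notify_one();
  }

  // Block until an item is available, then remove and return it.
  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate guards against spurious wake-ups.
    cond_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    return value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::queue<T> queue_;
};

// Usage: a producer thread pushes 1..n while the caller pops n items and sums them.
int SumThroughQueue(int n) {
  assert(n >= 0);
  BlockingQueue<int> queue;
  std::thread producer([&queue, n] {
    for (int i = 1; i <= n; ++i) queue.Push(i);
  });
  int sum = 0;
  for (int i = 0; i < n; ++i) sum += queue.Pop();
  producer.join();
  return sum;
}
```

The notification is sent after the lock is released so that the woken consumer does not immediately block on the mutex. On some toolchains the build may need `-pthread`.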
openmp
- alg_matrix_multiply : gemm: C = A * B.
- alg_pi_calculate : Calculate PI using parallel, for and reduction.
- base_flush : Records the basic usage of flush.
- base_mutex : Mutex operation in openmp, including critical, atomic, lock.
- base_parallel_for : Parallel and For.
- base_schedule : Records the basic usage of schedule.
- base_sections_single : Records the basic usage of Sections and Single.
- base_synchronous : Synchronous operation in openmp, including barrier, ordered and master.
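
The "parallel, for and reduction" combination mentioned for alg_pi_calculate can be sketched as follows. This is an assumed formulation (midpoint-rule integration of 4 / (1 + x^2) over [0, 1]), not necessarily the one used in the repository, and `CalculatePi` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Midpoint-rule estimate of pi as the integral of 4 / (1 + x^2) over [0, 1].
double CalculatePi(long num_steps) {
  assert(num_steps > 0);
  const double step = 1.0 / static_cast<double>(num_steps);
  double sum = 0.0;
  // Each thread accumulates a private copy of sum; OpenMP combines them
  // with + when the loop ends.
  #pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < num_steps; ++i) {
    const double x = (static_cast<double>(i) + 0.5) * step;
    sum += 4.0 / (1.0 + x * x);
  }
  return sum * step;
}
```

Compile with an OpenMP flag such as `-fopenmp` (GCC/Clang) to run the loop in parallel. Without it the pragma is ignored and the function computes the same estimate serially.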
tbb
- base_allocator : The basic use of allocator.
- base_atomic : The basic use of atomic.
- base_concurrent_hash_map : The basic use of concurrent_hash_map.
- base_concurrent_queue : The basic use of concurrent queue.
- base_mutex : The basic use of mutex in tbb.
- base_parallel_for : The basic use of parallel_for.
- base_parallel_reduce : The basic use of parallel_reduce.
- base_parallel_scan : The basic use of parallel_scan.
- base_parallel_sort : The basic use of parallel_sort.
- base_task_scheduler : The basic use of the task scheduler.
- count_strings : Count strings. Use the concurrent_hash_map.
libco
asyncio
- base_future: Record the basic usage of future.
- base_gather: Use gather to execute tasks in parallel.
- base_hello_world: Hello world. Record the basic usage of async, await and loop.
- base_loop_chain: Executes nested coroutines.
sse/avx
- matrix_multiply : Matrix Multiplication.
- matrix_transpose : Matrix Transpose.
- vector_dot_product : Vector dot product: result = SUM(A * B).
- vector_scan : Scan. Prefix Sum.
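
To show what a SIMD dot product like the one listed above might look like, here is a minimal SSE sketch. It is an illustration under assumptions, not the repository's vector_dot_product: `DotProduct` is a hypothetical name, unaligned loads are used for simplicity, and a scalar path is taken when SSE is not available at compile time.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// result = SUM(a[i] * b[i]) for i in [0, n).
float DotProduct(const float* a, const float* b, std::size_t n) {
  assert(a != nullptr && b != nullptr);
  float result = 0.0f;
  std::size_t i = 0;
#if defined(__SSE__)
  // Process four floats per iteration, keeping four partial sums in one register.
  __m128 acc = _mm_setzero_ps();
  for (; i + 4 <= n; i += 4) {
    const __m128 va = _mm_loadu_ps(a + i);  // unaligned load of a[i..i+3]
    const __m128 vb = _mm_loadu_ps(b + i);  // unaligned load of b[i..i+3]
    acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
  }
  // Horizontal sum of the four lanes.
  float lanes[4];
  _mm_storeu_ps(lanes, acc);
  result = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
  // Scalar tail (or the whole array when SSE is unavailable).
  for (; i < n; ++i) result += a[i] * b[i];
  return result;
}
```

Because the SIMD path adds the products in a different order than a plain loop, results on general floating-point data can differ slightly from a scalar reference.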
neon
- matrix_multiply : Matrix Multiplication.
- matrix_transpose : Matrix Transpose.