Thank you for making Python fast like C/C++. It's real :) #437
marioroy started this conversation in Show and tell
Replies: 2 comments
-
The parallel C and Codon demonstrations for counting and printing prime numbers live in the demos folder: https://github.com/marioroy/mce-sandbox
-
Xuedong Luo's practical sieve (Algorithm3) works well on the CPU and GPU. Thank you for making Codon. The GPU (NVIDIA GeForce RTX 3070) performs comparably to 12 CPU cores (AMD 3000 series).
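For context, here is a plain odd-only Sieve of Eratosthenes in Codon/Python. It is not Luo's Algorithm3 itself, just a simpler baseline; `count_primes` and `limit` are illustrative names and do not come from the mce-sandbox demos.

```python
# Plain odd-only Sieve of Eratosthenes, shown only as a baseline;
# the mce-sandbox demos use Xuedong Luo's more efficient Algorithm3.
def count_primes(limit: int) -> int:
    if limit < 2:
        return 0
    # is_composite[i] represents the odd number 2*i + 3
    size = (limit - 1) // 2
    is_composite = [False] * size
    count = 1  # account for the prime 2
    i = 0
    while True:
        n = 2 * i + 3
        if n * n > limit:
            break
        if not is_composite[i]:
            # mark odd multiples of n, starting at n*n
            j = (n * n - 3) // 2
            while j < size:
                is_composite[j] = True
                j += n
        i += 1
    for k in range(size):
        if not is_composite[k]:
            count += 1
    return count

print(count_primes(100))  # 25
```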
-
A friend mentioned Codon. I tried a small demo on the CPU and GPU, including C++ for comparison. Specifying the number of threads via the OMP_NUM_THREADS environment variable works too (a sketch of the parallel CPU loop follows the attached file names below). I had to build Codon from source to run on the GPU successfully. On the CPU, the Python solution runs as fast as C++. The performance on the GPU is mind-boggling. The demo does chunking, allowing one to specify a large N (e.g. 1 billion) without worrying about depleting GPU memory.
t_cpu.py
t_cpp.cc
t_gpu.py
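This is not the attached t_cpu.py, only a minimal sketch of the CPU pattern under stated assumptions: Codon's documented `@par` OpenMP annotation with a reduction on `total`, and the thread count taken from OMP_NUM_THREADS when no `num_threads` clause is given. `is_prime` and `N` are illustrative names, not from the demo.

```python
# Minimal Codon sketch (not the attached t_cpu.py): count primes in parallel
# with the @par OpenMP annotation. Without an explicit num_threads clause,
# the OMP_NUM_THREADS environment variable selects the thread count, e.g.
#   OMP_NUM_THREADS=12 ./t_cpu
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

N = 10000000
total = 0

@par(schedule='dynamic', chunk_size=10000)
for i in range(2, N):
    if is_prime(i):
        total += 1  # reduced across threads by @par

print(total)  # 664579 primes below 10 million
```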
I'm pre-allocating an array on the GPU (outside the loop) and retrieving it after the loop completes. The GPU threads increment their array elements by 1.
Is Codon able to determine from the code that I prefer to do the host-to-device transfer only once, for chunk_id == 0, since the array is constructed outside the loop? Ditto for sum (i.e. lazily transfer the memory from the GPU back to the host only after the loop). This is what I was aiming for.
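Likewise, this is not the attached t_gpu.py, only a minimal sketch of the chunked pattern described above, assuming Codon's documented gpu module (@gpu.kernel, gpu.thread.x, gpu.block.x, gpu.block.dim.x, and grid=/block= launch arguments). BLOCK, CHUNK, and the tally kernel are illustrative names; whether the counts list can stay resident on the device across launches, rather than being transferred at every call, is exactly the question above.

```python
# Minimal Codon GPU sketch (not the attached t_gpu.py), assuming the
# documented `gpu` module API. Each thread owns one slot of `counts` and
# increments it by 1 when its number is prime; the slot accumulates across
# chunks. The open question is whether `counts` can stay resident on the
# device across launches instead of being transferred at every call.
import gpu

BLOCK = 256
GRID = 1024
CHUNK = BLOCK * GRID  # numbers examined per kernel launch

def is_prime(n: int) -> bool:
    # same trial-division helper as in the CPU sketch above
    if n < 2 or (n % 2 == 0 and n != 2):
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

@gpu.kernel
def tally(counts, lo, hi):
    idx = gpu.block.x * gpu.block.dim.x + gpu.thread.x
    n = lo + idx
    if n < hi and is_prime(n):
        counts[idx] += 1  # one slot per thread, so no data race

counts = [0 for _ in range(CHUNK)]  # pre-allocated outside the loop

N = 10000000
for chunk_id in range((N + CHUNK - 1) // CHUNK):
    lo = chunk_id * CHUNK
    tally(counts, lo, min(lo + CHUNK, N), grid=GRID, block=BLOCK)

print(sum(counts))  # 664579 primes below 10 million
```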