Thank you for making Python fast like C/C++. It's real :) #437
marioroy started this conversation in Show and tell
Replies: 2 comments
-
The parallel C and Codon demonstrations for counting and printing prime numbers live in the demos folder: https://github.com/marioroy/mce-sandbox
-
Xuedong Luo's practical sieve (Algorithm3) works well on the CPU and GPU. Thank you for making Codon. The GPU (NVIDIA GeForce RTX 3070) performs comparably to 12 CPU cores (AMD 3000 series).
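For context, here is a plain odd-only Sieve of Eratosthenes in Codon/Python. It is not Luo's Algorithm3 itself, just a simpler baseline; `count_primes` and `limit` are illustrative names and do not come from the mce-sandbox demos.

```python
# Plain odd-only Sieve of Eratosthenes, shown only as a baseline;
# the mce-sandbox demos use Xuedong Luo's more efficient Algorithm3.
def count_primes(limit: int) -> int:
    if limit < 2:
        return 0
    # is_composite[i] represents the odd number 2*i + 3
    size = (limit - 1) // 2
    is_composite = [False] * size
    count = 1  # account for the prime 2
    i = 0
    while True:
        n = 2 * i + 3
        if n * n > limit:
            break
        if not is_composite[i]:
            # mark odd multiples of n, starting at n*n
            j = (n * n - 3) // 2
            while j < size:
                is_composite[j] = True
                j += n
        i += 1
    for k in range(size):
        if not is_composite[k]:
            count += 1
    return count

print(count_primes(100))  # 25
```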
-
A friend mentioned Codon. I tried a small demo on the CPU and GPU, including C++ for comparison. Specifying the number of threads via the OMP_NUM_THREADS environment variable works too (a sketch of the parallel CPU loop follows the attached file names below). I had to build Codon from source to run on the GPU successfully. On the CPU, the Python solution runs as fast as C++. The performance on the GPU is mind-boggling. The demo does chunking, allowing one to specify a large N (e.g. 1 billion) without worrying about depleting GPU memory.
t_cpu.py
t_cpp.cc
t_gpu.py
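This is not the attached t_cpu.py, only a minimal sketch of the CPU pattern under stated assumptions: Codon's documented `@par` OpenMP annotation with a reduction on `total`, and the thread count taken from OMP_NUM_THREADS when no `num_threads` clause is given. `is_prime` and `N` are illustrative names, not from the demo.

```python
# Minimal Codon sketch (not the attached t_cpu.py): count primes in parallel
# with the @par OpenMP annotation. Without an explicit num_threads clause,
# the OMP_NUM_THREADS environment variable selects the thread count, e.g.
#   OMP_NUM_THREADS=12 ./t_cpu
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

N = 10000000
total = 0

@par(schedule='dynamic', chunk_size=10000)
for i in range(2, N):
    if is_prime(i):
        total += 1  # reduced across threads by @par

print(total)  # 664579 primes below 10 million
```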
I'm pre-allocating an array on the GPU (outside the loop) and retrieving it after the loop completes. The GPU threads increment their array elements by 1.
Is Codon able to determine from the code that I prefer to do the host-to-device transfer only once, for chunk_id == 0, since the array is constructed outside the loop? Ditto for sum (i.e. lazily transfer the memory from the GPU back to the host only after the loop). This is what I was aiming for.
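Likewise, this is not the attached t_gpu.py, only a minimal sketch of the chunked pattern described above, assuming Codon's documented gpu module (@gpu.kernel, gpu.thread.x, gpu.block.x, gpu.block.dim.x, and grid=/block= launch arguments). BLOCK, CHUNK, and the tally kernel are illustrative names; whether the counts list can stay resident on the device across launches, rather than being transferred at every call, is exactly the question above.

```python
# Minimal Codon GPU sketch (not the attached t_gpu.py), assuming the
# documented `gpu` module API. Each thread owns one slot of `counts` and
# increments it by 1 when its number is prime; the slot accumulates across
# chunks. The open question is whether `counts` can stay resident on the
# device across launches instead of being transferred at every call.
import gpu

BLOCK = 256
GRID = 1024
CHUNK = BLOCK * GRID  # numbers examined per kernel launch

def is_prime(n: int) -> bool:
    # same trial-division helper as in the CPU sketch above
    if n < 2 or (n % 2 == 0 and n != 2):
        return False
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

@gpu.kernel
def tally(counts, lo, hi):
    idx = gpu.block.x * gpu.block.dim.x + gpu.thread.x
    n = lo + idx
    if n < hi and is_prime(n):
        counts[idx] += 1  # one slot per thread, so no data race

counts = [0 for _ in range(CHUNK)]  # pre-allocated outside the loop

N = 10000000
for chunk_id in range((N + CHUNK - 1) // CHUNK):
    lo = chunk_id * CHUNK
    tally(counts, lo, min(lo + CHUNK, N), grid=GRID, block=BLOCK)

print(sum(counts))  # 664579 primes below 10 million
```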