In this tutorial, we delve into CuPy as a powerful GPU-accelerated alternative to NumPy for high-performance numerical computing in Python. We start by inspecting the available CUDA device, checking the CuPy version, runtime details, GPU memory, and compute capability so that we understand the hardware environment before running heavy computations. Then, we compare NumPy and CuPy on large matrix multiplication and FFT workloads to see how GPU acceleration changes execution speed. Also, we work with memory pools, custom elementwise kernels, reduction kernels, raw CUDA kernels, CUDA streams, sparse matrices, dense linear solvers, GPU image processing, DLPack interoperability, event-based profiling, cupyx.jit, and kernel fusion. Through these examples, we build a practical understanding of how CuPy lets us write familiar Python code while still accessing advanced CUDA-level performance features.
import sys, time, subprocess
try:
    import cupy as cp
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "cupy-cuda12x"])
    import cupy as cp
import numpy as np
import matplotlib.pyplot as plt
from cupyx.scipy import sparse as cps
from cupyx.scipy import ndimage as cdi
from cupyx import jit
def header(t): print("\n" + "="*64 + f"\n{t}\n" + "="*64)
def bench(fn, *args, n=5, warmup=2, gpu=True):
    for _ in range(warmup):
        fn(*args)
    if gpu: cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        fn(*args)
    if gpu: cp.cuda.Stream.null.synchronize()
    return (time.perf_counter() - t0) / n
header("1. GPU INTROSPECTION")
props = cp.cuda.runtime.getDeviceProperties(0)
print(f"CuPy version : {cp.__version__}")
print(f"CUDA runtime : {cp.cuda.runtime.runtimeGetVersion()}")
print(f"Device : {props['name'].decode()}")
print(f"Compute capability : {props['major']}.{props['minor']}")
print(f"SMs : {props['multiProcessorCount']}")
print(f"Global memory : {props['totalGlobalMem']/1e9:.2f} GB")
header("2. NUMPY vs CUPY BENCHMARK")
N = 4096
A_np = np.random.rand(N, N).astype(np.float32)
B_np = np.random.rand(N, N).astype(np.float32)
A_cp, B_cp = cp.asarray(A_np), cp.asarray(B_np)
t_np = bench(np.matmul, A_np, B_np, n=2, gpu=False)
t_cp = bench(cp.matmul, A_cp, B_cp, n=3, gpu=True)
print(f"Matmul {N}x{N} NumPy={t_np*1000:7.1f} ms CuPy={t_cp*1000:7.1f} ms ({t_np/t_cp:.1f}x)")
x_np = np.random.rand(2**21).astype(np.complex64)
x_cp = cp.asarray(x_np)
t_np = bench(np.fft.fft, x_np, n=3, gpu=False)
t_cp = bench(cp.fft.fft, x_cp, n=5, gpu=True)
print(f"FFT 2^21 NumPy={t_np*1000:7.1f} ms CuPy={t_cp*1000:7.1f} ms ({t_np/t_cp:.1f}x)")
We begin by setting up CuPy, NumPy, Matplotlib, sparse utilities, image-processing tools, and JIT support so the tutorial has all the required GPU-computing components. We define helper functions for section headers and reliable benchmarking, then inspect the available CUDA device to understand the GPU environment. We also compare NumPy and CuPy for large matrix multiplication and FFT operations to observe the performance difference between CPU-based and GPU-accelerated computation.
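The warmup-and-average pattern used by `bench` matters on the CPU as well: the first call often pays one-time costs that would skew a single measurement. A minimal CPU-only sketch (no GPU required, names are illustrative) of the same timing discipline:

```python
import time
import numpy as np

def cpu_bench(fn, *args, n=5, warmup=2):
    # Warmup runs absorb one-time costs (allocation, cache warming)
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(n):
        fn(*args)
    # Average over n runs to smooth out jitter
    return (time.perf_counter() - t0) / n

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
t = cpu_bench(np.matmul, a, b, n=3)
print(f"512x512 matmul: {t*1000:.3f} ms")
```

For GPU code the extra ingredient is the `cp.cuda.Stream.null.synchronize()` call: CuPy kernel launches are asynchronous, so without synchronizing you would only time the launch, not the computation.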
header("3. MEMORY POOL")
pool = cp.get_default_memory_pool()
pinned = cp.get_default_pinned_memory_pool()
print(f"Used : {pool.used_bytes()/1e6:8.2f} MB")
print(f"Total : {pool.total_bytes()/1e6:8.2f} MB")
del A_cp, B_cp, x_cp
pool.free_all_blocks(); pinned.free_all_blocks()
print(f"After free_all_blocks → Used: {pool.used_bytes()/1e6:.2f} MB")
header("4. ELEMENTWISE KERNEL")
robust_norm = cp.ElementwiseKernel(
    in_params ='float32 x, float32 y, float32 eps',
    out_params='float32 z',
    operation ='z = sqrtf((x - y)*(x - y) + eps)',
    name      ='robust_norm')
x = cp.random.rand(2_000_000, dtype=cp.float32)
y = cp.random.rand(2_000_000, dtype=cp.float32)
z = robust_norm(x, y, cp.float32(1e-6))
print(f"Output shape={z.shape} mean={float(z.mean()):.5f}")
header("5. REDUCTION KERNEL — L2 NORM")
l2 = cp.ReductionKernel(
    in_params    ='T x',
    out_params   ='T y',
    map_expr     ='x * x',
    reduce_expr  ='a + b',
    post_map_expr='y = sqrt(a)',
    identity     ='0',
    name         ='l2norm')
v = cp.random.rand(5_000_000, dtype=cp.float32)
print(f"Custom : {float(l2(v)):.6f}")
print(f"cupy : {float(cp.linalg.norm(v)):.6f}")
We examine CuPy’s memory pool to understand how GPU memory is allocated, reused, and released during execution. We then create a custom ElementwiseKernel to perform a per-element robust distance calculation directly on the GPU. After that, we define a custom ReductionKernel for L2 norm computation and compare its result with CuPy’s built-in linear algebra function.
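The two custom kernels have simple mathematical definitions that can be sanity-checked on the CPU. A pure-NumPy reference (no GPU needed) for what `robust_norm` and `l2norm` compute:

```python
import numpy as np

x = np.random.rand(1000).astype(np.float32)
y = np.random.rand(1000).astype(np.float32)

# Reference for the robust_norm ElementwiseKernel: sqrt((x - y)^2 + eps) per element
eps = np.float32(1e-6)
z_ref = np.sqrt((x - y) ** 2 + eps)

# Reference for the l2norm ReductionKernel:
# map x -> x*x, reduce with a + b, post-map y = sqrt(a)
l2_manual = np.sqrt(np.sum(x * x))
print(z_ref.shape, float(l2_manual), float(np.linalg.norm(x)))
```

This mirrors the kernel structure exactly: the elementwise kernel applies one expression per element, while the reduction kernel is a map (`x * x`), a binary reduce (`a + b`), and a post-map (`sqrt`).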
header("6. RAW CUDA KERNEL — MANDELBROT")
mandel = cp.RawKernel(r'''
extern "C" __global__
void mandel(float xmin, float xmax, float ymin, float ymax,
            int W, int H, int max_iter, int* out) {
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix >= W || iy >= H) return;
    float cx = xmin + (xmax - xmin) * ix / (W - 1);
    float cy = ymin + (ymax - ymin) * iy / (H - 1);
    float zx = 0.f, zy = 0.f;
    int it = 0;
    while (zx*zx + zy*zy < 4.f && it < max_iter) {
        float t = zx*zx - zy*zy + cx;
        zy = 2.f*zx*zy + cy;
        zx = t; ++it;
    }
    out[iy*W + ix] = it;
}
''', 'mandel')
W, H, ITER = 1024, 1024, 400
img = cp.zeros((H, W), dtype=cp.int32)
threads = (16, 16)
blocks = ((W + 15)//16, (H + 15)//16)
mandel(blocks, threads,
       (cp.float32(-2.0), cp.float32(1.0),
        cp.float32(-1.5), cp.float32(1.5),
        cp.int32(W), cp.int32(H), cp.int32(ITER), img))
cp.cuda.Stream.null.synchronize()
print(f"Mandelbrot done. max iter reached={int(img.max())}")
plt.figure(figsize=(6,6))
plt.imshow(cp.asnumpy(cp.log1p(img)), cmap='twilight_shifted', extent=[-2,1,-1.5,1.5])
plt.title("Mandelbrot set — computed with a CuPy RawKernel")
plt.axis('off'); plt.show()
header("7. CUDA STREAMS")
s1, s2 = cp.cuda.Stream(non_blocking=True), cp.cuda.Stream(non_blocking=True)
with s1:
    a1 = cp.random.rand(2000, 2000, dtype=cp.float32)
    b1 = cp.random.rand(2000, 2000, dtype=cp.float32)
    c1 = a1 @ b1
with s2:
    a2 = cp.random.rand(2000, 2000, dtype=cp.float32)
    b2 = cp.random.rand(2000, 2000, dtype=cp.float32)
    c2 = a2 @ b2
s1.synchronize(); s2.synchronize()
print(f"Stream-1 mean={float(c1.mean()):.4f}")
print(f"Stream-2 mean={float(c2.mean()):.4f}")
We use a raw CUDA C kernel through CuPy’s RawKernel interface to compute the Mandelbrot set directly on the GPU. We launch the kernel with custom thread and block dimensions, synchronize execution, and visualize the resulting fractal using Matplotlib. We also explore CUDA streams by running two independent matrix multiplications concurrently and checking the output means from both streams.
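The escape-time loop inside the RawKernel is easy to verify for individual points in plain Python. This CPU sketch reproduces the kernel's per-pixel logic (z ← z² + c until |z|² ≥ 4 or the iteration cap is hit):

```python
def mandel_point(cx, cy, max_iter=400):
    # Same escape-time iteration as the CUDA kernel body
    zx = zy = 0.0
    it = 0
    while zx * zx + zy * zy < 4.0 and it < max_iter:
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        it += 1
    return it

# c = 0 never escapes (inside the set); c = 2 escapes after one iteration
print(mandel_point(0.0, 0.0), mandel_point(2.0, 0.0))  # 400 1
```

On the GPU, each thread runs this loop for one pixel, with `(ix, iy)` mapped to a point `c` in the complex plane; the `(W + 15)//16` block count is just ceil division so the 16x16 thread blocks cover the whole image.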
header("8. SPARSE LINEAR ALGEBRA")
N, density = 8000, 5e-4
nnz = int(N*N*density)
data = cp.random.rand(nnz, dtype=cp.float32)
rows = cp.random.randint(0, N, nnz)
cols = cp.random.randint(0, N, nnz)
A_sp = cps.csr_matrix((data, (rows, cols)), shape=(N, N))
xv = cp.random.rand(N, dtype=cp.float32)
print(f"NNZ : {A_sp.nnz}")
print(f"Sparse matvec : {bench(lambda: A_sp @ xv)*1000:.3f} ms")
A_dense = A_sp.toarray()
print(f"Dense matvec : {bench(lambda: A_dense @ xv)*1000:.3f} ms")
header("9. LINEAR SYSTEM Ax = b")
N = 2000
M = cp.random.rand(N, N, dtype=cp.float32)
A = M @ M.T + N * cp.eye(N, dtype=cp.float32)
b = cp.random.rand(N, dtype=cp.float32)
x_sol = cp.linalg.solve(A, b)
res = cp.linalg.norm(A @ x_sol - b) / cp.linalg.norm(b)
print(f"Solved {N}x{N} SPD system. Relative residual = {float(res):.2e}")
header("10. GAUSSIAN FILTER ON GPU")
big = cp.random.rand(4096, 4096, dtype=cp.float32)
t = bench(cdi.gaussian_filter, big, 5.0, n=3)
print(f"4096x4096 Gaussian σ=5 → {t*1000:.2f} ms")
header("11. INTEROP & ZERO-COPY (DLPack)")
g = cp.arange(8, dtype=cp.float32)
h = cp.asnumpy(g)
back = cp.asarray(h)
dl = g.toDlpack()
restored = cp.from_dlpack(dl)
print(f"NumPy view : {h}")
print(f"DLPack RT : {restored}")
We exercise sparse linear algebra by generating a random CSR matrix and comparing sparse matrix-vector multiplication against its dense equivalent. We then solve a large symmetric positive definite linear system using CuPy’s dense linear algebra tools and verify the solution through a relative residual. Finally, we apply a Gaussian filter to a large image-like array on the GPU and demonstrate interoperability between NumPy, CuPy, and DLPack for data movement and zero-copy exchange.
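To make the CSR layout concrete, here is a hand-rolled CSR matvec in plain NumPy (an illustrative CPU sketch with a made-up 3x3 example; `cupyx.scipy.sparse` performs the same operation with optimized GPU kernels):

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    # Row i owns the slice indptr[i]:indptr[i+1] of data/indices;
    # y[i] = sum over that slice of data[k] * x[indices[k]]
    y = np.zeros(len(indptr) - 1, dtype=x.dtype)
    for i in range(len(y)):
        s, e = indptr[i], indptr[i + 1]
        y[i] = data[s:e] @ x[indices[s:e]]
    return y

# The matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form
data    = np.array([1., 2., 3., 4., 5.])
indices = np.array([0, 2, 1, 0, 2])
indptr  = np.array([0, 2, 3, 5])
x = np.array([1., 1., 1.])
print(csr_matvec(data, indices, indptr, x))  # [3. 3. 9.]
```

Only the nonzeros are stored and touched, which is why the sparse matvec in section 8 beats the dense one at 0.05% density.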
header("12. CUDA EVENTS")
A = cp.random.rand(4000, 4000, dtype=cp.float32)
B = cp.random.rand(4000, 4000, dtype=cp.float32)
e0, e1 = cp.cuda.Event(), cp.cuda.Event()
e0.record(); C = A @ B; e1.record(); e1.synchronize()
print(f"4000x4000 matmul = {cp.cuda.get_elapsed_time(e0, e1):.3f} ms (CUDA events)")
header("13. cupyx.jit — SAXPY")
@jit.rawkernel()
def saxpy(a, x, y, out, n):
    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    if tid < n:
        out[tid] = a * x[tid] + y[tid]
n = 2_000_000
xv = cp.random.rand(n, dtype=cp.float32)
yv = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(xv)
TPB = 256
blocks = (n + TPB - 1) // TPB
saxpy((blocks,), (TPB,), (cp.float32(2.5), xv, yv, out, n))
print("Correctness:", bool(cp.allclose(out, 2.5*xv + yv)))
header("14. KERNEL FUSION with @cp.fuse")
@cp.fuse()
def fused(x, y, z):
    return cp.sqrt(x*x + y*y + z*z) * cp.exp(-0.5*(x+y+z))
def unfused(x, y, z):
    return cp.sqrt(x*x + y*y + z*z) * cp.exp(-0.5*(x+y+z))
n = 4_000_000
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
z = cp.random.rand(n, dtype=cp.float32)
fused(x, y, z)
t1 = bench(unfused, x, y, z)
t2 = bench(fused, x, y, z)
print(f"Unfused : {t1*1e3:6.3f} ms")
print(f"Fused : {t2*1e3:6.3f} ms (speedup {t1/t2:.2f}x)")
print("\n" + "="*64)
print("DONE — explore: cupy.linalg, cupyx.scipy.signal, cupy.cuda.Graph")
print("="*64)
We profile a large GPU matrix multiplication using CUDA events to obtain accurate device-side timing. We then write a SAXPY kernel with cupyx.jit, launch it manually, and verify its correctness against the equivalent CuPy expression. Also, we use @cp.fuse to combine multiple array operations into a fused kernel and compare its speed with the unfused version.
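The SAXPY launch relies on ceil division to pick enough blocks of `TPB` threads to cover all `n` elements, which is also why the kernel needs the `tid < n` guard. A small CPU sketch of that arithmetic and of the operation each thread performs:

```python
import numpy as np

def grid_size(n, tpb):
    # Ceil division: smallest block count so that grid_size * tpb >= n
    return (n + tpb - 1) // tpb

# The last block is usually partially full, hence the tid < n bounds check
assert grid_size(256, 256) == 1
assert grid_size(257, 256) == 2
assert grid_size(2_000_000, 256) * 256 >= 2_000_000

# CPU reference for out = a*x + y, the per-thread operation of the jit kernel
a = np.float32(2.5)
x = np.random.rand(1000).astype(np.float32)
y = np.random.rand(1000).astype(np.float32)
out = a * x + y
print("grid/block arithmetic and SAXPY reference OK")
```

Fusion in section 14 attacks a different overhead: the unfused expression launches one kernel per intermediate (squares, sums, `sqrt`, `exp`, multiply), each reading and writing global memory, while `@cp.fuse` compiles the whole expression into a single kernel.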
In conclusion, we gained a complete hands-on overview of CuPy’s advanced GPU computing capabilities. We learned how to benchmark GPU operations correctly, manage GPU memory, create custom kernels, run concurrent CUDA streams, process sparse and dense numerical problems, and apply GPU acceleration to image filtering and scientific workloads. We also explored interoperability through NumPy and DLPack, profiled operations using CUDA events, and improved performance with JIT kernels and fused computations. Along the way, we saw how CuPy provides NumPy-like syntax while letting us drop down to CUDA-level programming when we need more control, speed, and scalability for real-world numerical and scientific computing tasks.
The post A Coding Implementation to Master GPU Computing with CuPy, Custom CUDA Kernels, Streams, Sparse Matrices, and Profiling appeared first on MarkTechPost.
