
Image by Author | Ideogram
GPUs are great for tasks where you need to perform the same operation across different pieces of data. This is known as the Single Instruction, Multiple Data (SIMD) approach. Unlike CPUs, which only have a few powerful cores, GPUs have thousands of smaller ones that can run these repetitive operations all at once. You will see this pattern a lot in machine learning, for example when adding or multiplying large vectors, because each calculation is independent. That is the perfect scenario for using GPUs to speed up tasks with parallelism.
NVIDIA created CUDA as a way for developers to write programs that run on the GPU instead of the CPU. It is based on C and lets you write special functions called kernels that can run many operations at the same time. The problem is that writing CUDA in C or C++ isn't exactly beginner-friendly. You have to deal with things like manual memory allocation, thread coordination, and understanding how the GPU works at a low level. This can be overwhelming, especially if you're used to writing code in Python.
This is where Numba can help. It lets you write CUDA kernels in Python, using the LLVM (Low Level Virtual Machine) compiler infrastructure to compile your Python code directly into CUDA-compatible kernels. With just-in-time (JIT) compilation, you can annotate your functions with a decorator, and Numba handles everything else for you.
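As a quick illustration of that decorator-driven workflow, here is a minimal CPU-side sketch using Numba's `@njit` decorator (the CUDA equivalent, `@cuda.jit`, is what we will use later in this article):

from numba import njit
import numpy as np

@njit  # Numba compiles this function to machine code on its first call
def scale(values, factor):
    return values * factor

print(scale(np.arange(5, dtype=np.float32), 2.0))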
In this article, we'll take the common example of vector addition and convert simple CPU code into a CUDA kernel with Numba. Vector addition is an ideal example of parallelism, as the addition at a single index is independent of all other indices. This is the perfect SIMD scenario, so all indices can be added simultaneously to complete the vector addition in a single operation.
Note that you will need a CUDA GPU to follow this article. You can use Colab's free T4 GPU or a local GPU with the NVIDIA toolkit and NVCC installed.
# Setting Up the Environment and Installing Numba
Numba is available as a Python package, and you can install it with pip. We will also use NumPy for vector operations. Set up the Python environment using the following commands:
python3 -m venv venv
source venv/bin/activate
pip install numba-cuda numpy
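Before going further, you may want to confirm that Numba can actually see a CUDA device. A minimal check (assuming the install above succeeded) looks like this:

from numba import cuda

# Prints False if no CUDA driver or GPU is visible to Numba
print("CUDA available:", cuda.is_available())
if cuda.is_available():
    cuda.detect()  # lists the detected GPUs and their compute capability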
# Vector Addition on the CPU
Let's take a simple example of vector addition. For two given vectors, we add the corresponding values at each index to get the final value. We'll use NumPy to generate random float32 vectors and compute the final output using a for loop.
import numpy as np
N = 10_000_000  # 10 million elements
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.zeros_like(a) # Output array
def vector_add_cpu(a, b, c):
    """Add two vectors on the CPU"""
    for i in range(len(a)):
        c[i] = a[i] + b[i]
Here’s a breakdown of the code:
- Initialize two vectors, each with 10 million random floating-point numbers
- We also create an empty vector `c` to store the result
- The `vector_add_cpu` function simply loops through each index and adds the elements from `a` and `b`, storing the result in `c`
This is a serial operation; each addition happens one after another. While this works fine, it is not the most efficient approach, especially for large datasets. Since each addition is independent of the others, this is a perfect candidate for parallel execution on a GPU.
In the next section, you will see how to convert this same operation to run on the GPU using Numba. By distributing each element-wise addition across thousands of GPU threads, we can complete the task significantly faster.
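The timing comparison at the end of this article calls a `time_cpu()` helper. Here is a minimal wrapper around `vector_add_cpu` for that purpose, mirroring the `time_gpu()` helper defined in the next section:

def time_cpu():
    # Allocate a fresh output array and run the serial CPU addition
    c_cpu = np.zeros_like(a)
    vector_add_cpu(a, b, c_cpu)
    return c_cpu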
# Vector Addition on the GPU with Numba
You will now use Numba to define a Python function that can run on CUDA, and execute it from within Python. We are performing the same vector addition operation, but now it can run in parallel for each index of the NumPy array, leading to faster execution.
Here is the code for writing the kernel:
from numba import config

# Required for newer CUDA versions to enable the linking tools.
# Prevents CUDA toolkit and NVCC version mismatches.
config.CUDA_ENABLE_PYNVJITLINK = 1

from numba import cuda, float32

@cuda.jit
def vector_add_gpu(a, b, c):
    """Add two vectors using a CUDA kernel"""
    # Thread ID within the current block
    tx = cuda.threadIdx.x
    # Block ID within the grid
    bx = cuda.blockIdx.x
    # Block width (number of threads per block)
    bw = cuda.blockDim.x

    # Calculate the unique thread position
    position = tx + bx * bw

    # Make sure we don't go out of bounds
    if position < len(a):
        c[position] = a[position] + b[position]

def gpu_add(a, b, c):
    # Define the grid and block dimensions
    threads_per_block = 256
    blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

    # Copy data to the device
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_c = cuda.to_device(c)

    # Launch the kernel
    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

    # Copy the result back to the host
    d_c.copy_to_host(c)

def time_gpu():
    c_gpu = np.zeros_like(a)
    gpu_add(a, b, c_gpu)
    return c_gpu
Let's break down what is happening above.
// Understanding the GPU Function
The `@cuda.jit` decorator tells Numba to treat the following function as a CUDA kernel: a special function that will run in parallel across many threads on the GPU. At runtime, Numba compiles this function to CUDA-compatible code and handles the C-API transpilation for you.
@cuda.jit
def vector_add_gpu(a, b, c):
    ...
This function will run on thousands of threads at the same time, but we need a way to figure out which part of the data each thread should work on. That is what the next few lines do:
- `tx` is the thread's ID within its block
- `bx` is the block's ID within the grid
- `bw` is how many threads there are in a block
We combine these to calculate a unique position, which tells each thread which element of the arrays it should add. Note that the threads and blocks will not always yield a valid index, as they operate in powers of two. This can produce indices beyond the end of the array when the vector length does not match the underlying layout. Therefore, we add a guard condition to validate the index before we perform the addition. This prevents any out-of-bounds runtime error when accessing the array.
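As a side note, Numba also exposes a convenience helper, `cuda.grid(1)`, that computes this absolute index for you. A sketch of the same kernel using it (equivalent to the manual calculation, as far as I'm aware) would look like this:

from numba import cuda

@cuda.jit
def vector_add_gpu_short(a, b, c):
    # cuda.grid(1) == cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    position = cuda.grid(1)
    if position < len(a):
        c[position] = a[position] + b[position]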
Once we know the unique position, we can add the values just like we did in the CPU implementation. The following line matches the CPU version:
c[position] = a[position] + b[position]
// Launching the Kernel
The `gpu_add` function sets things up:
- It defines how many threads and blocks to use. You can experiment with different block and thread sizes and print the corresponding values inside the GPU kernel. This can help you understand how the underlying GPU indexing works.
- It copies the input arrays (`a`, `b`, and `c`) from CPU memory to GPU memory, so the vectors are accessible in GPU RAM (see the sketch after this list for a variant that skips the explicit copies).
- It runs the GPU kernel with `vector_add_gpu[blocks_per_grid, threads_per_block]`.
- Finally, it copies the result back from the GPU into the `c` array, so we can access the values on the CPU.
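For quick experiments, you can also launch the kernel on the NumPy arrays directly. This is a sketch under the assumption that Numba transfers host arrays to the device and back for you in that case (it typically emits a performance warning about the implicit transfers):

threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

# Numba handles the host-to-device and device-to-host copies implicitly here
vector_add_gpu[blocks_per_grid, threads_per_block](a, b, c)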
# Comparing the Implementations and Potential Speedup
Now that we have both the CPU and GPU versions of vector addition, it is time to see how they compare. It is important to verify the results and the execution boost we can get from CUDA parallelism.
import timeit

c_cpu = time_cpu()
c_gpu = time_gpu()
print("Results match:", np.allclose(c_cpu, c_gpu))

cpu_time = timeit.timeit("time_cpu()", globals=globals(), number=3) / 3
print(f"CPU implementation: {cpu_time:.6f} seconds")

gpu_time = timeit.timeit("time_gpu()", globals=globals(), number=3) / 3
print(f"GPU implementation: {gpu_time:.6f} seconds")

speedup = cpu_time / gpu_time
print(f"GPU speedup: {speedup:.2f}x")
First, we run both implementations and check whether their results match. This is important to confirm that our GPU code is working correctly and that its output is the same as the CPU's.
Next, we use Python's built-in `timeit` module to measure how long each version takes. We run each function several times and take the average to get a reliable timing. Finally, we calculate how many times faster the GPU version is compared to the CPU. You should see a big difference, because the GPU can do many operations at once, while the CPU handles them one at a time in a loop.
Here is the expected output on NVIDIA's T4 GPU on Colab. Note that the exact speedup can vary based on CUDA versions and the underlying hardware.
Results match: True
CPU implementation: 4.033822 seconds
GPU implementation: 0.047736 seconds
GPU speedup: 84.50x
This simple test demonstrates the power of GPU acceleration and why it is so useful for tasks involving large amounts of data and parallel work.
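Keep in mind that `gpu_add()` includes the host-to-device and device-to-host copies, so the GPU timing above measures the full round trip. If you want a rough idea of how long the kernel itself takes, one approach is to copy the data once, then time only the launch followed by `cuda.synchronize()`, since kernel launches are asynchronous. A sketch:

from numba import cuda
import timeit

threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

# Copy the inputs once, outside the timed region
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array_like(a)

def kernel_only():
    vector_add_gpu[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
    cuda.synchronize()  # wait for the asynchronous launch to finish

kernel_only()  # warm-up call so JIT compilation is not included in the timing
kernel_time = timeit.timeit(kernel_only, number=10) / 10
print(f"Kernel-only time: {kernel_time:.6f} seconds")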
# Wrapping Up
And that's it. You have now written your first CUDA kernel with Numba, without actually writing any C or CUDA code. Numba provides a simple interface for using the GPU through Python, and it makes it much easier for Python engineers to get started with CUDA programming.
You can now use the same template to write more advanced CUDA algorithms, which are prevalent in machine learning and deep learning. If you find a problem that follows the SIMD paradigm, it is always a good idea to use the GPU to speed up execution.
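For example, the same indexing-and-guard template carries over to other element-wise operations. Here is a hedged sketch of a SAXPY-style kernel (computing `c = alpha * a + b`) that reuses the exact pattern from `vector_add_gpu`:

from numba import cuda

@cuda.jit
def saxpy_gpu(alpha, a, b, c):
    # Same unique-thread-index and bounds-guard pattern as vector_add_gpu
    position = cuda.grid(1)
    if position < len(a):
        c[position] = alpha * a[position] + b[position]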
The complete code is available in a Colab notebook that you can access here. Feel free to test it out and make small modifications to get a better understanding of how CUDA indexing and execution work internally.
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the book "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.