I was working on a script the other day, and it was driving me nuts. It worked, sure, but it was just… slow. Really slow. I had that feeling that this could be much faster if I could just figure out where the hold-up was.
My first thought was to start tweaking things. I could optimise the data loading. Or rewrite that for loop? But I stopped myself. I’ve fallen into that trap before, spending hours “optimising” a piece of code only to find it made barely any difference to the overall runtime. Donald Knuth had a point when he said, “Premature optimisation is the root of all evil.”
I decided to take a more methodical approach. Instead of guessing, I was going to find out for sure. I needed to profile the code to obtain hard data on exactly which functions were consuming the majority of the clock cycles.
In this article, I’ll walk you through the exact process I used. We’ll take a deliberately slow Python script and use two fantastic tools to pinpoint its bottlenecks with surgical precision.
The first of these tools is called cProfile, a powerful profiler built into Python. The other is called snakeviz, a clever tool that transforms the profiler’s output into an interactive visual map.
Setting up a development environment
Before we start coding, let’s set up our development environment. Best practice is to create a separate Python environment where you can install any necessary software and experiment, knowing that anything you do won’t impact the rest of your system. I’ll be using conda for this, but you can use any method with which you’re familiar.
# create our test environment
conda create -n profiling_lab python=3.11 -y
# Now activate it
conda activate profiling_lab
Now that we have the environment set up, we need to install snakeviz for our visualisations and numpy for the example script. cProfile is already included with Python, so there’s nothing more to do there. As we’ll be running our scripts in a Jupyter Notebook, we’ll also install that.
# Install our visualisation tool and numpy
pip install snakeviz numpy jupyter
Now type jupyter notebook into your command prompt. You should see a Jupyter notebook open in your browser. If that doesn’t happen automatically, you’ll likely see a screenful of information after the jupyter notebook command. Near the bottom of that, there will be a URL that you should copy and paste into your browser to launch the Jupyter Notebook.
Your URL will be different to mine, but it should look something like this:
http://127.0.0.1:8888/tree?token=3b9f7bd07b6966b41b68e2350721b2d0b6f388d248cc69da
With our tools ready, it’s time to look at the code we’re going to fix.
Our “Problem” Script
To properly test our profiling tools, we need a script that exhibits clear performance issues. I’ve written a simple program that simulates processing problems with memory, iteration and CPU cycles, making it an ideal candidate for our investigation.
# run_all_systems.py
import time
import math

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Task 1: A Calibrated CPU-Bound Bottleneck ---
def cpu_heavy_task(iterations):
    print(" -> Running CPU-bound task...")
    result = 0
    for i in range(iterations):
        result += math.sin(i) * math.cos(i) + math.sqrt(i)
    return result

# --- Task 2: A Calibrated Memory/String Bottleneck ---
def memory_heavy_string_task(iterations):
    print(" -> Running Memory/String-bound task...")
    report = ""
    chunk = "report_item_abcdefg_123456789_"
    for i in range(iterations):
        report += f"|{chunk}{i}"
    return report

# --- Task 3: A Calibrated "Thousand Cuts" Iteration Bottleneck ---
def simulate_tiny_op(n):
    pass

def iteration_heavy_task(iterations):
    print(" -> Running Iteration-bound task...")
    for i in range(iterations):
        simulate_tiny_op(i)
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    print("--- Starting FINAL SLOW Balanced Showcase ---")
    cpu_result = cpu_heavy_task(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task(iterations=LOOP_ITERATIONS)
    print("--- FINAL SLOW Balanced Showcase Finished ---")
Step 1: Collecting the Data with cProfile
Our first tool, cProfile, is a deterministic profiler built into Python. We can run it from code to execute our script and record detailed statistics about every function call.
import cProfile, pstats, io

pr = cProfile.Profile()
pr.enable()

# Run the function you want to profile
run_all_systems()

pr.disable()

# Dump stats to a string and print the top 10 by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())
Here is the output.
--- Starting FINAL SLOW Balanced Showcase ---
 -> Running CPU-bound task...
 -> Running Memory/String-bound task...
 -> Running Iteration-bound task...
--- FINAL SLOW Balanced Showcase Finished ---
         275455984 function calls in 30.497 seconds

   Ordered by: cumulative time
   List reduced from 47 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   30.520   15.260 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
        2    0.000    0.000   30.520   15.260 {built-in method builtins.exec}
        1    0.000    0.000   30.497   30.497 /tmp/ipykernel_173802/1743829582.py:41(run_all_systems)
        1    9.652    9.652   14.394   14.394 /tmp/ipykernel_173802/1743829582.py:34(iteration_heavy_task)
        1    7.232    7.232   12.211   12.211 /tmp/ipykernel_173802/1743829582.py:14(cpu_heavy_task)
171796964    4.742    0.000    4.742    0.000 /tmp/ipykernel_173802/1743829582.py:31(simulate_tiny_op)
        1    3.891    3.891    3.892    3.892 /tmp/ipykernel_173802/1743829582.py:22(memory_heavy_string_task)
 34552942    1.888    0.000    1.888    0.000 {built-in method math.sin}
 34552942    1.820    0.000    1.820    0.000 {built-in method math.cos}
 34552942    1.271    0.000    1.271    0.000 {built-in method math.sqrt}
We’re left with a wall of numbers that can be difficult to interpret. (As a quick key: tottime is time spent inside the function itself, excluding sub-calls, while cumtime includes everything it calls.) This is where snakeviz comes into its own.
Step 2: Visualising the bottleneck with snakeviz
This is where the magic happens. Snakeviz takes the output of our profiling run and converts it into an interactive, browser-based chart, making it far easier to find bottlenecks.
So let’s use that tool to visualise what we have. As I’m using a Jupyter Notebook, we need to load the extension first.
%load_ext snakeviz
And we run it like this.
%%snakeviz
run_all_systems()
The output comes in two parts. The first is a visualisation like this.
What you see is a top-down “icicle” chart. From top to bottom, it represents the call hierarchy.
At the very top: Python is executing our script.
Next: the script’s __main__ execution, which calls run_all_systems.
The memory-intensive processing part isn’t labelled on the chart. That’s because the proportion of time associated with this task is far smaller than the time apportioned to the other two intensive functions. As a result, we see a much smaller, unlabelled block to the right of the cpu_heavy_task block.
Note that, for analysis, there is also a Snakeviz chart style called a Sunburst chart. It looks a bit like a pie chart, except that it consists of a set of increasingly large concentric circles and arcs. The idea is that the time taken by each function is represented by the angular extent of its arc. The root function is the circle in the middle of the viz; it runs by calling the sub-functions beneath it, and so on. We won’t be using that display style in this article.
Visual confirmation like this can be far more impactful than staring at a table of numbers. I didn’t have to guess any more where to look; the data was staring me right in the face.
The visualisation is followed by a block of text detailing the timings for various parts of your code, much like the output of the cProfile tool. I’m only showing the first dozen or so lines of this, as there were 30+ in total.
   ncalls    tottime    percall    cumtime    percall  filename:lineno(function)
----------------------------------------------------------------
        1      9.581      9.581       14.3       14.3  1062495604.py:34(iteration_heavy_task)
        1      7.868      7.868      12.92      12.92  1062495604.py:14(cpu_heavy_task)
171796964      4.717  2.745e-08      4.717  2.745e-08  1062495604.py:31(simulate_tiny_op)
        1      3.848      3.848      3.848      3.848  1062495604.py:22(memory_heavy_string_task)
 34552942       1.91  5.527e-08       1.91  5.527e-08  ~:0(<built-in method math.sin>)
 34552942      1.836  5.313e-08      1.836  5.313e-08  ~:0(<built-in method math.cos>)
 34552942      1.305  3.778e-08      1.305  3.778e-08  ~:0(<built-in method math.sqrt>)
        1    0.02127    0.02127      31.09      31.09  <string>:1(<module>)
        4  0.0001764  4.409e-05  0.0001764  4.409e-05  socket.py:626(send)
       10   0.000123   1.23e-05  0.0004568  4.568e-05  iostream.py:655(write)
        4  4.594e-05  1.148e-05  0.0002735  6.838e-05  iostream.py:259(schedule)
...
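Incidentally, you don’t need a notebook to use snakeviz. If your code lives in a standalone script, you can save the profile data to a file with cProfile’s command-line interface and point snakeviz at it. Here’s a minimal sketch, assuming the script is saved as run_all_systems.py and actually calls run_all_systems() when executed (the filename profile_results.prof is just an example):

# Profile the script and write the stats to a file
python -m cProfile -o profile_results.prof run_all_systems.py
# Open an interactive snakeviz chart of the saved profile in your browser
snakeviz profile_results.prof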
Step 3: The Fix
Of course, tools like cProfile and snakeviz don’t tell you how to sort out your performance issues, but now that I knew exactly where the problems were, I could apply targeted fixes.
# final_showcase_fixed_v2.py
import time
import math
import numpy as np

# ===================================================================
CPU_ITERATIONS = 34552942
STRING_ITERATIONS = 46658100
LOOP_ITERATIONS = 171796964
# ===================================================================

# --- Fix 1: Vectorization for the CPU-Bound Task ---
def cpu_heavy_task_fixed(iterations):
    """
    Fixed by using NumPy to perform the complex math on an entire array
    at once, in highly optimized C code instead of a Python loop.
    """
    print(" -> Running CPU-bound task...")
    # Create an array of numbers from 0 to iterations-1
    i = np.arange(iterations, dtype=np.float64)
    # The same calculation, but vectorized, is orders of magnitude faster
    result_array = np.sin(i) * np.cos(i) + np.sqrt(i)
    return np.sum(result_array)

# --- Fix 2: Efficient String Joining ---
def memory_heavy_string_task_fixed(iterations):
    """
    Fixed by using a list comprehension and a single, efficient ''.join() call.
    This avoids creating millions of intermediate string objects.
    """
    print(" -> Running Memory/String-bound task...")
    chunk = "report_item_abcdefg_123456789_"
    # A list comprehension is fast and memory-efficient
    parts = [f"|{chunk}{i}" for i in range(iterations)]
    return "".join(parts)

# --- Fix 3: Eliminating the "Thousand Cuts" Loop ---
def iteration_heavy_task_fixed(iterations):
    """
    Fixed by recognizing the task can be a no-op or a bulk operation.
    In a real-world scenario, you would find a way to avoid the loop entirely.
    Here, we demonstrate the fix by simply removing the unnecessary loop.
    The goal is to show that the cost of the loop itself was the problem.
    """
    print(" -> Running Iteration-bound task...")
    # The fix is to find a bulk operation or eliminate the need for the loop.
    # Since the original function did nothing, the fix is to do nothing, but faster.
    return "OK"

# --- Main Orchestrator ---
def run_all_systems():
    """
    The main orchestrator now calls the FAST versions of the tasks.
    """
    print("--- Starting FINAL FAST Balanced Showcase ---")
    cpu_result = cpu_heavy_task_fixed(iterations=CPU_ITERATIONS)
    string_result = memory_heavy_string_task_fixed(iterations=STRING_ITERATIONS)
    iteration_result = iteration_heavy_task_fixed(iterations=LOOP_ITERATIONS)
    print("--- FINAL FAST Balanced Showcase Finished ---")
Now we can rerun cProfile on our updated code.
import cProfile, pstats, io

pr = cProfile.Profile()
pr.enable()

# Run the function you want to profile
run_all_systems()

pr.disable()

# Dump stats to a string and print the top 10 by cumulative time
s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats("cumtime")
ps.print_stats(10)
print(s.getvalue())
#
# start of output
#
--- Starting FINAL FAST Balanced Showcase ---
 -> Running CPU-bound task...
 -> Running Memory/String-bound task...
 -> Running Iteration-bound task...
--- FINAL FAST Balanced Showcase Finished ---
         197 function calls in 6.063 seconds

   Ordered by: cumulative time
   List reduced from 52 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    6.063    3.031 /home/tom/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3541(run_code)
        2    0.000    0.000    6.063    3.031 {built-in method builtins.exec}
        1    0.002    0.002    6.063    6.063 /tmp/ipykernel_173802/1803406806.py:1(<module>)
        1    0.402    0.402    6.061    6.061 /tmp/ipykernel_173802/3782967348.py:52(run_all_systems)
        1    0.000    0.000    5.152    5.152 /tmp/ipykernel_173802/3782967348.py:27(memory_heavy_string_task_fixed)
        1    4.135    4.135    4.135    4.135 /tmp/ipykernel_173802/3782967348.py:35(<listcomp>)
        1    1.017    1.017    1.017    1.017 {method 'join' of 'str' objects}
        1    0.446    0.446    0.505    0.505 /tmp/ipykernel_173802/3782967348.py:14(cpu_heavy_task_fixed)
        1    0.045    0.045    0.045    0.045 {built-in method numpy.arange}
        1    0.000    0.000    0.014    0.014 <__array_function__ internals>:177(sum)
That’s a fantastic result that demonstrates the power of profiling. We spent our effort on the parts of the code that mattered. To be thorough, I also ran snakeviz on the fixed script.
%%snakeviz
run_all_systems()

The most notable change is the reduction in total runtime, from roughly 30 seconds to roughly 6 seconds. That’s a 5x speedup, achieved by addressing the three main bottlenecks that were visible in the “before” profile.
Let’s look at each one individually.
1. The iteration_heavy_task
Before (The Problem)
In the first image, the large bar on the left, iteration_heavy_task, is the single biggest bottleneck, consuming 14.3 seconds.
- Why was it slow? This task was a classic “death by a thousand cuts.” The function simulate_tiny_op did almost nothing, but it was called millions of times from inside a pure Python for loop. The immense overhead of the Python interpreter starting and stopping a function call repeatedly was the entire source of the slowness.
The Fix
The fixed version, iteration_heavy_task_fixed, recognised that the goal could be achieved without the loop. In our showcase, this meant removing the unnecessary loop entirely. In a real-world application, this would involve finding a single “bulk” operation to replace the iterative one. The sketch below gives a feel for the scale of that per-call overhead.
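To make the “thousand cuts” claim concrete, here’s a small, self-contained micro-benchmark (my own illustrative snippet, not part of the showcase script) that times an empty loop against the same loop calling a no-op function. The exact figures will vary by machine, but the calling version is typically several times slower.

import timeit

def tiny_op(n):
    # Deliberately does nothing, like simulate_tiny_op in the showcase
    pass

def bare_loop(iterations):
    for i in range(iterations):
        pass

def loop_with_calls(iterations):
    for i in range(iterations):
        tiny_op(i)

N = 10_000_000  # far fewer iterations than the showcase, but enough to see the effect
print("bare loop:      ", timeit.timeit(lambda: bare_loop(N), number=1))
print("loop with calls:", timeit.timeit(lambda: loop_with_calls(N), number=1))

The difference you see is pure interpreter overhead: frame creation, argument passing and teardown for every single call.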
After (The Result)
In the second image, the iteration_heavy_task bar is completely gone. It’s now so fast that its runtime is a tiny fraction of a second, invisible on the chart. We successfully eliminated a 14.3-second problem.
2. The cpu_heavy_task
Before (The Problem)
The second major bottleneck, clearly visible as the large orange bar on the right, is cpu_heavy_task, which took 12.9 seconds.
- Why was it slow? Like the iteration task, this function was also limited by the speed of the Python for loop. While the maths operations inside were fast, the interpreter had to process each of the millions of calculations individually, which is highly inefficient for numerical work.
The Fix
The fix was vectorisation using the NumPy library. Instead of using a Python loop, cpu_heavy_task_fixed created a NumPy array and performed all the mathematical operations (np.sqrt, np.sin, etc.) on the entire array at once. These operations are executed in highly optimised, pre-compiled C code, completely bypassing the slow Python interpreter loop. The sanity check after this paragraph shows both the speedup and that the two versions agree.
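If you’d like to convince yourself that the vectorised version really computes the same thing, here’s a quick sanity check (again, an illustrative snippet rather than part of the article’s script) run on a reduced iteration count so the loop version finishes quickly:

import math
import time
import numpy as np

N = 2_000_000  # reduced from the showcase's 34 million iterations

# Pure-Python loop version
start = time.perf_counter()
loop_total = 0.0
for i in range(N):
    loop_total += math.sin(i) * math.cos(i) + math.sqrt(i)
loop_time = time.perf_counter() - start

# Vectorised NumPy version
start = time.perf_counter()
arr = np.arange(N, dtype=np.float64)
numpy_total = float(np.sum(np.sin(arr) * np.cos(arr) + np.sqrt(arr)))
numpy_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s, numpy: {numpy_time:.3f}s")
# The totals should agree to within floating-point rounding
print("results match:", math.isclose(loop_total, numpy_total, rel_tol=1e-6))

Expect the NumPy version to finish in a small fraction of the loop’s time, with both totals agreeing to well within the tolerance.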
After (The Result)
Just like the first bottleneck, the cpu_heavy_task bar has vanished from the “after” diagram. Its runtime dropped from 12.9 seconds to around half a second, as the “after” profile shows.
3. The memory_heavy_string_task
Before (The Problem)
In the first diagram, the memory_heavy_string_task was running, but its runtime was small compared to the other two larger issues, so it was relegated to the small, unlabelled sliver of space on the far right. It was a relatively minor issue.
The Fix
The fix for this task was to replace the inefficient report += “…” string concatenation with a far more efficient technique: building a list of all the string parts and then calling “”.join() a single time at the end. The comparison sketch below shows the difference on a smaller workload.
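Here’s a rough, self-contained comparison of the two approaches (once more, an illustrative snippet rather than part of the showcase). Repeated += can force the growing string to be copied over and over, while join builds the final string in one pass:

import timeit

CHUNK = "report_item_abcdefg_123456789_"
N = 200_000  # a much smaller workload than the showcase's 46 million

def build_with_concat(n):
    report = ""
    for i in range(n):
        report += f"|{CHUNK}{i}"
    return report

def build_with_join(n):
    return "".join(f"|{CHUNK}{i}" for i in range(n))

print("+= concat:", timeit.timeit(lambda: build_with_concat(N), number=5))
print("''.join():", timeit.timeit(lambda: build_with_join(N), number=5))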
After (The Result)
In the second diagram, we see the result of our success. Having eliminated the two 10+ second bottlenecks, memory_heavy_string_task_fixed is now the new dominant bottleneck, accounting for 4.34 seconds of the total 5.22-second runtime.
Snakeviz even lets us look inside this fixed function. The new most significant contributor is the orange bar labelled <listcomp>: the list comprehension that builds the millions of string parts.
Summary
This article provides a hands-on guide to identifying and resolving performance issues in Python code, arguing that developers should use profiling tools to measure performance instead of relying on intuition or guesswork to pinpoint the source of slowdowns.
I demonstrated a methodical workflow using two key tools:
- cProfile: Python’s built-in profiler, used to gather detailed data on function calls and execution times.
- snakeviz: A visualisation tool that turns cProfile’s data into an interactive “icicle” chart, making it easy to see which parts of the code are consuming the most time.
The article uses a case study of a deliberately slow script engineered with three distinct and significant bottlenecks:
- An iteration-bound task: A function called millions of times in a loop, showcasing the performance cost of Python’s function call overhead (“death by a thousand cuts”).
- A CPU-bound task: A for loop performing millions of maths calculations, highlighting the inefficiency of pure Python for heavy numerical work.
- A memory-bound task: A large string built inefficiently using repeated += concatenation.
By analysing the snakeviz output, I pinpointed these three problems and applied targeted fixes.
- The iteration bottleneck was fixed by eliminating the unnecessary loop.
- The CPU bottleneck was resolved with vectorisation using NumPy, which executes mathematical operations in fast, compiled C code.
- The memory bottleneck was fixed by appending string parts to a list and using a single, efficient “”.join() call.
These fixes resulted in a dramatic speedup, reducing the script’s runtime from over 30 seconds to just over 6 seconds. I concluded by demonstrating that, even after the major issues are resolved, the profiler can be used again to identify new, smaller bottlenecks, illustrating that performance tuning is an iterative process guided by measurement.
