
Modern DataFrames in Python: A Hands-On Tutorial with Polars and DuckDB


If you work with Python for data, you've probably experienced the frustration of waiting minutes for a Pandas operation to finish.

At first, everything seems fine, but as your dataset grows and your workflows become more complex, your laptop suddenly feels like it's preparing for lift-off.

A few months ago, I worked on a project analyzing e-commerce transactions with over 3 million rows of data.

It was a fairly interesting experience, but much of the time, I watched simple groupby operations that normally ran in seconds suddenly stretch into minutes.

At that point, I realized Pandas is wonderful, but it isn't always enough.

This article explores modern alternatives to Pandas, including Polars and DuckDB, and examines how they can simplify and improve the handling of large datasets.

For clarity, let me be upfront about a few things before we begin.

This article is not a deep dive into Rust memory management or a proclamation that Pandas is obsolete.

Instead, it's a practical, hands-on guide. You will see real examples, personal experiences, and actionable insights into workflows that can save you time and sanity.


Why Pandas Can Feel Slow

Back when I was on the e-commerce project, I remember working with CSV files over two gigabytes, where every filter or aggregation in Pandas often took several minutes to complete.

During that time, I'd stare at the screen, wishing I could just grab a coffee or binge a few episodes of a show while the code ran.

The main pain points I encountered were speed, memory, and workflow complexity.

Large CSV files eat enormous amounts of RAM, sometimes more than my laptop could comfortably handle. On top of that, chaining multiple transformations made the code harder to maintain and slower to execute.

Polars and DuckDB address these challenges in different ways.

Polars, built in Rust, uses multi-threaded execution to process large datasets efficiently.

DuckDB, on the other hand, is designed for analytics and executes SQL queries without requiring you to load everything into memory.

Basically, each of them has its own superpower: Polars is the speedster, and DuckDB is something like the memory magician.

And the best part? Both integrate seamlessly with Python, allowing you to improve your workflows without a full rewrite.
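
To make that concrete, here is a minimal sketch of how the three libraries can pass data back and forth. The tiny DataFrame is just a hypothetical stand-in for whatever you already have in a Pandas workflow:

import duckdb
import pandas as pd
import polars as pl

# A small stand-in for a DataFrame you already have in Pandas
orders_pd = pd.DataFrame({"region": ["Europe", "Asia"], "revenue": [120.0, 95.5]})

# Hand the same data to Polars without rewriting anything upstream
orders_pl = pl.from_pandas(orders_pd)

# DuckDB can query the Pandas DataFrame in place by referencing its variable name
totals = duckdb.query(
    "SELECT region, SUM(revenue) AS total_revenue FROM orders_pd GROUP BY region"
).df()
print(totals)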

Setting Up Your Environment

Before we start coding, make sure your environment is ready. For consistency, I used Pandas 2.2.0, Polars 0.20.0, and DuckDB 1.9.0.

Pinning versions can save you headaches when following tutorials or sharing code.

pip install pandas==2.2.0 polars==0.20.0 duckdb==1.9.0

In Python, import the libraries:

import pandas as pd
import polars as pl
import duckdb
import warnings
warnings.filterwarnings("ignore")

For the examples, I'll use an e-commerce sales dataset with columns such as order ID, product ID, region, country, revenue, and date. You can download similar datasets from Kaggle or generate synthetic data.
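
If you don't have a dataset handy, a small synthetic one is enough to follow along. Here is a rough sketch; the column names mirror the ones used later in this tutorial, while the row count and value ranges are arbitrary:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000  # pick a size your machine is comfortable with

pd.DataFrame({
    "order_id": np.arange(n),
    "product_id": rng.integers(1, 5_000, n),
    "region": rng.choice(["Europe", "Asia", "Americas"], n),
    "country": rng.choice(["Germany", "France", "Japan", "Brazil", "USA"], n),
    "revenue": rng.uniform(5, 500, n).round(2),
    "amount": rng.integers(1, 300, n),
    "segment": rng.choice(["retail", "wholesale"], n),
    "date": pd.date_range("2024-01-01", periods=n, freq="min"),
}).to_csv("sales.csv", index=False)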

Loading Data

Loading data efficiently sets the tone for the rest of your workflow. I remember a project where the CSV file had almost 5 million rows.

Pandas handled it, but the load times were long, and the repeated reloads during testing were painful.

It was one of those moments where you wish your laptop had a "fast forward" button.

Switching to Polars and DuckDB changed everything: suddenly, I could access and manipulate the data almost instantly, which honestly made testing and iteration far more enjoyable.

With Pandas:

df_pd = pd.read_csv("sales.csv")
print(df_pd.head(3))

With Polars:

df_pl = pl.read_csv("sales.csv")
print(df_pl.head(3))

With DuckDB:

con = duckdb.connect()
df_duck = con.execute("SELECT * FROM 'sales.csv'").df()
print(df_duck.head(3))

DuckDB can query CSVs directly without loading the entire dataset into memory, which makes working with large files much easier.
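
For example, you can pull a quick summary straight from the file without materializing the full table in Python. A small sketch, reusing the con connection from above:

summary = con.execute("""
    SELECT COUNT(*) AS n_rows, SUM(revenue) AS total_revenue
    FROM 'sales.csv'
""").fetchone()
print(summary)  # only this tiny result travels back to Python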

Filtering Data

The issue here is that filtering in Pandas can be slow when dealing with millions of rows. I once needed to analyze European transactions in a huge sales dataset, and Pandas took minutes, which slowed down my analysis.

With Pandas:

filtered_pd = df_pd[df_pd.region == "Europe"]

Polars is faster and can process multiple filters efficiently:

filtered_pl = df_pl.filter(pl.col("region") == "Europe")
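
Combining several conditions is just as direct. A quick sketch, assuming a revenue column like the one in our dataset:

# Chain multiple conditions in a single filter call
filtered_pl = df_pl.filter(
    (pl.col("region") == "Europe") & (pl.col("revenue") > 100)
)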

DuckDB uses SQL syntax:

filtered_duck = con.execute("""
    SELECT *
    FROM 'sales.csv'
    WHERE region = 'Europe'
""").df()

Now you can filter large datasets in seconds instead of minutes, leaving you more time to focus on the insights that really matter.

Aggregating Large Datasets Quickly

Aggregation is often where Pandas starts to feel slow. Imagine calculating total revenue per country for a marketing report.

In Pandas:

agg_pd = df_pd.groupby("country")["revenue"].sum().reset_index()

In Polars:

agg_pl = df_pl.group_by("country").agg(pl.col("revenue").sum())

In DuckDB:

agg_duck = con.execute("""
    SELECT country, SUM(revenue) AS total_revenue
    FROM 'sales.csv'
    GROUP BY country
""").df()

I remember running this aggregation on a ten-million-row dataset. In Pandas, it took almost half an hour. Polars completed the same operation in under a minute.

The sense of relief was almost like finishing a marathon and realizing your legs still work.
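
If you want to run a rough comparison on your own data, the standard library's timer is enough. A quick sketch; the exact numbers will depend on your hardware and dataset:

import time

start = time.perf_counter()
df_pd.groupby("country")["revenue"].sum().reset_index()
print(f"Pandas: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
df_pl.group_by("country").agg(pl.col("revenue").sum())
print(f"Polars: {time.perf_counter() - start:.2f}s")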

Joining Datasets at Scale

Joining datasets is one of those things that sounds simple until you are actually knee-deep in the data.

In real projects, your data usually lives in multiple sources, so you have to combine them using shared columns like customer IDs.

I learned this the hard way while working on a project that required combining millions of customer orders with an equally large demographic dataset.

Each file was big enough on its own, but merging them felt like trying to force two puzzle pieces together while your laptop begged for mercy.

Pandas took so long that I started timing the joins the same way people time how long it takes their microwave popcorn to finish.

Spoiler: the popcorn won every time.

Polars and DuckDB gave me a way out.

With Pandas:

merged_pd = df_pd.merge(pop_df_pd, on="country", how="left")

Polars:

merged_pl = df_pl.join(pop_df_pl, on="country", how="left")

DuckDB:

merged_duck = con.execute("""
    SELECT *
    FROM 'sales.csv' s
    LEFT JOIN 'pop.csv' p
    USING (country)
""").df()

Joins on large datasets that used to freeze your workflow now run smoothly and efficiently.

Lazy Evaluation in Polars

One thing I didn't appreciate early in my data science journey was how much time gets wasted running transformations line by line.

Polars approaches this differently.

It uses a technique called lazy evaluation, which essentially waits until you've finished defining your transformations before executing any operations.

It examines the whole pipeline, determines the most efficient path, and then executes everything in one go.

It's like having a friend who listens to your whole order before walking to the kitchen, instead of one who takes each instruction individually and keeps going back and forth.

This TDS article explains lazy evaluation in depth.

Here's what the flow looks like:

Pandas:

df = df[df["amount"] > 100]
df = df.groupby("segment").agg({"amount": "mean"})
df = df.sort_values("amount")

Polars Lazy Mode:

import polars as pl

df_lazy = (
    pl.scan_csv("sales.csv")
      .filter(pl.col("amount") > 100)
      .group_by("segment")
      .agg(pl.col("amount").mean())
      .sort("amount")
)

result = df_lazy.collect()

The first time I used lazy mode, it felt strange not seeing instant results. But once I ran the final .collect(), the speed difference was obvious.

Lazy evaluation won't magically solve every performance issue, but it brings a level of efficiency that Pandas wasn't designed for.
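
If you're curious about what the optimizer actually decided, Polars can show the optimized plan before anything runs. A small sketch using the df_lazy pipeline from above:

# Inspect the optimized query plan without executing the pipeline
print(df_lazy.explain())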


Conclusion and takeaways

Working with large datasets doesn't have to feel like wrestling with your tools.

Using Polars and DuckDB showed me that the problem wasn't always the data. Sometimes, it was the tool I was using to handle it.

If there's one thing you take away from this tutorial, let it be this: you don't have to abandon Pandas, but you can reach for something better when your datasets start pushing past its limits.

Polars gives you speed as well as smarter execution, while DuckDB lets you query huge files as if they were tiny. Together, they make working with big data feel more manageable and less tiring.

If you want to go deeper into the ideas explored in this tutorial, the official documentation for Polars and DuckDB is a good place to start.
