Wednesday, October 15, 2025

Leveraging Pandas and SQL Together for Efficient Data Analysis


Image by Author | Canva

 

Pandas and SQL are both effective tools for data analysis, but what if we could combine their strengths? With pandasql, you can write SQL queries directly inside a Jupyter notebook. This integration lets us seamlessly blend SQL logic with Python for efficient data analysis.

In this article, we'll use pandas and SQL together on a data project from Uber. Let's get started!

 

What Is pandasql?

 
pandasql can be used with any DataFrame through an in-memory SQLite engine, so you can write pure SQL inside a Python environment.
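Here is a minimal sketch of that idea in action (the toy DataFrame and the query below are purely illustrative):

import pandas as pd
from pandasql import sqldf   # pip install pandasql

# Any DataFrame in scope becomes a queryable "table" in pandasql's in-memory SQLite engine
toy_df = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "fare": [12.5, 30.0, 8.0]})

# sqldf() takes a SQL string and a namespace (here locals()) used to resolve table names
print(sqldf("SELECT city, AVG(fare) AS avg_fare FROM toy_df GROUP BY city", locals()))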

 

Benefits of Using Pandas and SQL Together

 
 
 

SQL is useful for easily filtering rows, aggregating data, or applying multi-condition logic.
Python, on the other hand, offers advanced tools for statistical analysis and custom computations, as well as set-based operations that extend beyond SQL's capabilities.
When used together, SQL simplifies data selection, while Python adds analytical flexibility.

 

How to Run pandasql Inside a Jupyter Notebook

 
To run pandasql inside a Jupyter notebook, start with the following code.

import pandas as pd
from pandasql import sqldf

# Helper: run a SQL query string against any DataFrame in the global namespace
run = lambda q: sqldf(q, globals())

 

Next, you can run your SQL queries like this:

run("""
SELECT *
FROM df
LIMIT 10;
""")

 

Throughout this article, we'll show the SQL queries without repeating the run function each time.
 
 

Let's see how using SQL and pandas together works in a real-life project from Uber.

 

Real-World Project: Analyzing Uber Driver Performance Data

 

Image by Author

 

In this data project, Uber asks us to analyze driver performance data and evaluate bonus strategies.

 

// Data Exploration and Analytics

Now, let's explore the dataset. First, we'll load the data.

 

// Initial Dataset Loading

Let's load the dataset using just pandas.

import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')

 

// Exploring the Data

Now let's preview the dataset.
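A quick head() call is enough for a first look (assuming the DataFrame is named df, as loaded above):

# Show the first few rows of the dataset
df.head()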

 

The output looks like this:
 
 

Now we have a glimpse of the data.
As you can see, the dataset includes each driver's name, the number of trips they completed, their acceptance rate (i.e., the percentage of trip requests accepted), total supply hours (the total hours spent online), and their average rating.
Let's verify the column names before starting the data analysis so we can use them correctly.
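A simple way to do this is pandas' info() method, which lists every column along with its dtype and non-null count:

# List column names, dtypes, and non-null counts in one call
df.info()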

 

Here is the output.

 
 

As you can see, our dataset has 5 different columns, and there are no missing values.
Let's now answer the questions using both SQL and Python.

 

Question 1: Who Qualifies for Bonus Option 1?

 
In the first question, we're asked to determine the total bonus payout for Option 1, which is:

$50 for each driver who is online at least 8 hours, accepts 90% of requests, completes 10 trips, and has a rating of 4.7 or better during the time frame.

 

 

// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)

In this step, we'll start using pandasql.

In the following code, we select all drivers who meet the conditions for the Option 1 bonus, using the WHERE clause and the AND operator to link multiple conditions. To learn how to use WHERE and AND, refer to the documentation.

opt1_eligible = run("""
    SELECT Name                 -- keep only the name column for readability
    FROM   df
    WHERE  `Supply Hours`     >=  8
      AND  `Trips Completed`  >= 10
      AND  `Accept Rate`      >= 90
      AND  Rating             >= 4.7;
""")
opt1_eligible

 

Here is the output.

 
Output showing drivers eligible for Option 1
 

// Step 2: Finishing in Pandas

After filtering the dataset using SQL with pandasql, we switch to pandas to perform the numerical calculations and finalize the analysis. This hybrid approach, which combines SQL and Python, enhances both readability and flexibility.

Next, using the following Python code, we calculate the total payout by multiplying the number of qualifying drivers (using len()) by the $50 bonus per driver. Check the documentation to see how to use the len() function.

payout_opt1 = 50 * len(opt1_eligible)
print(f"Possibility 1 payout: ${payout_opt1:,}")

 

Here is the output.

 
 

Question 2: Calculating the Total Payout for Bonus Option 2

 
In the second question, we're asked to find the total bonus payout using Option 2:

$4/trip for all drivers who complete 12 trips and have a 4.7 or better rating.

 

 

// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)

First, we use SQL to filter for drivers who meet the Option 2 criteria: completing at least 12 trips and maintaining a rating of 4.7 or higher.

# Grab only the rows that satisfy the Option 2 thresholds
opt2_drivers = run("""
    SELECT Name,
           `Trips Completed`
    FROM   df
    WHERE  `Trips Completed` >= 12
      AND  Rating            >= 4.7;
""")
opt2_drivers.head()

 

Here's what we get.

 
 

// Step 2: Finishing the Calculation in Pure Pandas

Now let's perform the calculation using pandas. The code computes the total bonus by summing the Trips Completed column with sum() and then multiplying the result by the $4 bonus per trip.

total_trips   = opt2_drivers["Trips Completed"].sum()
option2_bonus = 4 * total_trips
print(f"Complete journeys: {total_trips},  Possibility-2 payout: ${option2_bonus}")

 

Here is the result.

 
 

Question 3: Identifying Drivers Who Qualify for Option 1 but Not Option 2

 
In the third question, we're asked to count the number of drivers who qualify for Option 1 but not for Option 2.

 

// Step 1: Building Two Eligibility Tables with SQL (pandasql)

In the following SQL code, we create two datasets: one for drivers who meet the Option 1 criteria and another for those who meet the Option 2 criteria.

# All Option 1 drivers
opt1_drivers = run("""
    SELECT Name
    FROM   df
    WHERE  `Supply Hours`     >=  8
      AND  `Trips Completed`  >= 10
      AND  `Accept Rate`      >= 90
      AND  Rating             >= 4.7;
""")

# All Option 2 drivers
opt2_drivers = run("""
    SELECT Name
    FROM   df
    WHERE  `Trips Completed` >= 12
      AND  Rating            >= 4.7;
""")

 

// Step 2: Using Python Set Logic to Spot the Difference

Next, we'll use Python to identify the drivers who appear in Option 1 but not in Option 2, using set operations.

Here is the code:

only_opt1 = set(opt1_drivers["Name"]) - set(opt2_drivers["Name"])
count_only_opt1 = len(only_opt1)

print(f"Drivers qualifying for Possibility 1 however not Possibility 2: {count_only_opt1}")

 

Here is the output.

 
 

By combining these methods, we leverage SQL for filtering and Python's set logic for comparing the resulting datasets.

 

Question 4: Finding Low-Performance Drivers with High Ratings

 
In question 4, we're asked to determine the percentage of drivers who completed fewer than 10 trips, had an acceptance rate below 90%, and still maintained a rating of 4.7 or higher.

 

// Step 1: Pulling the Subset with SQL (pandasql)

In the following code, we select all drivers who have completed fewer than 10 trips, have an acceptance rate of less than 90%, and hold a rating of at least 4.7.

low_kpi_df = run("""
    SELECT *
    FROM   df
    WHERE  `Trips Completed` < 10
      AND  `Accept Rate`     < 90
      AND  Rating            >= 4.7;
""")
low_kpi_df

 

Here is the output.

 
 

// Step 2: Calculating the Percentage in Plain Pandas

In this step, we'll use Python to calculate the percentage of such drivers.

We simply divide the number of filtered drivers by the total driver count, then multiply by 100 to get the percentage.

Here is the code:

num_low_kpi   = len(low_kpi_df)
total_drivers = len(df)
percentage    = round(100 * num_low_kpi / total_drivers, 2)

print(f"{num_low_kpi} out of {total_drivers} drivers ⇒ {percentage}%")

 

Here is the output.
 
 

Question 5: Calculating Annual Profit Without Partnering with Uber

 
In the fifth question, we need to calculate the annual profit of a taxi driver who does not partner with Uber, based on the given cost and revenue parameters.

 

// Step 1: Pulling Yearly Revenue and Expenses with SQL (pandasql)

Using SQL, we first calculate the yearly revenue from daily fares and the yearly expenses for gas, rent, and insurance. Here, $200 in daily fares over six days a week for 49 working weeks gives the revenue, while weekly gas ($200) and vehicle rent ($500) for those 49 weeks plus monthly insurance ($400) make up the expenses.

taxi_stats = run("""
SELECT
    200*6*(52-3)                      AS annual_revenue,   -- $200/day in fares, 6 days/week, 49 working weeks
    ((200+500)*(52-3) + 400*12)       AS annual_expenses   -- weekly gas + rent for 49 weeks, plus monthly insurance
""")
taxi_stats

 

Here is the output.
 
 

// Step 2: Deriving Profit and Margin with Pandas

In the next step, we'll use Python to compute the profit and margin the driver gets when not partnering with Uber.

rev  = taxi_stats.loc[0, "annual_revenue"]
cost = taxi_stats.loc[0, "annual_expenses"]

profit = rev - cost
margin = round(100 * profit / rev, 2)

print(f"Revenue  : ${rev:,}")
print(f"Expenses : ${cost:,}")
print(f"Profit   : ${profit:,}    (margin: {margin}%)")

 

Here's what we get.

 
 

Question 6: Calculating the Required Fare Increase to Maintain Profitability

 
In the sixth question, we assume that the same driver decides to buy a Town Car and partner with Uber.

Gas expenses increase by 5%, insurance decreases by 20%, and rental costs are eliminated, but the driver needs to cover the $40,000 cost of the car. We're asked to calculate how much this driver's weekly gross fares must increase in the first year to both pay off the car and keep the same annual profit margin.

 

 

// Step 1: Building the New One-Year Expense Stack with SQL

In this step, we'll use SQL to calculate the new one-year expenses, with adjusted gas and insurance, no rental fees, and the cost of the car.

new_exp = run("""
SELECT
    40000             AS car,
    200*1.05*(52-3)   AS gas,        -- +5%
    400*0.80*12       AS insurance   -- -20%
""")
# Total first-year cost: car + gas + insurance
new_cost = new_exp.sum(axis=1).iloc[0]
new_cost

 

Here is the output.
 
 

// Step 2: Calculating the Weekly Fare Increase with Pandas

Next, we use Python to calculate how much more the driver must earn per week to preserve that margin after buying the car.

# Values carried over from Question 5
old_rev    = 58800
old_profit = 19700
old_margin = old_profit / old_rev
weeks      = 49

# new_cost was calculated in the previous step (54130.0)

# We need to find the new revenue (new_rev) such that the profit margin stays the same:
# (new_rev - new_cost) / new_rev = old_margin
# Solving for new_rev gives: new_rev = new_cost / (1 - old_margin)
new_rev_required = new_cost / (1 - old_margin)

# The total increase in annual revenue needed is the difference
total_increase = new_rev_required - old_rev

# Divide by the number of working weeks to get the required weekly increase
weekly_bump = round(total_increase / weeks, 2)

print(f"Required weekly gross-fare increase = ${weekly_bump}")

 

Here's what we get.
 
 

Conclusion

 
Bringing together the strengths of SQL and Python, primarily through pandasql, we solved six different problems.

SQL helps with quick filtering and summarizing of structured datasets, while Python excels at advanced computation and dynamic manipulation.

Throughout this analysis, we leveraged both tools to simplify the workflow and make each step more interpretable.
 
 

Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


