
Pandas and SQL are each powerful for data analysis, but what if we could merge their strengths? With pandasql, you can write SQL queries directly inside a Jupyter notebook. This integration lets us seamlessly combine SQL logic with Python for effective data analysis.
In this article, we'll use pandas and SQL together on a data project from Uber. Let's get started!
# What Is pandasql?
pandasql works with any DataFrame through an in-memory SQLite engine, so you can write plain SQL inside a Python environment.
# Advantages of Using Pandas and SQL Together
SQL is great for easily filtering rows, aggregating data, or applying multi-condition logic.
Python, on the other hand, offers advanced tools for statistical analysis and custom computations, as well as set-based operations that extend beyond SQL's capabilities.
Used together, SQL simplifies data selection, while Python adds analytical flexibility.
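To make the contrast concrete, here is a minimal, self-contained sketch (the DataFrame and its values are invented purely for illustration): SQL handles the row selection, while pandas handles the follow-up computation.
import pandas as pd
from pandasql import sqldf
# Hypothetical example data, for illustration only
df = pd.DataFrame({"Name": ["A", "B", "C"], "Rating": [4.8, 4.5, 4.9]})
# SQL takes care of the selection...
high_rated = sqldf("SELECT * FROM df WHERE Rating >= 4.7", globals())
# ...and pandas takes care of the numeric follow-up
print(high_rated["Rating"].mean())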
# How to Run pandasql Inside a Jupyter Notebook
To run pandasql inside a Jupyter notebook, start with the following code.
import pandas as pd
from pandasql import sqldf
# Helper: run a SQL query against any DataFrame in the global namespace
run = lambda q: sqldf(q, globals())
Next, you can run your SQL queries like this:
run("""
SELECT *
FROM df
LIMIT 10;
""")
Throughout this article, we'll show the SQL code without repeating the run() wrapper every time.
Let's see how using SQL and pandas together works in a real-life project from Uber.
# Real-World Project: Analyzing Uber Driver Performance Data


Image by Author
In this data project, Uber asks us to analyze driver performance data and evaluate bonus strategies.
// Data Exploration and Analytics
Now, let's explore the dataset. First, we'll load the data.
// Initial Dataset Loading
Let's load the dataset using just pandas.
import pandas as pd
import numpy as np
df = pd.read_csv('dataset_2.csv')
// Exploring the Data
Now let's take a look at the dataset.
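The exploration snippet itself isn't reproduced here; a minimal sketch, assuming we simply peek at the first rows with head(), would be:
df.head()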
The output looks like this:
Now we have a glimpse of the data.
As you can see, the dataset includes each driver's name, the number of trips they completed, their acceptance rate (i.e., the percentage of trip requests accepted), total supply hours (the total hours spent online), and their average rating.
Let's verify the column names before starting the analysis so we can reference them correctly.
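Again, the exact command isn't reproduced here; a minimal sketch, assuming we want the column names, data types, and non-null counts in a single call, is info():
df.info()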
Here is the output.
As you can see, our dataset has five different columns, and there are no missing values.
Now let's answer the questions using both SQL and Python.
# Question 1: Who Qualifies for Bonus Option 1?
In the first question, we're asked to determine the total bonus payout for Option 1, which is:
$50 for each driver who is online at least 8 hours, accepts 90% of requests, completes 10 trips, and has a rating of 4.7 or better during the time frame.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
In this step, we'll start using pandasql.
In the following code, we select all drivers who meet the conditions for the Option 1 bonus, using the WHERE clause and the AND operator to link multiple conditions. To learn how to use WHERE and AND, refer to the documentation.
opt1_eligible = run("""
SELECT Name  -- keep only the name column for clarity
FROM df
WHERE `Supply Hours` >= 8
  AND `Trips Completed` >= 10
  AND `Accept Rate` >= 90
  AND Rating >= 4.7;
""")
opt1_eligible
Here is the output.
// Step 2: Finishing in Pandas
After filtering the dataset using SQL with pandasql, we switch to pandas to perform the numerical calculations and finalize the analysis. This hybrid approach, which combines SQL and Python, improves both readability and flexibility.
Next, using the following Python code, we calculate the total payout by multiplying the number of qualified drivers (using len()) by the $50 bonus per driver. Check the documentation to see how to use the len() function.
payout_opt1 = 50 * len(opt1_eligible)
print(f"Possibility 1 payout: ${payout_opt1:,}")
Here is the output.
# Question 2: Calculating the Total Payout for Bonus Option 2
In the second question, we're asked to find the total bonus payout using Option 2:
$4/trip for all drivers who complete 12 trips and have a 4.7 or better rating.
// Step 1: Filtering the Qualifying Drivers with SQL (pandasql)
First, we use SQL to filter for drivers who meet the Option 2 criteria: completing at least 12 trips and maintaining a rating of 4.7 or higher.
# Grab only the rows that satisfy the Option 2 thresholds
opt2_drivers = run("""
SELECT Name,
       `Trips Completed`
FROM df
WHERE `Trips Completed` >= 12
  AND Rating >= 4.7;
""")
opt2_drivers.head()
Here's what we get.
// Step 2: Finishing the Calculation in Pure Pandas
Now let's perform the calculation in pandas. The code computes the total bonus by summing the Trips Completed column with sum() and then multiplying the result by the $4 bonus per trip.
total_trips = opt2_drivers["Trips Completed"].sum()
option2_bonus = 4 * total_trips
print(f"Complete journeys: {total_trips}, Possibility-2 payout: ${option2_bonus}")
Here is the result.
# Question 3: Identifying Drivers Who Qualify for Option 1 but Not Option 2
In the third question, we're asked to count the number of drivers who qualify for Option 1 but not for Option 2.
// Step 1: Building Two Eligibility Tables with SQL (pandasql)
In the following SQL code, we create two datasets: one for drivers who meet the Option 1 criteria and another for those who meet the Option 2 criteria.
# All Option 1 drivers
opt1_drivers = run("""
SELECT Name
FROM df
WHERE `Supply Hours` >= 8
  AND `Trips Completed` >= 10
  AND `Accept Rate` >= 90
  AND Rating >= 4.7;
""")
# All Option 2 drivers
opt2_drivers = run("""
SELECT Name
FROM df
WHERE `Trips Completed` >= 12
  AND Rating >= 4.7;
""")
// Step 2: Using Python Set Logic to Spot the Difference
Next, we'll use Python to identify the drivers who appear in Option 1 but not in Option 2, using set operations.
Here is the code:
only_opt1 = set(opt1_drivers["Name"]) - set(opt2_drivers["Name"])
count_only_opt1 = len(only_opt1)
print(f"Drivers qualifying for Option 1 but not Option 2: {count_only_opt1}")
Here is the output.
By combining these methods, we use SQL for the filtering and Python's set logic for comparing the resulting datasets.
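As a side note, the same comparison could stay entirely in pandas; a minimal sketch, assuming the opt1_drivers and opt2_drivers frames from the previous step, filters with isin() instead of sets:
# Rows of opt1_drivers whose Name does not appear in opt2_drivers
only_opt1_df = opt1_drivers[~opt1_drivers["Name"].isin(opt2_drivers["Name"])]
len(only_opt1_df)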
# Question 4: Finding Low-Performance Drivers with High Ratings
In question 4, we're asked to determine the percentage of drivers who completed fewer than 10 trips, had an acceptance rate below 90%, and still maintained a rating of 4.7 or higher.
// Step 1: Pulling the Subset with SQL (pandasql)
In the following code, we select all drivers who have completed fewer than 10 trips, have an acceptance rate of less than 90%, and hold a rating of at least 4.7.
low_kpi_df = run("""
SELECT *
FROM df
WHERE `Trips Completed` < 10
  AND `Accept Rate` < 90
  AND Rating >= 4.7;
""")
low_kpi_df
Here is the output.
// Step 2: Calculating the Percentage in Plain Pandas
In this step, we'll use Python to calculate the percentage of such drivers.
We simply divide the number of filtered drivers by the total driver count, then multiply by 100 to get the percentage.
Here is the code:
num_low_kpi = len(low_kpi_df)
total_drivers = len(df)
percentage = round(100 * num_low_kpi / total_drivers, 2)
print(f"{num_low_kpi} out of {total_drivers} drivers ⇒ {percentage}%")
Here is the output.
# Question 5: Calculating Annual Profit Without Partnering With Uber
In the fifth question, we need to calculate the annual profit of a taxi driver who does not partner with Uber, based on the given cost and revenue parameters.
// Step 1: Pulling Yearly Revenue and Expenses with SQL (pandasql)
Using SQL, we first calculate yearly revenue from daily fares and then total the expenses for gas, rent, and insurance. The numbers in the query reflect the scenario: $200 in fares per day, six working days a week, 49 working weeks (52 minus 3 weeks off), $200 per week for gas, $500 per week for vehicle rent, and $400 per month for insurance.
taxi_stats = run("""
SELECT
    200*6*(52-3) AS annual_revenue,                 -- $200/day * 6 days * 49 weeks
    ((200+500)*(52-3) + 400*12) AS annual_expenses  -- (gas + rent) * 49 weeks + insurance * 12 months
""")
taxi_stats
Here is the output.
// Step 2: Deriving Profit and Margin with Pandas
In the next step, we'll use Python to compute the profit and margin the driver earns when not partnering with Uber.
rev = taxi_stats.loc[0, "annual_revenue"]
cost = taxi_stats.loc[0, "annual_expenses"]
profit = rev - cost
margin = round(100 * profit / rev, 2)
print(f"Revenue  : ${rev:,}")
print(f"Expenses : ${cost:,}")
print(f"Profit   : ${profit:,} (margin: {margin}%)")
Here's what we get.
# Question 6: Calculating the Required Fare Increase to Maintain Profitability
In the sixth question, we assume that the same driver decides to buy a Town Car and partner with Uber.
Gas expenses increase by 5%, insurance decreases by 20%, and rental costs are eliminated, but the driver must cover the $40,000 cost of the car. We're asked to calculate how much this driver's weekly gross fares must increase in the first year to both pay off the car and keep the same annual profit margin.
// Step 1: Building the New One-Year Expense Stack with SQL
In this step, we'll use SQL to calculate the new one-year expenses with adjusted gas and insurance costs, no rental fees, and the price of the car added in.
new_exp = run("""
SELECT
    40000           AS car,
    200*1.05*(52-3) AS gas,       -- +5%
    400*0.80*12     AS insurance  -- -20%
""")
new_cost = new_exp.sum(axis=1).iloc[0]
new_cost
Here is the output.
// Step 2: Calculating the Weekly Fare Increase with Pandas
Next, we use Python to calculate how much more the driver must earn per week to preserve that margin after buying the car.
# Values carried over from Question 5
old_rev = 58800
old_profit = 19700
old_margin = old_profit / old_rev
weeks = 49
# new_cost was calculated in the previous step (54130.0)
# We need to find the new revenue (new_rev) such that the profit margin stays the same:
# (new_rev - new_cost) / new_rev = old_margin
# Solving for new_rev gives: new_rev = new_cost / (1 - old_margin)
new_rev_required = new_cost / (1 - old_margin)
# The total increase in annual revenue needed is the difference
total_increase = new_rev_required - old_rev
# Divide by the number of working weeks to get the required weekly increase
weekly_bump = round(total_increase / weeks, 2)
print(f"Required weekly gross-fare increase = ${weekly_bump}")
Here's what we get.
# Conclusion
Bringing together the strengths of SQL and Python, primarily through pandasql, we solved six different problems.
SQL helps with quick filtering and summarizing of structured datasets, while Python excels at advanced computation and dynamic manipulation.
Throughout this analysis, we leveraged both tools to simplify the workflow and make each step more interpretable.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.