
Building a Regression Model to Predict Delivery Durations: A Practical Guide | by Jimin Kang | Dec, 2024


Data Preparation & Exploratory Analysis

Now that we’ve outlined our approach, let’s take a look at our data and what kind of features we’re working with.
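
To follow along, something like the snippet below pulls up the column summary. (A minimal sketch: the file name ‘historical_data.csv’ is an assumption for illustration; point it at wherever the raw deliveries data lives.)

import pandas as pd

# load the raw deliveries data (file path is a placeholder)
df = pd.read_csv('datasets/historical_data.csv')

# column dtypes, non-null counts, and memory usage
df.info()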

From the above, we see our data contains ~197,000 deliveries, with a variety of numeric & non-numeric features. None of the features are missing a large share of values (lowest non-null count ~181,000), so we likely won’t have to worry about dropping any features entirely.

Let’s check if our data contains any duplicated deliveries, and whether there are any observations for which we can’t compute the delivery time.

print(f"Variety of duplicates: {df.duplicated().sum()} n")

print(pd.DataFrame({'Lacking Depend': df[['created_at', 'actual_delivery_time']].isna().sum()}))

We see that all of the deliveries are unique. However, there are 7 deliveries that are missing a value for actual_delivery_time, which means we won’t be able to compute the delivery duration for those orders. Since there’s only a handful of these, we’ll remove those observations from our data.
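
Dropping those rows is a one-liner (a minimal sketch, using the columns inspected above):

# remove the handful of orders with no recorded actual_delivery_time
df = df.dropna(subset=['actual_delivery_time'])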

Now, let’s create our prediction target. We want to predict the delivery duration (in seconds), which is the elapsed time between when the customer placed the order (‘created_at’) and when they received the order (‘actual_delivery_time’).

# convert columns to datetime
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['actual_delivery_time'] = pd.to_datetime(df['actual_delivery_time'], utc=True)

# create prediction target
df['seconds_to_delivery'] = (df['actual_delivery_time'] - df['created_at']).dt.total_seconds()

The last thing we’ll do before splitting our data into train/test is check for missing values. We already saw the non-null counts for each feature above, but let’s view the proportions to get a better picture.
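
A quick way to surface those proportions (a sketch; any equivalent aggregation works):

# percentage of missing values per column, highest first
print((df.isna().mean() * 100).sort_values(ascending=False).round(2))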

We see that the market features (‘onshift_dashers’, ‘busy_dashers’, ‘outstanding_orders’) have the highest share of missing values (~8% missing). The feature with the second-highest missing rate is ‘store_primary_category’ (~2%). All other features have < 1% missing.

Since none of the features have a high missing count, we won’t remove any of them. Later on, we will look at the feature distributions to help us decide how to appropriately deal with the missing observations for each feature.

But first, let’s split our data into train/test. We will proceed with an 80/20 split, and we’ll write this test data to a separate file which we won’t touch until evaluating our final model.

from sklearn.model_selection import train_test_split
import os

# shuffle
df = df.sample(frac=1, random_state=42)
df = df.reset_index(drop=True)

# split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# write test data to a separate file
directory = 'datasets'
file_name = 'test_data.csv'
file_path = os.path.join(directory, file_name)
os.makedirs(directory, exist_ok=True)
test_df.to_csv(file_path, index=False)

Now, let’s dive into the specifics of our train data. We’ll establish our numeric & categorical features, to make it clear which columns are being referenced in later exploratory steps.

categorical_feats = [
    'market_id',
    'store_id',
    'store_primary_category',
    'order_protocol'
]

numeric_feats = [
    'total_items',
    'subtotal',
    'num_distinct_items',
    'min_item_price',
    'max_item_price',
    'total_onshift_dashers',
    'total_busy_dashers',
    'total_outstanding_orders',
    'estimated_order_place_duration',
    'estimated_store_to_consumer_driving_duration'
]

Let’s revisit the categorical features with missing values (‘market_id’, ‘store_primary_category’, ‘order_protocol’). Since there was little missing data across these features (< 3%), we will simply impute the missing values with an “unknown” category.

  • This way, we won’t have to remove data from other features.
  • Perhaps the absence of feature values holds some predictive power for delivery duration, i.e. these features are not missing at random.
  • Additionally, we will add this imputation step to our preprocessing pipeline during modeling, so that we won’t have to manually duplicate this work on our test set.

missing_cols_categorical = ['market_id', 'store_primary_category', 'order_protocol']

train_df[missing_cols_categorical] = train_df[missing_cols_categorical].fillna("unknown")

Let’s take a look at our categorical features.

pd.DataFrame({'Cardinality': train_df[categorical_feats].nunique()}).rename_axis('Feature')

Since ‘market_id’ & ‘order_protocol’ have low cardinality, we can visualize their distributions easily. On the other hand, ‘store_id’ & ‘store_primary_category’ are high-cardinality features. We’ll take a deeper look at those later.

import seaborn as sns
import matplotlib.pyplot as plt

categorical_feats_subset = [
    'market_id',
    'order_protocol'
]

# Set up the grid
fig, axes = plt.subplots(1, len(categorical_feats_subset), figsize=(13, 5), sharey=True)

# Create barplots for each variable
for i, col in enumerate(categorical_feats_subset):
    sns.countplot(x=col, data=train_df, ax=axes[i])
    axes[i].set_title(f"Frequencies: {col}")

# Adjust layout
plt.tight_layout()
plt.show()

Some key things to note:

  • ~70% of orders placed have ‘market_id’ of 1, 2, 4
  • < 1% of orders have ‘order_protocol’ of 6 or 7

Unfortunately, we don’t have any additional information about these variables, such as which ‘market_id’ values are associated with which cities/locations, and what each ‘order_protocol’ number represents. At this point, asking for more data covering this information may be a good idea, as it could help with investigating trends in delivery duration across broader region/location categorizations.

Let’s take a look at our higher-cardinality categorical features. Perhaps each ‘store_primary_category’ has an associated ‘store_id’ range? If so, we may not need ‘store_id’, as ‘store_primary_category’ would already encapsulate much of the information about the store being ordered from.

store_info = train_df[['store_id', 'store_primary_category']]

store_info.groupby('store_primary_category')['store_id'].agg(['min', 'max'])

Clearly not the case: we see that ‘store_id’ ranges overlap across levels of ‘store_primary_category’.

A quick look at the distinct values and associated frequencies for ‘store_id’ & ‘store_primary_category’ shows that these features have high cardinality and are sparsely distributed. In general, high-cardinality categorical features can be problematic in regression tasks, particularly for regression algorithms that require strictly numeric data. When these high-cardinality features are encoded, they can enlarge the feature space drastically, making the available data sparse and reducing the model’s ability to generalize to new observations in that feature space. For a better & more professional explanation of this phenomenon, you can read more about it here.
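
To make that blowup concrete, here’s a purely illustrative sketch of how wide one-hot encoding ‘store_id’ alone would make the feature space:

# one-hot encode store_id just to inspect the resulting width
encoded = pd.get_dummies(train_df['store_id'], prefix='store_id')
print(encoded.shape)  # one sparse column per distinct store_id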

Let’s get a sense of how sparsely distributed these features are.

store_id_values = train_df['store_id'].value_counts()

# Plot the histogram
plt.figure(figsize=(8, 5))
plt.bar(store_id_values.index, store_id_values.values, color='skyblue')

# Add titles and labels
plt.title('Value Counts: store_id', fontsize=14)
plt.xlabel('store_id', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()

We see that there are a handful of stores that have hundreds of orders, but the majority of them have far fewer than 100.
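
We can put a number on that (a small sketch using the counts computed above):

# share of stores with fewer than 100 orders
print(f"{(store_id_values < 100).mean():.1%} of stores have fewer than 100 orders")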

To deal with the high cardinality of ‘store_id’, we’ll create another feature, ‘store_id_freq’, that groups the ‘store_id’ values by frequency.

  • We’ll group the ‘store_id’ values into 5 different percentile bins, shown below.
  • ‘store_id_freq’ will have much lower cardinality than ‘store_id’, but will retain relevant information regarding the popularity of the store the delivery was ordered from.
  • For more inspiration behind this logic, check out this thread.

import numpy as np

def encode_frequency(freq, percentiles) -> str:
    if freq < percentiles[0]:
        return '[0-50)'
    elif freq < percentiles[1]:
        return '[50-75)'
    elif freq < percentiles[2]:
        return '[75-90)'
    elif freq < percentiles[3]:
        return '[90-99)'
    else:
        return '99+'

value_counts = train_df['store_id'].value_counts()
percentiles = np.percentile(value_counts, [50, 75, 90, 99])

# apply encode_frequency to each store_id based on its number of orders
train_df['store_id_freq'] = train_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))

pd.DataFrame({'Count': train_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')

Our encoding shows us that ~60,000 deliveries were ordered from stores categorized in the 90–99th percentile in terms of popularity, while ~12,000 deliveries were ordered from stores that were in the 0–50th percentile in popularity.

Now that we’ve (tried to) capture the relevant ‘store_id’ information in a lower dimension, let’s try to do something similar with ‘store_primary_category’.

Let’s take a look at the most popular ‘store_primary_category’ levels.
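
Something along these lines surfaces them (a sketch; the cutoff of 15 is arbitrary):

# most common store categories in the training data
print(train_df['store_primary_category'].value_counts().head(15))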

A quick glance shows us that many of these ‘store_primary_category’ levels are not exclusive to one another (ex: ‘american’ & ‘burger’). Further investigation reveals many more examples of this kind of overlap.

So, let’s try to map these distinct store categories into a few basic, all-encompassing groups.

store_category_map = {
    'american': ['american', 'burger', 'sandwich', 'barbeque'],
    'asian': ['asian', 'chinese', 'japanese', 'indian', 'thai', 'vietnamese', 'dim-sum', 'korean',
              'sushi', 'bubble-tea', 'malaysian', 'singaporean', 'indonesian', 'russian'],
    'mexican': ['mexican'],
    'italian': ['italian', 'pizza'],
}

def map_to_category_type(category: str) -> str:
    for category_type, categories in store_category_map.items():
        if category in categories:
            return category_type
    return "other"

train_df['store_category_type'] = train_df['store_primary_category'].apply(lambda x: map_to_category_type(x))

value_counts = train_df['store_category_type'].value_counts()

# Plot pie chart
plt.figure(figsize=(6, 6))
value_counts.plot.pie(autopct='%1.1f%%', startangle=90, cmap='viridis', labels=value_counts.index)
plt.title('Category Distribution')
plt.ylabel('')  # Hide y-axis label for aesthetics
plt.show()

This grouping is admittedly brutally simple, and there may very well be a better way to group these store categories. We’ll proceed with it for now for simplicity.

We’ve done a good deal of investigation into our categorical features. Now let’s look at the distributions of our numeric features.

# Create grid for boxplots
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15))  # Adjust figure size
axes = axes.flatten()  # Flatten the 5x2 axes into a 1D array for easier iteration

# Generate boxplots for each numeric feature
for i, column in enumerate(numeric_feats):
    sns.boxplot(y=train_df[column], ax=axes[i])
    axes[i].set_title(f"Boxplot for {column}")
    axes[i].set_ylabel(column)

# Remove any unused subplots (if any)
for i in range(len(numeric_feats), len(axes)):
    fig.delaxes(axes[i])

# Adjust layout for better spacing
plt.tight_layout()
plt.show()

Boxplots for a subset of our numeric features

Many of the distributions appear more right-skewed than they actually are, due to the presence of outliers.

In particular, there seems to be an order with 400+ items. This seems strange, as the next-largest order is fewer than 100 items.

Let’s look further into that 400+ item order.

train_df[train_df['total_items']==train_df['total_items'].max()]
